Posted on 2016-08-24

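path.data can be configured with multiple paths (comma-separated). The questions below assume that each configured path maps to its own physical disk.
1. If multiple paths are configured, does ES fill one disk completely before it starts writing index data to the next one?
2. If an index spans two disks, does ES have to search each disk separately at query time and then merge the results? Wouldn't that be less efficient than using a single disk?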

public static ShardPath selectNewPathForShard(NodeEnvironment env, ShardId shardId, IndexSettings indexSettings,
                                              long avgShardSizeInBytes, Map<Path,Integer> dataPathToShardCount) throws IOException {

    final Path dataPath;
    final Path statePath;

    if (indexSettings.hasCustomDataPath()) {
        dataPath = env.resolveCustomLocation(indexSettings, shardId);
        statePath = env.nodePaths()[0].resolve(shardId);
    } else {
        // Sum the usable space over all configured data paths on this node:
        BigInteger totFreeSpace = BigInteger.ZERO;
        for (NodeEnvironment.NodePath nodePath : env.nodePaths()) {
            totFreeSpace = totFreeSpace.add(BigInteger.valueOf(nodePath.fileStore.getUsableSpace()));
        }

        // TODO: this is a hack!! We should instead keep track of incoming (relocated) shards since we know
        // how large they will be once they're done copying, instead of a silly guess for such cases:

        // Very rough heuristic of how much disk space we expect the shard will use over its lifetime, the max of current average
        // shard size across the cluster and 5% of the total available free space on this node:
        BigInteger estShardSizeInBytes = BigInteger.valueOf(avgShardSizeInBytes).max(totFreeSpace.divide(BigInteger.valueOf(20)));

        // TODO - do we need something more extensible? Yet, this does the job for now...
        final NodeEnvironment.NodePath[] paths = env.nodePaths();
        NodeEnvironment.NodePath bestPath = null;
        BigInteger maxUsableBytes = BigInteger.valueOf(Long.MIN_VALUE);
        for (NodeEnvironment.NodePath nodePath : paths) {
            FileStore fileStore = nodePath.fileStore;

            BigInteger usableBytes = BigInteger.valueOf(fileStore.getUsableSpace());
            assert usableBytes.compareTo(BigInteger.ZERO) >= 0;

            // Deduct estimated reserved bytes from usable space:
            Integer count = dataPathToShardCount.get(nodePath.path);
            if (count != null) {
                usableBytes = usableBytes.subtract(estShardSizeInBytes.multiply(BigInteger.valueOf(count)));
            }
            // Keep the path with the most usable space left after the deduction:
            if (bestPath == null || usableBytes.compareTo(maxUsableBytes) > 0) {
                maxUsableBytes = usableBytes;
                bestPath = nodePath;
            }
        }

        statePath = bestPath.resolve(shardId);
        dataPath = statePath;
    }
    return new ShardPath(indexSettings.hasCustomDataPath(), dataPath, statePath, shardId);
}
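
The heuristic in the method above can be tried out without an Elasticsearch node. What follows is only a minimal sketch under stated assumptions: the class PathSelectionSketch, the method selectPath, the mount points and the sizes are all hypothetical, and plain maps stand in for NodeEnvironment.NodePath and FileStore. It reproduces just the arithmetic: estimate a shard size as the larger of the average shard size and 5% of total free space, deduct that estimate for each shard already counted against a path, and pick the path with the most space left.

```java
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;

final class PathSelectionSketch {

    /**
     * Returns the path with the most usable space left after reserving an
     * estimated shard size for every shard already counted against it.
     * Both maps are hypothetical stand-ins for the node's real data paths.
     */
    static String selectPath(Map<String, Long> usableBytesByPath,
                             Map<String, Integer> shardCountByPath,
                             long avgShardSizeInBytes) {
        BigInteger totFreeSpace = BigInteger.ZERO;
        for (long usable : usableBytesByPath.values()) {
            totFreeSpace = totFreeSpace.add(BigInteger.valueOf(usable));
        }
        // max(average shard size, 5% of total free space on the node)
        BigInteger estShardSize = BigInteger.valueOf(avgShardSizeInBytes)
                .max(totFreeSpace.divide(BigInteger.valueOf(20)));

        String bestPath = null;
        BigInteger maxUsable = null;
        for (Map.Entry<String, Long> entry : usableBytesByPath.entrySet()) {
            BigInteger usable = BigInteger.valueOf(entry.getValue());
            Integer count = shardCountByPath.get(entry.getKey());
            if (count != null) {
                usable = usable.subtract(estShardSize.multiply(BigInteger.valueOf(count)));
            }
            if (bestPath == null || usable.compareTo(maxUsable) > 0) {
                maxUsable = usable;
                bestPath = entry.getKey();
            }
        }
        return bestPath;
    }

    public static void main(String[] args) {
        final long gb = 1L << 30;
        Map<String, Long> usable = new LinkedHashMap<>();
        usable.put("/mnt/disk1", 500 * gb); // hypothetical mount points and sizes
        usable.put("/mnt/disk2", 480 * gb);

        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int shard = 0; shard < 4; shard++) {
            String chosen = selectPath(usable, counts, 10 * gb);
            counts.merge(chosen, 1, Integer::sum);
            System.out.println("shard " + shard + " -> " + chosen);
        }
    }
}
```

With these made-up numbers the estimated shard size is 49 GB (5% of 980 GB), and the four shards alternate between the two paths. In this simplified model, then, new shards are spread by remaining space rather than filling one disk before moving to the next. The sketch holds the reported free space constant, which a real node would not.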

Multiple path.data

Using multiple IO devices (by specifying multiple path.data paths) to hold the shards on your node is useful for increasing total storage space, and improving IO performance, if that's a bottleneck for your Elasticsearch usage.

With 2.0, there is an important change in how Elasticsearch spreads the IO load across multiple paths: previously, each low-level index file was sent to the best (default: most empty) path, but now that choice is made per shard instead. When a shard is allocated to the node, the node will pick which path will hold all files for that shard.

This improves resiliency to IO device failures: if one of your IO devices crashes, now you'll only lose the shards that were on it, whereas before 2.0 you would lose any shard that had at least one file on the affected device (typically this means nearly all shards on that node).
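
The difference in failure impact can be illustrated with a toy model. This is only a sketch under loud assumptions: FailureImpactSketch and both methods are hypothetical, and the round-robin placement is a stand-in chosen for simplicity, not the placement logic Elasticsearch used before or after 2.0. It only shows why a shard whose files are scattered over several devices is lost as soon as any one of them fails.

```java
import java.util.Set;
import java.util.TreeSet;

final class FailureImpactSketch {

    /** Shards lost when all files of a shard live on a single path. */
    static Set<Integer> lostPerShard(int shardCount, int pathCount, int failedPath) {
        Set<Integer> lost = new TreeSet<>();
        for (int shard = 0; shard < shardCount; shard++) {
            // Hypothetical placement: shard i lives entirely on path i % pathCount.
            if (shard % pathCount == failedPath) {
                lost.add(shard);
            }
        }
        return lost;
    }

    /** Shards lost when every file is placed on a path independently. */
    static Set<Integer> lostPerFile(int shardCount, int filesPerShard,
                                    int pathCount, int failedPath) {
        Set<Integer> lost = new TreeSet<>();
        int nextPath = 0; // hypothetical round-robin over paths, file by file
        for (int shard = 0; shard < shardCount; shard++) {
            for (int file = 0; file < filesPerShard; file++) {
                if (nextPath == failedPath) {
                    lost.add(shard); // one missing file is enough to lose the shard
                }
                nextPath = (nextPath + 1) % pathCount;
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // 6 shards, 2 data paths, path 0 fails
        System.out.println("per-shard placement loses: " + lostPerShard(6, 2, 0));
        System.out.println("per-file placement loses:  " + lostPerFile(6, 10, 2, 0));
    }
}
```

In this model, per-shard placement loses three of the six shards, while per-file placement loses all six, because every shard has at least one file on the failed path.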

Note that an OS-level RAID 0 device is also a poor choice as you'll lose all shards on that node when any device fails, since files are striped at the block level, so multiple path.data is the recommended approach.

You should still have at least 1 replica for your indices so you can recover the lost shards from other nodes without any data loss.

