3 replies
sinomall - writes code and shoots three-pointers
Upvoted by: AlixMu
The key to this problem is "divide and conquer."
Build the ES cluster as 10 instances with 32 GB each; the official docs recommend keeping each ES instance at no more than 32 GB, because GC on very large heaps is expensive.
Have the import tool split the 2 billion records into 10 parts and load them into the instances in parallel, so the whole cluster's ingest capacity is put to work (a rough sketch follows).
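A minimal sketch of that split, assuming the Python elasticsearch client; the node URLs, input file, and index name are hypothetical placeholders:

    # Sketch: split one large JSON-lines input into 10 parts and index
    # each part against a different node, so all instances ingest in parallel.
    import json
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch, helpers

    NODE_URLS = ["http://es%02d:9200" % i for i in range(1, 11)]  # 10 instances
    INPUT_FILE = "records.jsonl"  # hypothetical: one JSON document per line

    def index_partition(part):
        # Each worker takes every 10th line, offset by `part`,
        # and sends its bulks to "its own" node.
        client = Elasticsearch(NODE_URLS[part])
        def actions():
            with open(INPUT_FILE) as f:
                for i, line in enumerate(f):
                    if i % len(NODE_URLS) == part:
                        yield {"_index": "records", "_source": json.loads(line)}
        helpers.bulk(client, actions())

    if __name__ == "__main__":
        with Pool(len(NODE_URLS)) as pool:
            pool.map(index_partition, range(len(NODE_URLS)))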
Here are some indexing-optimization tips from the official site for reference:
A few tips on how to improve indexing performance
This should be fairly obvious, but use bulk indexing requests for optimal performance. Bulk sizing is dependent on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size. Document count is not a good metric for bulk size. For example, if you are indexing 1,000 documents per bulk, keep the following in mind:
[Bulk size should be set not by document count but by the total physical size of the documents. Every bulk is loaded into memory during import, so physical size directly affects ingest throughput.]
1,000 documents at 1 KB each is 1 MB.
1,000 documents at 100 KB each is 100 MB.
Those are drastically different bulk sizes. Bulks need to be loaded into memory at the coordinating node, so it is the physical size of the bulk that is more important than the document count.
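To make "size in bytes, not document count" concrete, here is a minimal batching sketch (the ~10 MB target is simply the middle of the recommended 5–15 MB range; assuming JSON-serializable documents):

    # Sketch: cut bulks by physical size (~10 MB), not by a fixed count.
    import json

    TARGET_BULK_BYTES = 10 * 1024 * 1024  # middle of the 5-15 MB range

    def bulks_by_size(docs, target=TARGET_BULK_BYTES):
        batch, batch_bytes = [], 0
        for doc in docs:
            doc_bytes = len(json.dumps(doc).encode("utf-8"))
            if batch and batch_bytes + doc_bytes > target:
                yield batch              # flush before the bulk overflows
                batch, batch_bytes = [], 0
            batch.append(doc)
            batch_bytes += doc_bytes
        if batch:
            yield batch                  # final partial bulk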
Start with a bulk size around 5–15 MB and slowly increase it until you do not see performance gains anymore. Then start increasing the concurrency of your bulk ingestion (multiple threads, and so forth).
Monitor your nodes with Marvel and/or tools such as iostat, top, and ps to see when resources start to bottleneck. If you start to receive EsRejectedExecutionException, your cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes.
[Recommendation: start with a physical bulk size of 5–15 MB and keep raising the per-bulk volume; once performance stops improving, increase the thread count until throughput peaks.]
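With the Python client, both knobs (bulk size first, then concurrency) are exposed by the helpers.parallel_bulk helper; the values below are illustrative starting points, not tuned numbers:

    # Sketch: concurrent bulk ingestion. Tune max_chunk_bytes first,
    # then raise thread_count, matching the advice above.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch("http://localhost:9200")  # placeholder address

    def ingest(actions):
        ok = failed = 0
        for success, info in helpers.parallel_bulk(
            client,
            actions,
            thread_count=4,                    # raise once bulk size is tuned
            max_chunk_bytes=10 * 1024 * 1024,  # ~10 MB physical bulks
        ):
            if success:
                ok += 1
            else:
                failed += 1  # rejections here mean the cluster can't keep up
        return ok, failed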
When ingesting data, make sure bulk requests are round-robined across all your data nodes. Do not send all requests to a single node, since that single node will need to store all the bulks in memory while processing.
[The official ingestion advice likewise boils down to "divide and conquer."]
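For the round-robin point, it is usually enough to hand the client the full list of data nodes; the Python client's connection pool rotates requests across them by default (the addresses below are placeholders):

    # Sketch: configure the client with every data node so bulk requests
    # are spread round-robin instead of piling up on a single node.
    from elasticsearch import Elasticsearch

    DATA_NODES = [
        "http://es-data-1:9200",  # placeholder addresses
        "http://es-data-2:9200",
        "http://es-data-3:9200",
    ]

    client = Elasticsearch(DATA_NODES, retry_on_timeout=True)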
There are more optimization suggestions on storage, sharding, and so on; see the link for details:
https://www.elastic.co/guide/e ... .html
sinomall - writes code and shoots three-pointers
Upvoted by:
What could be causing this?
Checking NIC traffic with iftop -N -n -i eth2.
As shown in the figure below:
sinomall - writes code and shoots three-pointers
Upvoted by: