Performance problems caused by custom document IDs when bulk importing into ES (importing close to 1 billion documents)

Elasticsearch | Author: zzz | Posted 2017-09-28 | Views: 13656

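I recently needed to import a batch of roughly 1 billion documents offline, using the bulk API.
From earlier experience with large imports, when documents carry custom IDs, import throughput drops sharply once the index reaches a certain size. This is because every inserted document is checked for an existing document with the same ID (along with a series of other operations), and on a very large index the cost of that check becomes considerable. My previous workaround was to let the system auto-generate the IDs, but that treats the symptom rather than the cause.
Is there some way, or some setting, to insert documents directly when I can guarantee that the IDs in the import are unique, so that this overhead is avoided?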

3 replies

zhengtong0898

Upvoted by: kennywu76

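The default op_type for bulk is index, which does perform the check; it also overwrites the existing document and increments its version by 1.
I don't know whether declaring op_type as create in a bulk request also triggers as many checks. But at least the existing document is not overwritten and its version is not incremented; instead a 409 status code is returned for that ID (meaning the document already exists, which probably also implies a check was performed).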

Here is my test code (Python).

import time
from pprint import pprint

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["192.168.31.13"])

# Start from a clean index; ignore the 404 raised when it does not exist yet.
es.indices.delete(index="do_bulk", ignore=[404])

if not es.indices.exists(index="do_bulk"):
    es.indices.create(
        index="do_bulk",
        body={
            "mappings": {
                "my_type": {
                    "properties": {
                        "title": {"type": "text"}
                    }
                }
            }
        }
    )
    time.sleep(5)

# First bulk request: create documents 1 and 2.
es.bulk(
    index="do_bulk",
    doc_type="my_type",
    body=[
        {"create": {"_id": 1}},
        {"title": "hello"},
        {"create": {"_id": 2}},
        {"title": "word"},
    ]
)
time.sleep(5)

# Second bulk request: IDs 1 and 2 already exist, IDs 3 and 4 are new.
s = es.bulk(
    index="do_bulk",
    doc_type="my_type",
    body=[
        {"create": {"_id": 1}},
        {"title": "hello1"},
        {"create": {"_id": 2}},
        {"title": "word2"},
        {"create": {"_id": 3}},
        {"title": "good3"},
        {"create": {"_id": 4}},
        {"title": "boy4"}
    ]
)
time.sleep(5)
pprint(s)

# Search the whole index to confirm which documents were stored.
s1 = es.search(
    index="do_bulk"
)

pprint(s1)

Output:

{'errors': True,
 'items': [{'create': {'_id': '1',
                       '_index': 'do_bulk',
                       '_type': 'my_type',
                       'error': {'index': 'do_bulk',
                                 'index_uuid': '6FZYZIZWQJC56T5GxF5dHw',
                                 'reason': '[my_type][1]: version conflict, '
                                           'document already exists (current '
                                           'version [1])',
                                 'shard': '3',
                                 'type': 'version_conflict_engine_exception'},
                       'status': 409}},
           {'create': {'_id': '2',
                       '_index': 'do_bulk',
                       '_type': 'my_type',
                       'error': {'index': 'do_bulk',
                                 'index_uuid': '6FZYZIZWQJC56T5GxF5dHw',
                                 'reason': '[my_type][2]: version conflict, '
                                           'document already exists (current '
                                           'version [1])',
                                 'shard': '2',
                                 'type': 'version_conflict_engine_exception'},
                       'status': 409}},
           {'create': {'_id': '3',
                       '_index': 'do_bulk',
                       '_shards': {'failed': 0, 'successful': 2, 'total': 2},
                       '_type': 'my_type',
                       '_version': 1,
                       'created': True,
                       'result': 'created',
                       'status': 201}},
           {'create': {'_id': '4',
                       '_index': 'do_bulk',
                       '_shards': {'failed': 0, 'successful': 2, 'total': 2},
                       '_type': 'my_type',
                       '_version': 1,
                       'created': True,
                       'result': 'created',
                       'status': 201}}],
 'took': 46}

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '2',
                    '_index': 'do_bulk',
                    '_score': 1.0,
                    '_source': {'title': 'word'},
                    '_type': 'my_type'},
                   {'_id': '4',
                    '_index': 'do_bulk',
                    '_score': 1.0,
                    '_source': {'title': 'boy4'},
                    '_type': 'my_type'},
                   {'_id': '1',
                    '_index': 'do_bulk',
                    '_score': 1.0,
                    '_source': {'title': 'hello'},
                    '_type': 'my_type'},
                   {'_id': '3',
                    '_index': 'do_bulk',
                    '_score': 1.0,
                    '_source': {'title': 'good3'},
                    '_type': 'my_type'}],
          'max_score': 1.0,
          'total': 4},
 'timed_out': False,
 'took': 11}
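
In a real import with many items per request, reading the bulk response by eye does not scale. The following is a minimal sketch of how the per-item results could be tallied. The helper name summarize_bulk_response is hypothetical (it is not part of the elasticsearch client), and it relies only on the 'items' and 'status' fields visible in the response printed above.

```python
from collections import Counter


def summarize_bulk_response(response):
    """Count bulk items per HTTP status and collect the IDs rejected with 409.

    Hypothetical helper: it only reads the 'items', 'status' and '_id'
    fields that appear in the bulk response shown in this thread.
    """
    status_counts = Counter()
    conflict_ids = []
    for item in response.get("items", []):
        # Each item is a single-key dict such as {"create": {...}}.
        result = next(iter(item.values()))
        status = result.get("status")
        status_counts[status] += 1
        if status == 409:
            conflict_ids.append(result.get("_id"))
    return dict(status_counts), conflict_ids


# Trimmed copy of the bulk response from the test run above.
sample = {
    "errors": True,
    "items": [
        {"create": {"_id": "1", "status": 409}},
        {"create": {"_id": "2", "status": 409}},
        {"create": {"_id": "3", "status": 201}},
        {"create": {"_id": "4", "status": 201}},
    ],
}

counts, conflicts = summarize_bulk_response(sample)
print(counts)     # {409: 2, 201: 2}
print(conflicts)  # ['1', '2']
```

With create, the two documents that already existed are reported as conflicts and left untouched, which matches the search result above (titles 'hello' and 'word' are unchanged).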

rockybean - Elastic Certified Engineer, ElasticStack Fans, WeChat official account: ElasticTalk

Upvoted by: zzz

I wrote an article about create index a while back; you may want to take a look:
https://elasticsearch.cn/article/285

Which are you using in your bulk requests: create, index, or update? In fact create and index cost about the same; update costs the most.
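
For reference, the three operation types differ only in the action line (and, for update, in the shape of the second line) of the bulk body. Below is a minimal sketch that builds the newline-delimited body without contacting a cluster. The function bulk_lines and its with_ids parameter are hypothetical names for illustration. Leaving out _id on an index action asks Elasticsearch to generate the ID itself, which is the workaround mentioned in the question; the sketch conservatively allows that only for index.

```python
import json


def bulk_lines(op_type, docs, with_ids=True):
    """Build an NDJSON body for the _bulk API (hypothetical helper).

    docs is an iterable of (doc_id, source) pairs. With with_ids=False the
    action line carries no _id, so the ID is left to Elasticsearch; this
    sketch permits that only for the index operation.
    """
    if op_type not in ("index", "create", "update"):
        raise ValueError("unsupported op_type: %r" % op_type)
    if not with_ids and op_type != "index":
        raise ValueError("this sketch only omits _id for 'index'")

    lines = []
    for doc_id, source in docs:
        meta = {"_id": doc_id} if with_ids else {}
        lines.append(json.dumps({op_type: meta}))
        if op_type == "update":
            # update sends a partial document wrapped in "doc".
            lines.append(json.dumps({"doc": source}))
        else:
            lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = [("1", {"title": "hello"})]
print(bulk_lines("index", docs, with_ids=False), end="")
print(bulk_lines("create", docs), end="")
print(bulk_lines("update", docs), end="")
```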

白衬衣 - 金桥

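There doesn't seem to be such a setting, but there is an approach you could try.
For example, split the 1 billion records into 10 sets of 100 million rows, write them into 10 indices, and once all 10 indices are written, reindex them all into a single index. I'm not sure whether this approach is feasible.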
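
A rough sketch of what that split could look like, under stated assumptions: the staging index names (do_bulk_part_0 to do_bulk_part_9), the destination name, and both helper functions are hypothetical, and documents are assigned to a staging index by a hash of their ID rather than by row ranges. The second function only builds a request body of the shape accepted by the _reindex API; it does not send anything, and it says nothing about whether the approach is actually faster.

```python
import zlib


def staging_index(doc_id, prefix="do_bulk_part", parts=10):
    """Pick a staging index from a stable hash of the document ID.

    Hypothetical helper: crc32 is used so the same ID always maps to the
    same staging index across runs and machines.
    """
    bucket = zlib.crc32(str(doc_id).encode("utf-8")) % parts
    return "%s_%d" % (prefix, bucket)


def reindex_body(prefix="do_bulk_part", parts=10, dest="do_bulk"):
    """Build a _reindex request body that merges all staging indices."""
    return {
        "source": {"index": ["%s_%d" % (prefix, i) for i in range(parts)]},
        "dest": {"index": dest},
    }


# The same ID always lands in the same staging index.
print(staging_index(12345) == staging_index("12345"))  # True
print(reindex_body(parts=3))
```

The body returned by reindex_body could then be passed to the client's reindex call once all staging indices have been written.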
