
elasticsearch-hadoop: why does importing data from Hive into ES keep failing with a version conflict?

Elasticsearch | Author: Paner | Posted 2017-08-09 | Views: 7020

The error message is as follows:
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [10.200.5.146:9200] returned Conflict(409) - [user_type][20841213]: version conflict, current [38], provided [37]; Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:207)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:170)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:225)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:248)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:201)
at org.elasticsearch.hadoop.rest.RestRepository.writeProcessedToIndex(RestRepository.java:179)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:63)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:751)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.UnionOperator.process(UnionOperator.java:148)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
... 9 more
 
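Configuration: the map split size is 10 GB. I first suspected that concurrent tasks were the cause, but lowering the thread count did not help. I then searched online and added some settings that are supposed to ignore the version number, yet the job still frequently hits version conflicts. If anyone has run into the same problem, please help me out. Thanks.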
 
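Additional information:
The index receives both new documents and updates to existing ones, so I set es.write.operation to upsert. Yesterday I found that after switching to the default, index, the same job ran without any conflict. Here are the official descriptions of the two operations: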
index (default)
new data is added while existing data (based on its id) is replaced (reindexed).
upsert
known as merge or insert if the data does not exist, updates if the data exists (based on its id).
Even after reading these, I still don't quite understand the difference between the two.
 
 

medcl - Fighting tigers tonight.


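Is this a bulk import? Does it include update operations as well?
How exactly are you importing the data? Please describe the whole process.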

zyb1994111


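Put simply, assuming you specify an _id: index re-imports the data and overwrites the existing document, whereas upsert updates the existing document. If the _id already exists the document is updated, and if it does not exist the document is added.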
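
To make the difference concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Elasticsearch or elasticsearch-hadoop API: `ToyIndex`, `VersionConflict`, and the `read` parameter are hypothetical names made up for illustration. It assumes that an update is a read-merge-write that is rejected when the document version changed after the read, which is one plausible way to end up with a message like "current [38], provided [37]" when several tasks write the same _id, while a plain index write just replaces the document.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale document version."""


class ToyIndex:
    """Hypothetical in-memory stand-in for one index: _id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        # A missing document is reported as version 0 with an empty source.
        return self.docs.get(doc_id, (0, {}))

    def index(self, doc_id, source):
        # "index": store the document as given, replacing any existing one.
        # No version check is made, so this write never conflicts here.
        version = self.get(doc_id)[0] + 1
        self.docs[doc_id] = (version, dict(source))
        return version

    def upsert(self, doc_id, partial, read=None):
        # "upsert": insert if missing, otherwise merge into the existing doc.
        # `read` lets the caller pass an earlier, possibly stale, read to
        # imitate another writer sneaking in between the read and the write.
        seen_version, seen_source = read if read is not None else self.get(doc_id)
        current_version = self.get(doc_id)[0]
        if seen_version != current_version:
            raise VersionConflict(
                f"current [{current_version}], provided [{seen_version}]")
        self.docs[doc_id] = (current_version + 1, {**seen_source, **partial})
        return current_version + 1


es = ToyIndex()
es.index("20841213", {"name": "a", "level": 1})   # version 1
es.upsert("20841213", {"level": 2})               # version 2, fields merged
after_upsert = es.get("20841213")[1]

es.index("20841213", {"level": 3})                # version 3, whole doc replaced
after_index = es.get("20841213")[1]

stale = es.get("20841213")                        # writer B reads version 3
es.upsert("20841213", {"level": 4})               # writer A moves it to version 4
try:
    es.upsert("20841213", {"level": 5}, read=stale)
    conflict = None
except VersionConflict as exc:
    conflict = str(exc)                           # writer B is rejected
```

In this model `after_upsert` still contains `name`, while `after_index` has lost it, and the last upsert fails because it was based on version 3 while the document was already at version 4. If the real job behaves the same way, the es-hadoop configuration reference lists an `es.update.retry.on.conflict` setting (how many times an update is retried on conflict) that may be worth checking. Whether raising it suits this particular job is untested here.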
