The Elasticsearch cluster holds 10+ billion documents and was running version 1.5.
We are upgrading to 1.7 using the restart-upgrade (full cluster restart) approach.
After the restart all primaries recovered quickly and most replicas did too, but one replica has been stuck in the INITIALIZING state ever since.
The sequence of events:
the primary is on ndb4; the replica was first allocated to ndb6, where the exception log below appeared.
The replica was then automatically moved to ndb7, where the same exception appeared.
...
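(For reference, the per-shard placement and state described above can be read straight from the `_cat/shards` API. Below is a minimal sketch in Java; `localhost:9200` is a placeholder for any node in the cluster and `mgobject0` is the index name from the logs.)

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Minimal sketch: list every shard copy of the index together with the node it
// sits on and its state (STARTED / INITIALIZING / UNASSIGNED).
// "localhost:9200" is a placeholder; point it at any node of the cluster.
public class ListShards {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/_cat/shards/mgobject0?v");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```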
------
[2015-12-16 05:42:55,400][WARN ][indices.cluster ] [ndb6] [[mgobject0][10]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [mgobject0][10]: Recovery failed from [ndb4][NJzaKTz4QWiRwf05pcjnFw][ndb4][inet[/192.168.40.34:9300]] into [ndb6][N-QZE3-ATNeaHdsnuovq2A][ndb6][inet[/192.168.40.36:9300]]
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:280)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:70)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:567)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [ndb4][inet[/192.168.40.34:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [mgobject0][10] Phase[1] Execution failed
at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:883)
at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:780)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [mgobject0][10] Failed to transfer [819] files with total size of [808.5gb]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:431)
at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:878)
... 10 more
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index/_4iw9.cfs")
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:189)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:160)
at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.doRun(RecoverySourceHandler.java:312)
... 4 more
Caused by: java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
... 6 more
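The root cause at the bottom of this trace is an operating-system level read error (`Input/output error` from `pread`) while the source node ndb4 streams the primary's copy of `_4iw9.cfs` to the recovery target, which points at a bad disk or filesystem region rather than at Elasticsearch itself. One way to confirm that outside Elasticsearch is to read the file end-to-end through a `FileChannel`, the same call path `NIOFSDirectory` uses. A minimal sketch, meant to be run on ndb4 against the path from the trace (checking dmesg and SMART for that disk is also worthwhile):

```java
import java.io.File;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch: read the suspect segment file end-to-end the same way
// NIOFSDirectory does (positional FileChannel reads). If the disk has a bad
// region, this should fail with the same "Input/output error" at some offset.
// The path is the one reported in the stack trace; run this on ndb4.
public class ReadSegmentFile {
    public static void main(String[] args) throws Exception {
        File f = new File(
            "/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index/_4iw9.cfs");
        ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MB chunks
        long offset = 0;
        try (FileChannel ch = new FileInputStream(f).getChannel()) {
            while (offset < ch.size()) {
                buf.clear();
                int n = ch.read(buf, offset);   // positional read, like pread()
                if (n < 0) break;
                offset += n;
            }
            System.out.println("read " + offset + " of " + ch.size() + " bytes OK");
        } catch (Exception e) {
            System.err.println("read failed at offset " + offset + ": " + e);
        }
    }
}
```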
---------
org.elasticsearch.indices.recovery.RecoveryFailedException: [mgobject0][2]: Recovery failed from [ndb7][D2VJf9kxQl6_Vma1eYRcng][ndb7][inet[/192.168.40.37:9300]] into [ndb4][NJzaKTz4QWiRwf05pcjnFw][ndb4][inet[/192.168.40.34:9300]] (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 5 more
---------
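(The `no activity after [30m]` failure above is the recovery monitor on the target node cancelling a copy that reported no progress for 30 minutes; with an ~800 GB shard this can also just mean the transfer is throttled or very slow. One way to tell a genuinely stalled recovery from a merely slow one is to poll `_cat/recovery` and watch whether the file/byte counters still move. A minimal sketch, again with a placeholder host:)

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Minimal sketch: print the recovery table for the index once a minute.
// If the numbers stop changing between iterations the recovery is stalled
// (and will eventually be cancelled by the 30m activity monitor); if they
// keep creeping up it is merely slow. "localhost:9200" is a placeholder.
public class WatchRecovery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/_cat/recovery/mgobject0?v");
        while (true) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            System.out.println("----");
            Thread.sleep(60 * 1000L);
        }
    }
}
```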
Shards that failed with this timeout were automatically reallocated afterwards and are now in good shape. The one replica that still cannot be allocated is the one failing with the exception described in the first post.
If that shard's data really is corrupted, is there any way to repair it? Something like a Lucene repair approach?
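On the "Lucene repair" question: Lucene ships a CheckIndex tool (in lucene-core, which Elasticsearch already bundles) that scans a shard's index directory and can drop the segments it cannot read so the remainder becomes usable again; whatever documents live in the dropped segments are lost for good, so it is a last resort, to be run with the node shut down and only after backing up the directory. A minimal, hedged sketch against the shard path from the trace, assuming the Lucene 4.10.x API that ES 1.7 ships (method names should be double-checked against the exact bundled version):

```java
import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

// Hedged sketch of Lucene's CheckIndex against the shard directory from the
// trace (Lucene 4.10.x, as bundled with Elasticsearch 1.7). BACK UP the
// directory and stop the node first: fixIndex() rewrites the segments file and
// permanently drops any segment it cannot read, so documents in _4iw9.cfs
// would be lost. The equivalent command-line form is roughly:
//   java -cp lucene-core-4.10.4.jar org.apache.lucene.index.CheckIndex <dir> -fix
public class RepairShard {
    public static void main(String[] args) throws Exception {
        File shardIndex = new File(
            "/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index");
        try (FSDirectory dir = FSDirectory.open(shardIndex)) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);
            CheckIndex.Status status = checker.checkIndex(); // read-only scan
            if (!status.clean) {
                // Drops the unreadable segments; everything else stays searchable.
                checker.fixIndex(status);
            }
        }
    }
}
```

If the corrupt copy were only a replica, it would be simpler to delete that copy and let it re-replicate from the primary; but the read error in the first post appears to be on the primary's own files on ndb4, so the realistic options are probably CheckIndex (accepting the loss of the bad segment), restoring the index from a snapshot, or reindexing that data from the source.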