
elasticsearch data node drops out of and then rejoins the cluster, with error: java.nio.channels.ClosedChannelException: null

Elasticsearch | Author: WeiZhixiong | Published 2019-09-22 | Views: 7774

Hi, I've run into a problem and am hoping for some help.
After the elasticsearch data node drops out of the cluster, it usually rejoins within 3 minutes; the data node shows no obvious anomaly in CPU or memory.
elasticsearch version: 6.6.0
  
data node error log:
[2019-09-21T23:27:55,201][WARN ][o.e.t.TcpTransport       ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.180:57932}]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:?]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:?]
at sun.nio.ch.IOUtil.write(IOUtil.java:51) ~[?:?]
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:405) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:938) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:360) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:905) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1396) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1152) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1075) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-09-21T23:27:55,207][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.180:57932}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,207][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.180:57932}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,208][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.180:57932}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,211][ERROR][o.e.x.m.c.n.NodeStatsCollector] [10.205.41.183] collector [node_stats] timed out when collecting data
[2019-09-21T23:27:55,203][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.134:36114}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,214][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.134:36114}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,214][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.134:36114}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-09-21T23:27:55,224][WARN ][o.e.t.TcpTransport ] [10.205.41.183] send message failed [channel: Netty4TcpChannel{localAddress=/10.205.41.183:9300, remoteAddress=/10.205.41.182:44826}]
java.nio.channels.ClosedChannelException: null


 
master node error log:
[2019-09-21T23:23:35,672][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [10.205.41.134] collector [cluster_stats] timed out when collecting data
[2019-09-21T23:23:45,817][ERROR][o.e.x.m.c.i.IndexStatsCollector] [10.205.41.134] collector [index-stats] timed out when collecting data
[2019-09-21T23:23:47,328][WARN ][o.e.c.InternalClusterInfoService] [10.205.41.134] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-09-21T23:24:05,673][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [10.205.41.134] collector [cluster_stats] timed out when collecting data
[2019-09-21T23:24:15,726][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [10.205.41.134] collector [index_recovery] timed out when collecting data
[2019-09-21T23:24:25,768][ERROR][o.e.x.m.c.i.IndexStatsCollector] [10.205.41.134] collector [index-stats] timed out when collecting data
[2019-09-21T23:24:32,329][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [10.205.41.134] failed to execute on node [m_-YY2D6T365AB2JBPhKkw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.205.41.183][10.205.41.183:9300][cluster:monitor/nodes/stats[n]] request_id [635788802] timed out after [15010ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1011) [elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.0.jar:6.6.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-09-21T23:24:45,675][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [10.205.41.134] collector [cluster_stats] timed out when collecting data
[2019-09-21T23:24:47,330][WARN ][o.e.c.InternalClusterInfoService] [10.205.41.134] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2019-09-21T23:24:55,725][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [10.205.41.134] collector [index_recovery] timed out when collecting data
[2019-09-21T23:25:05,757][ERROR][o.e.x.m.c.i.IndexStatsCollector] [10.205.41.134] collector [index-stats] timed out when collecting data
[2019-09-21T23:25:06,841][INFO ][o.e.c.r.a.AllocationService] [10.205.41.134] updating number_of_replicas to [13] for indices [.security-6]
[2019-09-21T23:25:07,184][INFO ][o.e.c.s.MasterService ] [10.205.41.134] zen-disco-node-failed({10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-09-21T23:25:13,910][INFO ][o.e.c.s.ClusterApplierService] [10.205.41.134] removed {{10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {10.205.41.134}{3FErqVuXS0yp2FteuAYEoA}{j8bLjZqARJ6SimM07KR5Aw}{10.205.41.134}{10.205.41.134:9300}{ml.machine_memory=33567162368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [362800] source [zen-disco-node-failed({10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{10.205.41.183}{m_-YY2D6T365AB2JBPhKkw}{rAxwqH7RSkK3rMR-XOa_nA}{10.205.41.183}{10.205.41.183:9300}{ml.machine_memory=67386773504, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]]])
[2019-09-21T23:25:15,411][DEBUG][o.e.a.a.i.r.TransportRecoveryAction] [10.205.41.134] failed to execute [indices:monitor/recovery] on node [m_-YY2D6T365AB2JBPhKkw]
org.elasticsearch.transport.NodeDisconnectedException: [10.205.41.183][10.205.41.183:9300][indices:monitor/recovery[n]] disconnected
[2019-09-21T23:25:15,415][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [10.205.41.134] failed to execute [indices:monitor/stats] on node [m_-YY2D6T365AB2JBPhKkw]
org.elasticsearch.transport.NodeDisconnectedException: [10.205.41.183][10.205.41.183:9300][indices:monitor/stats[n]] disconnected
[2019-09-21T23:25:15,418][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [10.205.41.134] failed to execute on node [m_-YY2D6T365AB2JBPhKkw]
org.elasticsearch.transport.NodeDisconnectedException: [10.205.41.183][10.205.41.183:9300][cluster:monitor/stats[n]] disconnected
[2019-09-21T23:25:15,424][WARN ][o.e.x.m.MonitoringService] [10.205.41.134] monitoring execution failed

micmouse521

Upvoted by: WeiZhixiong

discovery.zen.ping_timeout: 30s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 10s
These parameters can be raised to allow longer timeouts.
Also, don't configure the bulk thread pool too large.
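For context, these are the Zen discovery fault-detection settings in elasticsearch.yml on 6.x. Below is a commented sketch of the values suggested above (example values, tune them for your cluster); the master log's "failed to ping, tried [3] times, each with maximum [30s] timeout" appears to correspond to the fd defaults that these settings would raise:

# elasticsearch.yml (ES 6.x Zen discovery) -- example values, not a recommendation
discovery.zen.ping_timeout: 30s        # how long discovery pings wait during master election / node join
discovery.zen.fd.ping_timeout: 60s     # fault-detection ping timeout between master and nodes
discovery.zen.fd.ping_retries: 6       # failed pings tolerated before the node is dropped from the cluster
discovery.zen.fd.ping_interval: 10s    # how often fault-detection pings are sent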

locatelli

Upvoted by:

From the symptoms it may be caused by high load on the cluster. If you have monitoring, keep an eye on CPU/memory changes.
You can also confirm via the _cat APIs, for example whether there is high heap usage and the like.
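For example, a quick way to eyeball this from Kibana Dev Tools (or curl against any node) is a couple of _cat requests; the exact column lists below are just a suggestion:

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,node.role,master
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

A persistently high heap.percent or growing rejected counts on the write pool would point to the node being overloaded rather than a pure network problem.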

hanxiaobei - post-90s newbie

Upvoted by:

@WeiZhixiong I'm seeing the same situation here. How did you end up solving it? What was the cause?

shwtz - an IT guy who studied physics and wants to be an actor

Upvoted by:

I also run into this problem often.
 
Sometimes the node turns out to be fine, though, and it doesn't seem related to load either.
Whenever this node is started, this situation sometimes occurs.

WeiZhixiong

Upvoted by:

It may be because a setting like the one below was configured, which made GC times grow longer.
You can try removing that kind of setting.
If performance is an issue, I'd suggest upgrading to ES 7.4, which brings a fairly good performance improvement.
thread_pool.write.queue_size: 1000
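One way to check whether long GC pauses really are the trigger (my own suggestion, not part of the original answer): compare old-generation GC time across nodes via the node stats API, and look for [gc] overhead warnings from o.e.m.j.JvmGcMonitorService in the data node's log around the time it drops out.

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors

A large or rapidly growing old-collection collection_time_in_millis on 10.205.41.183 relative to the other nodes would support the GC explanation.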
