
Load on ES cluster data nodes suddenly spiked, search requests rejected

Elasticsearch | Author: ESer | Published 2019-01-18 | Views: 5989

A 12-node cluster on six machines, two nodes per machine. Each machine has 128 GB of RAM and 32 cores. Each ES instance is allocated 31 GB of memory.

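Last night the search thread pool on every data node was exhausted and its queue filled up, so ES threw EsRejectedExecutionException and rejected search requests. The search thread pool size is 49 and the queue size is 1000. During the incident there was no unusually large volume of search requests, and the load was not high.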
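To illustrate how a fixed pool with a bounded queue can start rejecting work even without a surge in traffic, here is a minimal sketch using the JDK's `ThreadPoolExecutor` as a stand-in. It is not ES's own executor, and the class name `BoundedPoolSketch` and the method `countRejected` are made up for this example. The assumption is simply that every running task stalls, so all threads and queue slots stay occupied and every further submission is rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; a JDK analogy, not Elasticsearch code.
class BoundedPoolSketch {

    // Submits `tasks` stalled tasks to a fixed-size pool with a bounded queue
    // and returns how many submissions were rejected.
    static int countRejected(int threads, int queueSize, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize)); // default handler throws on overflow
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a search that does not finish
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // no free thread and no free queue slot
                }
            }
        } finally {
            release.countDown(); // let the stalled tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Figures from the post: 49 search threads, queue size 1000.
        System.out.println(countRejected(49, 1000, 1100));
    }
}
```

With the figures from the post, the first 49 + 1000 submissions are accepted and everything after that is rejected, regardless of how slowly the requests arrive. What matters is how long each task holds its slot, not the request rate.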

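I have been investigating this all day and found one exception, though I am not sure whether it is related to the problem. The exception is as follows: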
[2019-01-17 15:49:45,833][DEBUG][action.admin.cluster.node.stats] [http_node] failed to execute on node [TcpC3thETgSsGwbMZiM01A]
RemoteTransportException[[Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats]]]; nested: TransportSerializationException[Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats]]; nested: EOFException;
Caused by: TransportSerializationException[Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats]]; nested: EOFException;
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:152)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:124)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.

JackGe

Upvoted by: ESer laoyang360

The stack trace looks incomplete. I also ran into an EOFException on ES 2.x; the stack trace at the time was:
Caused by: java.io.EOFException
at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
at org.elasticsearch.common.io.stream.FilterStreamInput.readByte(FilterStreamInput.java:39)
at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:254)
at org.elasticsearch.index.search.stats.SearchStats.readFrom(SearchStats.java:311)
at org.elasticsearch.index.search.stats.SearchStats.readSearchStats(SearchStats.java:299)
at org.elasticsearch.action.admin.indices.stats.CommonStats.readFrom(CommonStats.java:527)
at org.elasticsearch.action.admin.indices.stats.CommonStats.readCommonStats(CommonStats.java:484)
at org.elasticsearch.action.admin.indices.stats.ShardStats.readFrom(ShardStats.java:97)
at org.elasticsearch.action.admin.indices.stats.ShardStats.readShardStats(ShardStats.java:90)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.readShardResult(TransportIndicesStatsAction.java:80)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.readShardResult(TransportIndicesStatsAction.java:1)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.readFrom(TransportBroadcastByNodeAction.java:544)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:187)
... 23 more
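When NodeStats is read, the counters are stored as long values. When a counter overflows and turns negative, StreamOutput's writeVLong writes 10 bytes, whereas StreamInput's readVLong reads only 9, so one byte is left unread. ES then performs the following check: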
public byte readByte() throws IOException {
    int ch = is.read();
    if (ch < 0)
        throw new EOFException();
    return (byte) (ch);
}
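That is, when is.read() returns a value less than 0 (end of stream), an EOFException is thrown.
This can be fixed by modifying readVLong so that it can also read the 10th byte. From your description the search queue was full, which may have caused the long rejected counter in ThreadPoolStats to overflow.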
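The 10-byte versus 9-byte mismatch can be reproduced outside ES with a small sketch. This is not the ES source: `VLongSketch` and its two methods are hypothetical names, and the encoding assumed here is the usual variable-length scheme (7 payload bits per byte, high bit set when more bytes follow). The `maxBytes` parameter is only there so the same reader can mimic a 9-byte limit and the proposed 10-byte fix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical demo class; illustrates the encoding, not the ES implementation.
class VLongSketch {

    // Encodes 7 bits per byte; the high bit marks "more bytes follow".
    // A negative long has its top bit set, so it needs 10 bytes.
    static byte[] writeVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Decodes a value but never consumes more than maxBytes bytes.
    static long readVLong(ByteArrayInputStream in, int maxBytes) {
        long result = 0;
        for (int i = 0; i < maxBytes; i++) {
            int b = in.read();
            if (b < 0) {
                throw new IllegalStateException("unexpected end of stream");
            }
            result |= (b & 0x7FL) << (7 * i);
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] encoded = writeVLong(-1L); // stands in for an overflowed counter
        System.out.println("encoded length: " + encoded.length);

        ByteArrayInputStream in = new ByteArrayInputStream(encoded);
        long nine = readVLong(in, 9);
        System.out.println("9-byte read: " + nine + ", bytes left: " + in.available());

        long ten = readVLong(new ByteArrayInputStream(encoded), 10);
        System.out.println("10-byte read: " + ten);
    }
}
```

With a 9-byte limit the reader returns a wrong value and leaves one byte in the stream, so every later field is decoded from the wrong offset. That misalignment would be consistent with a later read running off the end of the buffer and hitting the EOFException above. With a 10-byte limit the value round-trips.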
 

kennywu76 - Wood


Which version is the cluster running? If it is older than 5.1.2, see this question: https://elasticsearch.cn/question/1716
