Elasticsearch cluster went down for no apparent reason

Elasticsearch | Author: osborn | Posted 2018-06-07 | Views: 10811

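The cluster runs on Alibaba Cloud virtual machines: 9 hosts with 32 cores each, running a dozen or so nodes in total, 3 of them master-eligible. This morning, while traffic was very low, the cluster went down on its own. Every shard was unassigned. The es processes were still running, but calls to the _cat APIs were extremely slow or got no response at all. The host running the elected master was under very high load.
Checking the logs, I found that many data nodes had started logging GC messages in the middle of the night, which I had never seen in the logs on normal days: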
[2018-06-07 01:30:04,203][WARN ][monitor.jvm              ] [iz2ze1qzjeqgsoycwvxov0z_1] [gc][young][4094764][158050] duration [1.4s], collections [1]/[2.2s], total [1.4s]/[58.4m], memory [4.5gb]->[3.4gb]/[9.8gb], all_pools {[young] [1.4gb]->[18.2mb]/[1.4gb]}{[survivor] [177.7mb]->[191.3mb]/[191.3mb]}{[old] [2.9gb]->[3.2gb]/[8.1gb]}
[2018-06-07 01:30:34,646][WARN ][monitor.jvm              ] [iz2ze1qzjeqgsoycwvxov0z_1] [gc][young][4094794][158052] duration [1s], collections [1]/[1.1s], total [1s]/[58.5m], memory [4.9gb]->[3.6gb]/[9.8gb], all_pools {[young] [1.4gb]->[16.5kb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [3.3gb]->[3.4gb]/[8.1gb]}
Roughly one such entry was logged every few tens of seconds.
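
To make these warnings easier to track over time, here is a minimal sketch of a parser for the `[monitor.jvm]` lines quoted above. The regular expression is written against these two sample lines only, so the layout it assumes may not hold for other Elasticsearch versions. The names `parse_gc_line`, `gc_fraction` and `old_used_fraction` are made up for this illustration.

```python
import re

# Matches the [monitor.jvm] GC warning layout seen in the quoted log lines.
GC_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[monitor\.jvm\s*\] "
    r"\[(?P<node>[^\]]+)\] \[gc\]\[(?P<kind>\w+)\]\[\d+\]\[\d+\] "
    r"duration \[(?P<duration>[^\]]+)\], "
    r"collections \[\d+\]/\[(?P<interval>[^\]]+)\].*"
    r"\{\[old\] \[(?P<old_before>[^\]]+)\]->\[(?P<old_after>[^\]]+)\]"
    r"/\[(?P<old_max>[^\]]+)\]\}"
)

SIZE_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
TIME_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def _to_number(value, units):
    """Convert strings such as '1.4s' or '3.2gb' using the given unit table."""
    number, unit = re.fullmatch(r"([\d.]+)([a-z]+)", value).groups()
    return float(number) * units[unit]


def parse_gc_line(line):
    """Return a dict of GC metrics, or None if the line is not a GC warning."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    duration = _to_number(m["duration"], TIME_UNITS)
    interval = _to_number(m["interval"], TIME_UNITS)
    old_after = _to_number(m["old_after"], SIZE_UNITS)
    old_max = _to_number(m["old_max"], SIZE_UNITS)
    return {
        "timestamp": m["ts"],
        "node": m["node"],
        "kind": m["kind"],
        "duration_s": duration,
        # Share of the sampling interval that was spent in GC.
        "gc_fraction": duration / interval,
        # Old generation occupancy after the collection.
        "old_used_fraction": old_after / old_max,
    }


# First GC line from the post, rejoined onto a single line.
SAMPLE = (
    "[2018-06-07 01:30:04,203][WARN ][monitor.jvm              ] "
    "[iz2ze1qzjeqgsoycwvxov0z_1] [gc][young][4094764][158050] duration [1.4s], "
    "collections [1]/[2.2s], total [1.4s]/[58.4m], memory [4.5gb]->[3.4gb]/[9.8gb], "
    "all_pools {[young] [1.4gb]->[18.2mb]/[1.4gb]}"
    "{[survivor] [177.7mb]->[191.3mb]/[191.3mb]}{[old] [2.9gb]->[3.2gb]/[8.1gb]}"
)

if __name__ == "__main__":
    print(parse_gc_line(SAMPLE))
```

In the two quoted lines the old generation holds about 3.2gb to 3.4gb of 8.1gb after the collection. So at least in these samples the slow young collections do not coincide with a nearly full heap.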

Then the master node first logged many errors like these:
[2018-06-07 00:25:02,513][DEBUG][action.admin.cluster.stats] [iz2ze1qzjeqgsoycwvxov0z_master] failed to execute on node [7l05rRHmQeKQ4k6QUFRu3A]
RemoteTransportException[[xxx.xx.xx.xx][xxx.xx.xx.xx:9300][cluster:monitor/stats[n]]]; nested: AlreadyClosedException[this IndexReader is closed];
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed

[2018-06-07 02:35:47,285][DEBUG][action.admin.cluster.stats] [iz2ze1qzjeqgsoycwvxov0z_master] failed to execute on node [7l05rRHmQeKQ4k6QUFRu3A]
RemoteTransportException[[xxx.xx.xx.xx][xxx.xx.xx.xx:9300][cluster:monitor/stats[n]]]; nested: AlreadyClosedException[this IndexReader cannot be used anymore as one of its child readers was closed];
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader cannot be used anymore as one of its child readers was closed

After that it started logging timeout errors:
[2018-06-07 09:12:40,720][WARN ][transport                ] [iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has timed out, sent [20323ms] ago, timed out [5323ms] ago, action [cluster:monitor/nodes/stats[n]], node [{iz2ze1qzjeqgsoycwvxouzz}{qT0-LDQVRMu6t36N5regGg}{xxx.xx.xx.xx}{xxx.xx.xx.xx:9300}{master=false}], id [300305326]
[2018-06-07 09:12:41,787][WARN ][transport                ] [iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has timed out, sent [30690ms] ago, timed out [690ms] ago, action [internal:discovery/zen/fd/ping], node [{iz2ze1qzjeqgsoycwvxov0z_2}{bbdJtx1DRCqQKSuG5vnDTg}{xxx.xx.xx.xx}{xxx.xx.xx.xx:9302}{master=false}], id [300305119]
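
Each of these warnings carries enough information to work out which timeout was exceeded. The time since the request was sent, minus the time since it timed out, is the timeout that applied. The sketch below does that subtraction for the two lines above. The function name `implied_timeouts` is made up for this illustration, and the pattern is written against these two lines only.

```python
import re

# Pulls the two elapsed times and the action out of a "timed out" warning.
TIMED_OUT = re.compile(
    r"sent \[(?P<sent>\d+)ms\] ago, timed out \[(?P<late>\d+)ms\] ago, "
    r"action \[(?P<action>.+?)\], node"
)

# The two transport warnings from the post, rejoined onto single lines.
LINES = [
    "[2018-06-07 09:12:40,720][WARN ][transport                ] "
    "[iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has "
    "timed out, sent [20323ms] ago, timed out [5323ms] ago, "
    "action [cluster:monitor/nodes/stats[n]], node "
    "[{iz2ze1qzjeqgsoycwvxouzz}{qT0-LDQVRMu6t36N5regGg}{xxx.xx.xx.xx}"
    "{xxx.xx.xx.xx:9300}{master=false}], id [300305326]",
    "[2018-06-07 09:12:41,787][WARN ][transport                ] "
    "[iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has "
    "timed out, sent [30690ms] ago, timed out [690ms] ago, "
    "action [internal:discovery/zen/fd/ping], node "
    "[{iz2ze1qzjeqgsoycwvxov0z_2}{bbdJtx1DRCqQKSuG5vnDTg}{xxx.xx.xx.xx}"
    "{xxx.xx.xx.xx:9302}{master=false}], id [300305119]",
]


def implied_timeouts(lines):
    """Return (action, timeout_ms) for every timed-out request warning."""
    result = []
    for line in lines:
        m = TIMED_OUT.search(line)
        if m:
            # sent-ago minus timed-out-ago = the timeout that was in effect
            result.append((m["action"], int(m["sent"]) - int(m["late"])))
    return result


if __name__ == "__main__":
    for action, timeout_ms in implied_timeouts(LINES):
        print(action, timeout_ms)
```

For these two lines the subtraction gives 15000 ms for the nodes stats request and 30000 ms for the fault-detection ping. The second figure matches a 30s ping timeout.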

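And with that the cluster was down. Yet even though the master kept timing out, the cluster never elected a new master. All ping-related settings are at their defaults. Overnight and in the morning, both traffic and write volume were very low, and the external environment looked completely normal; the cluster broke on its own. We have hit this kind of problem twice before. We don't know why it happens, but restarting the whole cluster has basically fixed it each time. The catch is that a restart takes time, and during it the search service is unavailable and writes are blocked, which affects the business.
Has anyone run into something similar? What might the cause be? Is there a way to catch it early through monitoring or alerting? Could it be related to our running on Alibaba Cloud machines?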
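
On the monitoring question, one possible starting point is to poll the cluster health and node JVM stats APIs from outside the cluster and raise an alert when they look unhealthy or stop answering. The sketch below is only an illustration under stated assumptions. The base URL, the 85% heap limit, the 5 second timeout and the function names are all made up for it. It reads only the `status` and `unassigned_shards` fields of `/_cluster/health` and the `jvm.mem.heap_used_percent` field of `/_nodes/stats/jvm`.

```python
import json
from urllib.request import urlopen


def fetch_json(base_url, path, timeout=5):
    """GET base_url + path and decode the JSON body (timeout is an assumption)."""
    with urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.load(resp)


def evaluate(health, node_stats, heap_pct_limit=85):
    """Turn API responses into a list of alert strings (empty means healthy).

    heap_pct_limit is an arbitrary example threshold, not a recommendation.
    """
    alerts = []
    status = health.get("status")
    if status != "green":
        alerts.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        alerts.append(f"{unassigned} unassigned shards")
    for node_id, node in node_stats.get("nodes", {}).items():
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= heap_pct_limit:
            alerts.append(f"node {node.get('name', node_id)} heap at {heap}%")
    return alerts


def check(base_url="http://localhost:9200"):
    """Poll once; a slow or failed request is itself treated as an alert."""
    try:
        health = fetch_json(base_url, "/_cluster/health")
        stats = fetch_json(base_url, "/_nodes/stats/jvm")
    except OSError as exc:  # covers URLError and socket timeouts
        return [f"cluster API unreachable or slow: {exc}"]
    return evaluate(health, stats)


if __name__ == "__main__":
    for alert in check():
        print(alert)
```

Treating a timed-out request as an alert matters here, since in the incident described the _cat APIs were slow or unresponsive before anything else was noticed.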

1 reply

zqc0512 - andy zhou

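You've hit a performance bottleneck. Increase the connection timeout... the default 30s is no longer enough for requests to be processed in time.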
