你不会是程序猿吧?

elasticsearch集群不明原因挂掉

Elasticsearch | 作者 osborn | 发布于2018年06月07日 | 阅读数:9838

集群是开在阿里云上的虚拟主机,有9台机器,32核的,上面布了10几个节点,3个备选master。早上集群在访问量很小的情况下自己挂掉了,所有shard全都是unassigned,es进程还在,但调用_cat接口返回特别慢,或者干脆没响应。选举出的master所在的主机负载很高。
一查日志,发现很多data节点从半夜里就开始记录一些gc信息,在平时的日志里没见过:
[2018-06-07 01:30:04,203][WARN ][monitor.jvm              ] [iz2ze1qzjeqgsoycwvxov0z_1] [gc][young][4094764][158050] duration [1.4s], collections [1]/[2.2s], total [1.4s]/[58.4m], memory [4.5gb]->[3.4gb]/[9.8gb], all_pools {[young] [1.4
gb]->[18.2mb]/[1.4gb]}{[survivor] [177.7mb]->[191.3mb]/[191.3mb]}{[old] [2.9gb]->[3.2gb]/[8.1gb]}
[2018-06-07 01:30:34,646][WARN ][monitor.jvm              ] [iz2ze1qzjeqgsoycwvxov0z_1] [gc][young][4094794][158052] duration [1s], collections [1]/[1.1s], total [1s]/[58.5m], memory [4.9gb]->[3.6gb]/[9.8gb], all_pools {[young] [1.4gb]-
>[16.5kb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [3.3gb]->[3.4gb]/[8.1gb]}
大约几十秒记一条。

然后master节点,先是报很多这个错误:
[2018-06-07 00:25:02,513][DEBUG][action.admin.cluster.stats] [iz2ze1qzjeqgsoycwvxov0z_master] failed to execute on node [7l05rRHmQeKQ4k6QUFRu3A]
RemoteTransportException[[xxx.xx.xx.xx][xxx.xx.xx.xx:9300][cluster:monitor/stats[n]]]; nested: AlreadyClosedException[this IndexReader is closed];
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed

[2018-06-07 02:35:47,285][DEBUG][action.admin.cluster.stats] [iz2ze1qzjeqgsoycwvxov0z_master] failed to execute on node [7l05rRHmQeKQ4k6QUFRu3A]
RemoteTransportException[[xxx.xx.xx.xx][xxx.xx.xx.xx:9300][cluster:monitor/stats[n]]]; nested: AlreadyClosedException[this IndexReader cannot be used anymore as one of its child readers was closed];
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader cannot be used anymore as one of its child readers was closed

然后开始报超时错误:
[2018-06-07 09:12:40,720][WARN ][transport                ] [iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has timed out, sent [20323ms] ago, timed out [5323ms] ago, action [cluster:monitor/nodes/stats[n]], node [
{iz2ze1qzjeqgsoycwvxouzz}{qT0-LDQVRMu6t36N5regGg}{xxx.xx.xx.xx}{xxx.xx.xx.xx:9300}{master=false}], id [300305326]
[2018-06-07 09:12:41,787][WARN ][transport                ] [iz2ze1qzjeqgsoycwvxov0z_master] Received response for a request that has timed out, sent [30690ms] ago, timed out [690ms] ago, action [internal:discovery/zen/fd/ping], node [{
iz2ze1qzjeqgsoycwvxov0z_2}{bbdJtx1DRCqQKSuG5vnDTg}{xxx.xx.xx.xx}{xxx.xx.xx.xx:9302}{master=false}], id [300305119]

于是集群就宕掉了。但master不断超时的情况下集群也没有选举新的master。ping相关的配置用的都是默认值。夜里和早上访问量、写入都很低,外部环境看上去一切正常,集群自己出问题。这类问题之前也碰到过2次,不知为什么会发生,但基本上重启一遍集群就好了。只是重启集群需要时间,这期间搜索服务一直不可用,也不能写入,影响业务。
有没有人碰到过类似的情况?可能会是什么原因?有办法靠监控或者预警等手段提前发现吗?有没有可能跟我们用阿里云的机器有关系?
已邀请:

zqc0512 - andy zhou

赞同来自:

有性能瓶颈了。连接超时时间调整大些……默认30s处理不过来了。

要回复问题请先登录注册