Cluster recovery is stuck

Elasticsearch | Author: juin | Posted on October 8, 2018 | Views: 3576

Version: 6.1.1
Nodes: 4(master+data)
Indices: 3352
Memory: 51GB / 123GB
Total Shards: 33379
Documents: 1,456,904,986
Data: 492GB

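Before the National Day holiday I did a rolling restart, expecting the data to recover gradually, but when I came back I found that recovery had stalled.
Checking the recovery tasks: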

GET _cat/recovery?v&h=i,s,t,ty,st,fp,bp,shost,thost

Checking the thread pools:

GET _cat/thread_pool?v

The whole cluster is in yellow status, and every recovery task that copies data from another node is stuck. Here are my cluster settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "allow_rebalance": "always",
          "cluster_concurrent_rebalance": "500",
          "enable": "all"
        }
      }
    },
    "discovery": {
      "zen": {
        "minimum_master_nodes": "2"
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "600mb"
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_incoming_recoveries": "500",
          "node_initial_primaries_recoveries": "4",
          "enable": "all",
          "node_concurrent_outgoing_recoveries": "500"
        }
      }
    }
  }
}

Where is the problem coming from?

3 replies

juin - Big data developer

I removed the replicas, and during rebalancing the log keeps reporting:

[WARN ][o.e.i.c.IndicesClusterStateService] [es-node-211] [[my_index][4]] marking and sending shard failed due to [failed recovery]

org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][4]: Recovery failed from {es-node-211}{WPC32CtxTtOiTPuCseqF8g}{AyjHnVtwSnik2Rcu_SQg8A}{192.168.0.205}{192.168.0.205:9300} into {es-node-208}{fa9ZVqyXSHKhYHvAhr8x6w}{ECrBVQS_QPOXtc9E0is9Tw}{192.168.0.149}{192.168.0.149:9300}


zqc0512 - andy zhou

"node_concurrent_incoming_recoveries": "500",
"node_concurrent_outgoing_recoveries": "500"
"cluster_concurrent_rebalance": "500",
Do you have many nodes? These values are too large.
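
One way to act on this, sketched in Python under stated assumptions: Elasticsearch drops a cluster-level override when the setting is assigned `null`, after which the built-in default applies again. The helper below only builds the request body for `PUT _cluster/settings`; `build_reset_body` is a hypothetical name, and the persistent/transient split simply mirrors where each value appears in the settings posted above. It sends nothing to the cluster.

```python
import json

PREFIX = "cluster.routing.allocation."


def build_reset_body(persistent_keys, transient_keys):
    """Build a _cluster/settings body that clears the given overrides.

    None is serialized as JSON null, which asks Elasticsearch to remove
    the override so the setting falls back to its default value.
    """
    return {
        "persistent": {key: None for key in persistent_keys},
        "transient": {key: None for key in transient_keys},
    }


# The three settings quoted in this reply, placed in the same sections
# (persistent vs. transient) as in the original poster's configuration.
body = build_reset_body(
    persistent_keys=[PREFIX + "cluster_concurrent_rebalance"],
    transient_keys=[
        PREFIX + "node_concurrent_incoming_recoveries",
        PREFIX + "node_concurrent_outgoing_recoveries",
    ],
)

print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

The printed body can be pasted into the console next to the `GET _cat/recovery` call used earlier. Whether the defaults are the right target for a 4-node cluster with this many shards is a judgment call; an explicit small number could be substituted for `None`.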

yayg2008

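My guess is that your recovery settings are unreasonable: they are too large.
For example, the recovery bandwidth limit of 600mb means 600 MB per second, which needs at least a 10 GbE NIC.
Likewise, 500 concurrent rebalances for cluster_concurrent_rebalance is far-fetched unless your machines are extremely powerful.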
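
The NIC arithmetic can be checked with a few lines of Python. This is a rough sketch, assuming that `mb` in Elasticsearch byte-size settings means 1024 * 1024 bytes and ignoring protocol overhead; `required_gbit_per_sec` is a hypothetical helper name, and the link speeds are nominal line rates.

```python
def required_gbit_per_sec(mb_per_sec, bytes_per_mb=1024 * 1024):
    """Convert a throttle in MB/s into the link rate it implies, in Gbit/s."""
    return mb_per_sec * bytes_per_mb * 8 / 1e9


# indices.recovery.max_bytes_per_sec from the posted settings: 600mb
need = required_gbit_per_sec(600)

# Compare against nominal link speeds (assumed values, no overhead).
for name, link_gbit in [("1 GbE", 1.0), ("10 GbE", 10.0)]:
    verdict = "enough" if link_gbit >= need else "too slow"
    print(f"{name}: {verdict} for {need:.2f} Gbit/s")
```

Under these assumptions the limit works out to roughly 5 Gbit/s per node, several times what a gigabit link can carry, so on a 1 GbE network the throttle never actually takes effect and the network itself becomes the bottleneck.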
