Some shards cannot be allocated after a node crashed and lost its data (can not allocate)

Elasticsearch | Author: coding_hl | Posted 2022-06-08 | Views: 2626

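Cluster: 3 nodes in total, 1 master node and 2 data nodes.
Index configuration: 1 shard and 1 replica; the primary and the replica are distributed randomly across the 2 data nodes.
Scenario: data node 1 went down because of a disk failure. The server crashed and the node dropped out of the cluster. To keep the business running, ingestion into the cluster continued in the meantime.
After the node went offline the disk could not be recovered and all of its data was lost. We redeployed an ES instance on that server with the original configuration. After startup the node joined the cluster and most indices returned to normal, but the shards of some indices have stayed in the unassigned state.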

unassigned.reason:CLUSTER_RECOVERED
allocate_explanation: cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster
GET _cluster/allocation/explain
{
  "index": "xxxxx",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2022-06-01T08:03:26.148Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "99djMuR7QvSsFldV-HnpIQ",
      "node_name": "node-2",
      "transport_address": "10.0.0.208:9300",
      "node_attributes": {
        "ml.machine_memory": "67293700096",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "HyR6JVkjRwmGcfqYtkFd0Q",
      "node_name": "node-3",
      "transport_address": "10.0.0.209:9300",
      "node_attributes": {
        "ml.machine_memory": "67295186944",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
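To read an explain response like this one at a glance, here is a minimal Python sketch that lists which nodes report an on-disk copy of the shard. The function name `summarize_explain` and the trimmed sample dictionary are illustrative assumptions, not part of any Elasticsearch client.

```python
import json

# Trimmed copy of the explain response; only the fields the sketch reads.
EXPLAIN_RESPONSE = json.loads("""
{
  "index": "xxxxx",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "can_allocate": "no_valid_shard_copy",
  "node_allocation_decisions": [
    {"node_name": "node-2", "node_decision": "no", "store": {"found": false}},
    {"node_name": "node-3", "node_decision": "no", "store": {"found": false}}
  ]
}
""")


def summarize_explain(resp):
    """Return (nodes_with_copy, nodes_without_copy) for an explain response."""
    with_copy, without_copy = [], []
    for decision in resp.get("node_allocation_decisions", []):
        # A missing "store" section is treated as "no copy found".
        found = decision.get("store", {}).get("found", False)
        (with_copy if found else without_copy).append(decision["node_name"])
    return with_copy, without_copy


with_copy, without_copy = summarize_explain(EXPLAIN_RESPONSE)
print("nodes reporting an on-disk copy:", with_copy)
print("nodes reporting no copy:", without_copy)
```

An empty first list means no data node currently reports a copy of the shard on disk, which matches the `no_valid_shard_copy` status in the response.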
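Because the index has only 1 shard, once that shard cannot be allocated the index can no longer be queried in the cluster.
GET _cat/recovery/XXX returns the following: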
bytes_total translog_ops translog_ops_recovered translog_ops_percent

Trying a manual allocation did not work either:
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "xxxxx",
        "shard": 0,
        "node": "node-2",
        "accept_data_loss": true
      }
    }
  ]
}

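Questions:
After the node went offline, the replica shards on the other node should all have been promoted to primaries and kept serving operations normally. Once the node rejoined the cluster (with its data already lost), the cluster should have recovered the data onto it and returned to green. So why did that not work for these primary shards? Posts online say the data is corrupted, but the other node was working normally the whole time, so why would its data be corrupted?
The cluster has no snapshots. Is there any way to recover this data? Everything I find online suggests allocating an empty primary shard, which would lose that shard's data.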
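For reference, the empty-primary approach mentioned here goes through the same `_cluster/reroute` API, using the `allocate_empty_primary` command instead of `allocate_stale_primary`. The sketch below only builds the request body; the helper name `reroute_body` is hypothetical, and the index and node names are the placeholders used in this thread. Applying the result creates an empty shard, so whatever data the shard held is given up.

```python
import json


def reroute_body(command, index, shard, node):
    """Build a _cluster/reroute request body for a forced primary allocation."""
    if command not in ("allocate_stale_primary", "allocate_empty_primary"):
        raise ValueError("unsupported command: " + command)
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Both commands require this explicit acknowledgement.
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Index and node names are the placeholders from this thread.
body = reroute_body("allocate_empty_primary", "xxxxx", 0, "node-2")
print(json.dumps(body, indent=2))
```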

5 replies

coding_hl

The ES version is 6.6.2.

Charele - Cisco4321

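What you said, that "the replica shards on the other node should all have been promoted to primaries", should be true in theory.
But theory is theory; nobody can say for sure what actually happened.

You can first run GET xxxxx to look up the index's uuid,

then go into the ES data directory and check whether that index exists there.
If it does not, or if the index held a lot of data but its directory is only a few KB or a few MB, then the data really is gone.
If it is there and the size is about right, there may still be other options.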
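The directory check described above can be sketched in Python as follows. It assumes the default 6.x on-disk layout `<path.data>/nodes/0/indices/<index-uuid>`; the function name `index_dir_size` is hypothetical, and the data path must be taken from the node's own `path.data` setting.

```python
import os


def index_dir_size(path_data, index_uuid, node_ordinal=0):
    """Return total bytes under the index's directory, or None if it is absent.

    Assumes the layout <path.data>/nodes/<ordinal>/indices/<index-uuid>.
    """
    index_dir = os.path.join(
        path_data, "nodes", str(node_ordinal), "indices", index_uuid
    )
    if not os.path.isdir(index_dir):
        return None
    total = 0
    # Sum the size of every file below the index directory.
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

A result of `None`, or a size far below what the index used to hold, would suggest the shard data is no longer on that node.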

coding_hl

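I checked the data directories and the data is indeed gone. What puzzles me is this: after data node 1 went offline, data node 2 still held a copy of the index, so the cluster should have promoted that replica shard to primary. In fact, data node 1 has no shard information for this index at all. Very strange.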

coding_hl

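This is a 3-node cluster, and data node 1 was the elected master when it went offline. After that server died the cluster could not elect a master, because discovery.zen.minimum_master_nodes was set to 3, so I manually changed the value to 1 and restarted the other two nodes. I don't know whether that is what caused this.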
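For context on the setting discussed here: with Zen discovery in 6.x, the documented recommendation for discovery.zen.minimum_master_nodes is a quorum of the master-eligible nodes, (master_eligible_nodes / 2) + 1. The Python sketch below works through that arithmetic for a cluster with three master-eligible nodes; the helper names are illustrative only.

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible, configured_minimum):
    """Master-eligible nodes that can be lost while an election is still possible."""
    return max(master_eligible - configured_minimum, 0)


# Compare the values from this thread (3 and 1) with the quorum value.
for setting in (1, minimum_master_nodes(3), 3):
    print("minimum_master_nodes =", setting,
          "-> tolerated failures:", tolerated_master_failures(3, setting))
```

With three master-eligible nodes the quorum is 2. A value of 3 tolerates no failures, which is why no master could be elected once one node died. A value of 1 lets a single node elect itself, which removes the protection against split brain.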

coding_hl

Node a: node.master: true node.data: false
Node b: node.master: true node.data: true
Node c: node.master: true node.data: true

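At that point node c was the elected master. Node c crashed because its disk failed. The cluster then had no master and could not hold an election (discovery.zen.minimum_master_nodes=3), so I changed discovery.zen.minimum_master_nodes to 1 on a and b and restarted both to get the cluster working again.
Later, recovery of the disk on node c failed and the data was simply gone, so I redeployed an ES instance on c with c's original configuration but without any data. After c started, it joined the cluster normally.
The indices are configured with one shard and one replica, so in practice each index had one copy on b and one copy on c.
Before c rejoined the cluster, most indices were in yellow status.
After c rejoined, the cluster automatically placed a copy of each shard from b onto c, and those indices turned green.

For the indices that cannot be recovered, the data directory no longer exists. In principle, when c went offline there was still a copy on b, which should have been promoted to primary and kept serving reads and writes. So why is the data for some shards on b gone?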
