Luminous: resilience - private interface down, no read/write

Hi Ceph users,

We have a cluster with 5 nodes (67 disks), an EC 4+1 configuration, and min_size set to 4.
Ceph version: 12.2.5
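
For reference, the pool was created along these lines (profile/pool names and PG counts here are placeholders, not our exact commands):

    ceph osd erasure-code-profile set ec41 k=4 m=1 crush-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec41
    # min_size = k, so the pool keeps serving I/O with one chunk missing
    ceph osd pool set ecpool min_size 4
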
While executing one of our resilience use cases, taking the private interface down on one of the nodes, we saw only a short RADOS outage (about 60s) on releases up to Kraken.
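
We inject the failure by simply downing the cluster-network NIC on that node, something like (the interface name is illustrative):

    ip link set eth1 down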

Now with Luminous, we see a RADOS read/write outage of more than 200s. In the logs we can see that peer OSDs report the OSDs on that node as down, but the affected OSDs defend themselves, claiming they were wrongly marked down, and do not move to the down state for a long time.

2018-05-22 05:37:17.871049 7f6ac71e6700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1 down, but it is still running
2018-05-22 05:37:17.871072 7f6ac71e6700  0 log_channel(cluster) log [DBG] : map e35690 wrongly marked me down at e35689
2018-05-22 05:37:17.878347 7f6ac71e6700  0 osd.1 35690 crush map has features 1009107927421960192, adjusting msgr requires for osds
2018-05-22 05:37:18.296643 7f6ac71e6700  0 osd.1 35691 crush map has features 1009107927421960192, adjusting msgr requires for osds
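
Is tuning the OSD markdown tracking the right way to avoid this flapping? A sketch of what we have in mind for ceph.conf (values are illustrative, not what we currently run):

    [osd]
    # If an OSD gets marked down more than osd_max_markdown_count times
    # within osd_max_markdown_period seconds, it exits instead of
    # insisting it was wrongly marked down and rejoining.
    osd max markdown count = 2
    osd max markdown period = 600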


Only when all 67 OSDs have moved to the down state does the read/write traffic resume.
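
We watch these state transitions with the standard status commands while the test runs, for example:

    ceph -s          # overall cluster health and client I/O
    ceph osd stat    # up/in counts
    ceph osd tree    # per-host view of which OSDs are down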

Could you please help us resolve this issue? If it is a bug, we will create a corresponding tracker ticket.

Thanks,
Muthu
