Stuck down+peering after host failure.

Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> · Mon, 11 Dec 2017 14:02:37 +0000

Morning All,
I have a large-ish (16 node, 1100 osds) cluster that I recently had to move from one DC to another. Before shutting everything down, I set noout, norecover, and nobackfill, thinking this would help everything stand back up again. Upon installation at the new DC, one of the nodes refused to boot. With my crush rule having the failure domain as host, I did not think this would be a problem. However, once I turned off noout, norecover, and nobackfill and everything else came up and settled in, I still have 1545 pgs stuck down+peering. On other pgs, recovery and backfilling are proceeding as expected, but these pgs appear to be permanently stuck. When querying the down+peering pgs, they all list osds from the down node in "down_osds_we_would_probe". I'm not sure why they *need* to probe these, since there should be two other copies on other nodes. I'm also not sure whether bringing everything up with noout or norecover set confused things. Looking for advice...
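
For anyone who wants to check the same thing on their own cluster, here is a rough sketch of how the "down_osds_we_would_probe" entries could be tallied across stuck pgs. It assumes that `ceph pg <pgid> query` prints JSON with a top-level "recovery_state" list, and that the entries of that list may carry a "down_osds_we_would_probe" array. The function names are made up for illustration and are not part of any Ceph tooling.

```python
import json
import subprocess
from collections import Counter


def down_osds_blocking(pg_query: dict) -> set:
    """Collect the OSD ids listed under down_osds_we_would_probe for one pg."""
    osds = set()
    # Assumption: each recovery_state entry is a dict; only some carry the key.
    for state in pg_query.get("recovery_state", []):
        osds.update(state.get("down_osds_we_would_probe", []))
    return osds


def query_pg(pgid: str) -> dict:
    """Run `ceph pg <pgid> query` and parse its output (needs a working ceph CLI)."""
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)


def tally(pg_queries: dict) -> Counter:
    """Count, per down OSD, how many of the given pgs want to probe it."""
    counts = Counter()
    for result in pg_queries.values():
        counts.update(down_osds_blocking(result))
    return counts
```

Something like `tally({pgid: query_pg(pgid) for pgid in stuck_pgids})`, with `stuck_pgids` filled in by hand from the list of down+peering pgs, would then show whether every stuck pg points at osds on the one dead host.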

Aaron

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com