Ceph 12.2.11, pool size 3, min_size 2.
One node went down today (the private network interface started flapping, and after a while the OSD processes crashed). No big deal; the cluster recovered, but not completely: one PG is stuck in the active+clean+remapped state.
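Here's the stuck PG's entry from ceph pg dump, pulled with something like this (fields split out one per line below, just for readability):

ceph pg dump pgs | grep ^20.a2    # stats line for the stuck PG only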
PG_STAT:            20.a2
OBJECTS:            511
MISSING_ON_PRIMARY: 0
DEGRADED:           0
MISPLACED:          511
UNFOUND:            0
BYTES:              1584410172
LOG:                1500
DISK_LOG:           1500
STATE:              active+clean+remapped
STATE_STAMP:        2019-03-26 20:50:18.639452
VERSION:            96149'189204
REPORTED:           96861:935872
UP:                 [26,14]
UP_PRIMARY:         26
ACTING:             [26,14,9]
ACTING_PRIMARY:     26
LAST_SCRUB:         96149'189204
SCRUB_STAMP:        2019-03-26 10:47:36.174769
LAST_DEEP_SCRUB:    95989'187669
DEEP_SCRUB_STAMP:   2019-03-22 23:29:02.322848
SNAPTRIMQ_LEN:      0
So the up set is [26,14] while the acting set is [26,14,9]. As far as I can see there's nothing wrong with any of those OSDs: they are up, they host other PGs, they peer with each other, etc. I tried restarting all three of them, one after another, without any success.
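Roughly what I checked and how I restarted them, for reference (a sketch; this assumes systemd-managed OSDs, and osd.9 stands in for all three):

ceph osd tree                  # osd.9, osd.14 and osd.26 all show as up/in with normal weights
ceph pg 20.a2 query            # peering/recovery info for the PG, nothing reported down or blocked
systemctl restart ceph-osd@9   # then ceph-osd@14 and ceph-osd@26, one at a time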
OSD 9 hosts 95 other PGs, so I don't think it's PG overdose.
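For completeness, per-OSD PG counts like that one can be checked in the PGS column of:

ceph osd df    # per-OSD utilization, including a PGS column with PG counts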
The last log line from osd.9 that mentions PG 20.a2:
2019-03-26 20:50:16.294500 7fe27963a700 1 osd.9 pg_epoch: 96860 pg[20.a2( v 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511 ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611 96859/96860/96855) [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1 crt=96149'189204 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
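That was grabbed with something like this, assuming the default log location:

grep '20.a2' /var/log/ceph/ceph-osd.9.log | tail -n 1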
Nothing else out of the ordinary in that log, just the usual scrub/deep-scrub notifications.
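One thing I'm considering trying next, in case CRUSH is simply failing to pick a third OSD for this PG: export the crushmap and test the rule offline, roughly like this (rule id 0 is a placeholder, the real id would come from the rule dump):

ceph osd crush rule dump                       # find the rule id used by pool 20
ceph osd getcrushmap -o crushmap.bin           # export the current crushmap
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings   # list mappings that come up short of 3 OSDs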
Any ideas what this could be, or any other steps to troubleshoot it?