Ceph 12.2.11, pool size 3, min_size 2.
One node went down today (the private network interface started flapping, and after a while the OSD processes crashed). No big deal; the cluster recovered, but not completely: one PG is stuck in the active+clean+remapped state.
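Here's the stuck PG's entry from ceph pg dump, pulled with something like this (fields split out one per line below, just for readability):

ceph pg dump pgs | grep ^20.a2    # stats line for the stuck PG only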
PG_STAT:            20.a2
OBJECTS:            511
MISSING_ON_PRIMARY: 0
DEGRADED:           0
MISPLACED:          511
UNFOUND:            0
BYTES:              1584410172
LOG:                1500
DISK_LOG:           1500
STATE:              active+clean+remapped
STATE_STAMP:        2019-03-26 20:50:18.639452
VERSION:            96149'189204
REPORTED:           96861:935872
UP:                 [26,14]
UP_PRIMARY:         26
ACTING:             [26,14,9]
ACTING_PRIMARY:     26
LAST_SCRUB:         96149'189204
SCRUB_STAMP:        2019-03-26 10:47:36.174769
LAST_DEEP_SCRUB:    95989'187669
DEEP_SCRUB_STAMP:   2019-03-22 23:29:02.322848
SNAPTRIMQ_LEN:      0
So the up set is [26,14] while the acting set is [26,14,9]. As far as I can see there's nothing wrong with any of those OSDs: they are up, they host other PGs, they peer with each other, etc. I tried restarting all three of them, one after another, without any success.
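Roughly what I checked and how I restarted them, for reference (a sketch; this assumes systemd-managed OSDs, and osd.9 stands in for all three):

ceph osd tree                  # osd.9, osd.14 and osd.26 all show as up/in with normal weights
ceph pg 20.a2 query            # peering/recovery info for the PG, nothing reported down or blocked
systemctl restart ceph-osd@9   # then ceph-osd@14 and ceph-osd@26, one at a time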
OSD 9 hosts 95 other PGs, so I don't think it's PG overdose.
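For completeness, per-OSD PG counts like that one can be checked in the PGS column of:

ceph osd df    # per-OSD utilization, including a PGS column with PG counts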
The last log line from osd.9 that mentions PG 20.a2:
2019-03-26 20:50:16.294500 7fe27963a700 1 osd.9 pg_epoch: 96860 pg[20.a2( v 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511 ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611 96859/96860/96855) [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1 crt=96149'189204 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
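That was grabbed with something like this, assuming the default log location:

grep '20.a2' /var/log/ceph/ceph-osd.9.log | tail -n 1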
Nothing else out of the ordinary in that log, just the usual scrub/deep-scrub notifications.
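One thing I'm considering trying next, in case CRUSH is simply failing to pick a third OSD for this PG: export the crushmap and test the rule offline, roughly like this (rule id 0 is a placeholder, the real id would come from the rule dump):

ceph osd crush rule dump                       # find the rule id used by pool 20
ceph osd getcrushmap -o crushmap.bin           # export the current crushmap
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings   # list mappings that come up short of 3 OSDs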
Any ideas what this could be, or any other steps to troubleshoot it?