On 07/12/2018 14.48, Jonas Jelten wrote:
> On 06/12/2018 19.25, Gregory Farnum wrote:
>>> So, overall I suspect there is a bug which prevents remapped pg data to be discovered. The PG already knows which OSD is
>>> the correct candidate, but does not query it.
>>>
>>>
>>> I can try fixing this myself, but I'd need some hints from the developers to relevant code parts.
>>>
>>> The OSD is stored correctly in pg->might_have_unfound, and I think it should be queried in PG::discover_all_missing, but
>>> I'm lost there. I'd appreciate any help tracking this down.
>>
>> Do you have logging indicating that this particular function is where
>> it goes wrong, or did you find it by inspection?
>> Since it sounds like this is pretty reproducible, I would try doing
>> that with "debug osd = 20" set, and read through the primary's log
>> very carefully while it makes these decisions.
>> -Greg
>>
>
> I found that function by inspection of the sources and trying to figure
> out where the status displayed in pg query might emerge.
>
> I'll see if I can set up a test cluster and reproduce it there, I'd
> rather not put the production cluster under more load than necessary
> once again :)
>
>
> -- Jonas
>

I now tested this on a 4-node 16-osd 3-replica-only cluster.

Easy steps to reproduce seem to be:

* Have a healthy cluster
* ceph osd set pause  # make sure no writes mess up the test
* ceph osd set nobackfill
* ceph osd set norecover  # make sure the error is not recovered but instead stays
* ceph tell 'osd.*' injectargs '--debug_osd=20/20'  # turn up logging
* ceph osd out $osdid  # take out a random osd
* observe the state, now objects are degraded already, check pg query.
  In my test, I observe that $osdid was "already probed" but it does have the data,
  the cluster was completely healthy before.
* ceph osd down $osdid  # repeer this osd, it'll come up again right away
* observe the state again, even more objects are degraded now, check pg query.
  In my test, $osdid is now "not queried"
* ceph osd in $osdid  # everything turns back to normal and healthy
* ceph tell 'osd.*' injectargs '--debug_osd=1/5'  # silence logging again
* ceph osd unset ...  # unset the flags

In summary: while recovery is prevented, an out OSD produces degraded objects.
An out and repeered OSD produces even more degraded objects.
Taking it in again will discover all missing object copies.
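
For convenience, the steps above could be wrapped in a few shell functions. This is only a rough sketch, not something I ran in exactly this form: the function names, the DRY_RUN switch and the pgid argument are made up for illustration, and I'm assuming that "ceph osd unset ..." means unsetting the same three flags that were set at the start. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Prints the commands by default;
# set DRY_RUN=0 to really run them (on a test cluster only!).
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# observe PGID: show the cluster state and query one affected pg.
# (PGID is a hypothetical parameter, pick a pg mapped to the osd.)
observe() {
    run ceph -s
    run ceph pg "$1" query
}

# Freeze client io and recovery, turn up osd logging.
repro_prepare() {
    run ceph osd set pause
    run ceph osd set nobackfill
    run ceph osd set norecover
    run ceph tell 'osd.*' injectargs '--debug_osd=20/20'
}

# repro_trigger OSDID PGID: mark the osd out, then force a repeer.
# The osd comes back up right away after "down", so the second
# observation may need a short wait when run for real.
repro_trigger() {
    run ceph osd out "$1"
    observe "$2"
    run ceph osd down "$1"
    observe "$2"
}

# repro_cleanup OSDID: take the osd in again and undo the settings.
repro_cleanup() {
    run ceph osd in "$1"
    run ceph tell 'osd.*' injectargs '--debug_osd=1/5'
    # Assumption: unset exactly the flags set in repro_prepare.
    for flag in pause nobackfill norecover; do
        run ceph osd unset "$flag"
    done
}
```

Usage would be something like "repro_prepare; repro_trigger 3 1.0; repro_cleanup 3" after sourcing the file, where 3 and 1.0 are placeholder osd and pg ids. Keeping dry-run as the default is deliberate, since these commands pause client io on the whole cluster.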