Re: Degraded PG does not discover remapped data on originating OSD

Jonas Jelten <jelten@xxxxxxxxx> · Thu, 13 Dec 2018 13:38:50 +0100

On 07/12/2018 14.48, Jonas Jelten wrote:
> On 06/12/2018 19.25, Gregory Farnum wrote:
>>> So, overall I suspect there is a bug which prevents remapped PG data from being discovered. The PG already knows
>>> which OSD is the correct candidate, but does not query it.
>>>
>>>
>>> I can try fixing this myself, but I'd need some hints from the developers to relevant code parts.
>>>
>>> The OSD is stored correctly in pg->might_have_unfound, and I think it should be queried in PG::discover_all_missing, but
>>> I'm lost there. I'd appreciate any help tracking this down.
>>
>> Do you have logging indicating that this particular function is where
>> it goes wrong, or did you find it by inspection?
>> Since it sounds like this is pretty reproducible, I would try doing
>> that with "debug osd = 20" set, and read through the primary's log
>> very carefully while it makes these decisions.
>> -Greg
>>
> 
> I found that function by inspection of the sources and trying to figure
> out where the status displayed in pg query might emerge.
> 
> I'll see if I can set up a test cluster and reproduce it there, I'd
> rather not put the production cluster under more load than necessary
> once again :)
> 
> 
> -- Jonas
> 

I have now tested this on a test cluster with 4 nodes and 16 OSDs, using 3-replica pools only.
The following steps seem to reproduce it easily:

* Have a healthy cluster
* ceph osd set pause                                # make sure no writes mess up the test
* ceph osd set nobackfill
* ceph osd set norecover                            # make sure the error is not recovered but instead stays
* ceph tell 'osd.*' injectargs '--debug_osd=20/20'  # turn up logging
* ceph osd out $osdid                               # take out a random osd
* observe the state: objects are already degraded at this point. Check pg query.
  In my test, $osdid is listed as "already probed" even though it does hold the data;
  the cluster was completely healthy before.
* ceph osd down $osdid                              # mark this osd down to force re-peering, it comes up again right away
* observe the state again: even more objects are degraded now. Check pg query.
  In my test, $osdid is now listed as "not queried".
* ceph osd in $osdid                                # everything turns back to normal and healthy
* ceph tell 'osd.*' injectargs '--debug_osd=1/5'    # silence logging again
* ceph osd unset ...                                # unset the flags
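
For convenience, the steps above can be collected into a small script. This is only a sketch: the function names and
the DRY_RUN switch are hypothetical helpers of mine, "ceph health detail" is just one way to look at the degraded
counts, and the three flags unset at the end are simply the ones the script set itself. By default it only prints
the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN, run, observe and reproduce
# are hypothetical helper names, not part of ceph.
# DRY_RUN defaults to 1: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show cluster health; on a real run, wait so pg query can be inspected by hand.
observe() {
    run ceph health detail
    if [ "$DRY_RUN" != "1" ]; then
        printf 'Inspect "ceph pg <pgid> query" now, then press Enter: '
        read -r _
    fi
}

# reproduce OSDID: run the whole sequence against one osd id.
reproduce() {
    osdid="$1"
    run ceph osd set pause                                  # no client writes during the test
    run ceph osd set nobackfill
    run ceph osd set norecover                              # keep the degraded state visible
    run ceph tell 'osd.*' injectargs '--debug_osd=20/20'    # turn up logging
    run ceph osd out "$osdid"
    observe                                                 # expect degraded objects
    run ceph osd down "$osdid"                              # forces re-peering
    observe                                                 # expect even more degraded objects
    run ceph osd in "$osdid"                                # back to healthy
    run ceph tell 'osd.*' injectargs '--debug_osd=1/5'      # silence logging again
    for flag in pause nobackfill norecover; do
        run ceph osd unset "$flag"
    done
}
```

After sourcing the file, "reproduce 3" prints the command sequence for osd.3; setting DRY_RUN=0 beforehand runs it
against the cluster, so that should only be done on a test cluster.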

In summary: with recovery prevented, marking an OSD out produces degraded objects, and marking it out and then
re-peering it produces even more. Marking it in again lets the PGs discover all missing object copies.
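
To compare the "already probed" and "not queried" states between the steps without reading the whole pg query output,
something like the following could help. It is a rough sketch that assumes jq is installed and that the query output
carries the candidates as "osd"/"status" pairs in a might_have_unfound list under recovery_state, which is how it
looked in my output; the function name is made up.

```shell
#!/bin/sh
# Sketch: list the might_have_unfound candidates from "ceph pg <pgid> query".
# Assumes jq is available; unfound_candidates is a hypothetical helper name.

# Reads pg query JSON on stdin, prints one "osd.N: status" line per candidate.
# Entries of recovery_state without a might_have_unfound list are skipped.
unfound_candidates() {
    jq -r '.recovery_state[]? | .might_have_unfound[]? | "osd.\(.osd): \(.status)"'
}

# Usage on a live cluster, with $pgid set to a degraded PG:
#   ceph pg "$pgid" query | unfound_candidates
```

Running this once after "ceph osd out" and once after "ceph osd down" should show the status of $osdid changing.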