If the file structure is corrupted, then all bets are off. You'd have
to characterize precisely the kind of corruption you want handled and
add a feature request for that.
-Sam

On Sat, Dec 27, 2014 at 5:14 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>> Oh, that's a bit less interesting. The bug might still be around, though.
>>> -Sam
>>>
>>> On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>> On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>> You'll have to reproduce with logs on all three nodes. I suggest you
>>>>> open a high-priority bug and attach the logs.
>>>>>
>>>>> debug osd = 20
>>>>> debug filestore = 20
>>>>> debug ms = 1
>>>>>
>>>>> I'll be out for the holidays, but I should be able to look at it when
>>>>> I get back.
>>>>> -Sam
>>>>>
>>>>
>>>> Thanks Sam,
>>>>
>>>> Although I am not sure whether this is of more than historical
>>>> interest (the cluster in question runs Cuttlefish), I'll try to
>>>> collect logs for the scrub.
>>
>> Same issue:
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg15447.html
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg14918.html
>>
>> It looks like the issue is still with us, though it requires metadata
>> or file structure corruption to show itself. I'll check whether it can
>> be reproduced via rsync -X from the secondary PG subdirectory to the
>> primary PG subdirectory, or vice versa. My case shows slightly
>> different pathnames for the same objects with the same checksums,
>> which may be the root cause. Since every case mentioned, including
>> mine, happened after a hardware failure, I suspect that the incurable
>> corruption happens during primary backfill from the active replica at
>> recovery time.
>
> Recovery/backfill from a corrupted primary copy results in a crash
> (attached) of the primary OSD; for example, it can be triggered by
> purging one of the secondary copies (line numbers refer to the top of
> the cuttlefish branch). Since the secondaries preserve the same data
> with the same checksums, it is possible to destroy both the meta
> record and the PG directory and refill the primary. The interesting
> point is that the corrupted primary was completely refilled after the
> hardware failure, but it looks like it survived long enough after the
> failure event to spread the corruption to the copies; I cannot imagine
> a better explanation.
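
For reference, the debug settings Sam lists above belong in the [osd]
section of ceph.conf (or can be injected at runtime), after which a
deep scrub of the affected PG produces the requested logs. A minimal
sketch follows; the OSD target (osd.*) and PG id (2.1f) are
placeholders, and the runtime-injection syntax may differ slightly on
older releases such as cuttlefish:

    # ceph.conf on each of the three OSD nodes, then restart the OSDs
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # or inject at runtime without restarting the daemons
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # trigger a deep scrub on the inconsistent PG, then collect
    # /var/log/ceph/ceph-osd.*.log from all three nodes
    ceph pg deep-scrub 2.1f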