If the file structure is corrupted, then all bets are off. You'd have
to characterize precisely the kind of corruption you want handled and
add a feature request for that.
-Sam

On Sat, Dec 27, 2014 at 5:14 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>> Oh, that's a bit less interesting. The bug might still be around, though.
>>> -Sam
>>>
>>> On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>> On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>> You'll have to reproduce with logs on all three nodes. I suggest you
>>>>> open a high-priority bug and attach the logs.
>>>>>
>>>>> debug osd = 20
>>>>> debug filestore = 20
>>>>> debug ms = 1
>>>>>
>>>>> I'll be out for the holidays, but I should be able to look at it when
>>>>> I get back.
>>>>> -Sam
>>>>>
>>>>
>>>> Thanks Sam,
>>>>
>>>> Although I am not sure whether this is of more than historical
>>>> interest (the cluster in question runs Cuttlefish), I'll try to
>>>> collect logs for the scrub.
>>
>> Same issue:
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg15447.html
>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg14918.html
>>
>> It looks like the issue is still with us, though it requires metadata
>> or file structure corruption to show itself. I'll check whether it can
>> be reproduced via rsync -X from the secondary PG subdirectory to the
>> primary PG subdirectory, or vice versa. My case shows slightly
>> different pathnames for the same objects with the same checksums,
>> which may be the root cause. Since every case mentioned, including
>> mine, happened after a hardware failure, I suspect that the incurable
>> corruption happens during primary backfill from the active replica at
>> recovery time.
>
> Recovery/backfill from a corrupted primary copy results in a crash
> (attached) of the primary OSD; for example, it can be triggered by
> purging one of the secondary copies (line numbers refer to the top of
> the cuttlefish branch). Since the secondaries preserve the same data
> with the same checksums, it is possible to destroy both the meta
> record and the PG directory and refill the primary. The interesting
> point is that the corrupted primary was completely refilled after the
> hardware failure, but it looks like it survived long enough after the
> failure event to spread the corruption to the copies; I cannot imagine
> a better explanation.
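
For reference, the debug settings Sam lists above belong in the [osd]
section of ceph.conf (or can be injected at runtime), after which a
deep scrub of the affected PG produces the requested logs. A minimal
sketch follows; the OSD target (osd.*) and PG id (2.1f) are
placeholders, and the runtime-injection syntax may differ slightly on
older releases such as cuttlefish:

    # ceph.conf on each of the three OSD nodes, then restart the OSDs
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # or inject at runtime without restarting the daemons
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

    # trigger a deep scrub on the inconsistent PG, then collect
    # /var/log/ceph/ceph-osd.*.log from all three nodes
    ceph pg deep-scrub 2.1f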