On Tue, Oct 10, 2017 at 12:30 PM, Daniel Baumann <daniel.baumann@xxxxxx> wrote:
> Hi John,
>
> thank you very much for your help.
>
> On 10/10/2017 12:57 PM, John Spray wrote:
>> A) Do a "rados -p <metadata pool> ls | grep '^506\.'" or similar, to
>> get a list of the objects
>
> done, gives me these:
>
> 506.00000000
> 506.00000017
> 506.0000001b
> 506.00000019
> 506.0000001a
> 506.0000001c
> 506.00000018
> 506.00000016
> 506.0000001e
> 506.0000001f
> 506.0000001d
>
>> B) Write a short bash loop to do a "rados -p <metadata pool> get" on
>> each of those objects into a file.
>
> done, saved each one using the object name as the filename, resulting in
> these 11 files:
>
>   90 Oct 10 13:17 506.00000000
> 4.0M Oct 10 13:17 506.00000016
> 4.0M Oct 10 13:17 506.00000017
> 4.0M Oct 10 13:17 506.00000018
> 4.0M Oct 10 13:17 506.00000019
> 4.0M Oct 10 13:17 506.0000001a
> 4.0M Oct 10 13:17 506.0000001b
> 4.0M Oct 10 13:17 506.0000001c
> 4.0M Oct 10 13:17 506.0000001d
> 4.0M Oct 10 13:17 506.0000001e
> 4.0M Oct 10 13:17 506.0000001f
>
>> C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20",
>> mark the rank repaired, start the MDS again, and then gather the
>> resulting log (it should end in the same "Error -22 recovering
>> write_pos", but have much much more detail about what came before).
>
> I've attached the entire log from right before issuing "repaired" until
> after the MDS drops to standby again.
>
>> Because you've hit a serious bug, it's really important to gather all
>> this and share it, so that we can try to fix it and prevent it
>> happening again to you or others.
>
> absolutely, sure. If you need anything more, I'm happy to share.
>
>> You have two options, depending on how much downtime you can tolerate:
>> - carefully remove all the metadata objects that start with 506. --
>
> given the outage (and people need access to their data), I'd go with
> this. Just to be safe: that would go like this?
>
> rados -p <metadata_pool> rm 506.00000000
> rados -p <metadata_pool> rm 506.00000016

Yes.  Do a final ls to make sure you got all of them -- it is
dangerous to leave any fragments behind.

BTW opened http://tracker.ceph.com/issues/21749 for the underlying bug.

John

> [...]
>
> Regards,
> Daniel
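
For reference, a minimal sketch of the backup-and-remove procedure discussed
above, written as a short bash loop. The pool name "cephfs_metadata" and the
backup directory are assumptions -- substitute your actual metadata pool name
and a safe location for the backups:

    # Assumed names; replace with your metadata pool and backup path.
    POOL=cephfs_metadata
    BACKUP_DIR=./506-backup
    mkdir -p "$BACKUP_DIR"

    # Back up every object starting with "506." before touching anything
    # (step B above): one file per object, named after the object.
    for obj in $(rados -p "$POOL" ls | grep '^506\.'); do
        rados -p "$POOL" get "$obj" "$BACKUP_DIR/$obj"
    done

    # Once the backups are confirmed, remove the objects.
    for obj in $(rados -p "$POOL" ls | grep '^506\.'); do
        rados -p "$POOL" rm "$obj"
    done

    # Final check, as suggested above: this listing should come back empty.
    rados -p "$POOL" ls | grep '^506\.'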