On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz <francois at ctrlaltdel.ch> wrote:
> Hi Greg,
>
> Thanks for your support!
>
> On 08. 09. 14 20:20, Gregory Farnum wrote:
>
>> The first one is not caused by the same thing as the ticket you
>> reference (it was fixed well before emperor), so it appears to be some
>> kind of disk corruption.
>> The second one is definitely corruption of some kind as it's missing
>> an OSDMap it thinks it should have. It's possible that you're running
>> into bugs in emperor that were fixed after we stopped doing regular
>> support releases of it, but I'm more concerned that you've got disk
>> corruption in the stores. What kind of crashes did you see previously;
>> are there any relevant messages in dmesg, etc?
>
> Nothing special in dmesg except probably irrelevant XFS warnings:
>
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

Hmm, I'm not sure what the outcome of that could be. Googling for the
error message returns this as the first result, though:
http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429
Which indicates that it's a real deadlock and capable of messing up
your OSDs pretty good.

>
> All logs from before the disaster are still there, do you have any
> advise on what would be relevant?
>
>> Given these issues, you might be best off identifying exactly which
>> PGs are missing, carefully copying them to working OSDs (use the osd
>> store tool), and killing these OSDs. Do lots of backups at each
>> stage...
>
> This sounds scary, I'll keep fingers crossed and will do a bunch of
> backups. There are 17 pg with missing objects.
>
> What do you exactly mean by the osd store tool? Is it the
> 'ceph_filestore_tool' binary?

Yeah, that one.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
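
On the question of which of the saved logs might be relevant: one rough way to gauge how often the XFS warning quoted above fired is to count it in the kernel logs from before the crash. The following is only a sketch. The function name `scan_kernel_log` is made up for illustration, and the pattern assumes the message appears exactly as quoted in this thread (other kernels may word it differently).

```python
import re
from collections import Counter

# Pattern for the warning quoted in this thread. Assumption: the message
# text is exactly as quoted; any timestamp or prefix before it is ignored.
XFS_DEADLOCK = re.compile(
    r"XFS: possible memory allocation deadlock in kmem_alloc "
    r"\(mode:(?P<mode>0x[0-9a-fA-F]+)\)"
)


def scan_kernel_log(lines):
    """Count XFS kmem_alloc deadlock warnings in an iterable of log lines.

    Returns (total, per_mode), where per_mode is a Counter keyed by the
    allocation mode string found in the message, e.g. "0x250".
    """
    per_mode = Counter()
    for line in lines:
        match = XFS_DEADLOCK.search(line)
        if match:
            per_mode[match.group("mode")] += 1
    return sum(per_mode.values()), per_mode
```

It could be fed an open log file, for example `scan_kernel_log(open(path))`, where `path` is wherever the pre-crash kernel log was saved. A high count shortly before the OSD crashes would be consistent with the deadlock theory, but would not prove it.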