On Fri, Sep 19, 2014 at 2:35 AM, Francois Deppierraz <francois at ctrlaltdel.ch> wrote:

> Hi Craig,
>
> I'm planning to completely re-install this cluster with firefly because
> I started to see other OSDs crash with the same trim_object error...

I did lose data because of this, but it was unrelated to the XFS issues.
Luckily, it was only RGW replication state, and not something more important.

I was having issues with OSDs crashing. I'd mark them out, and the problem
would move to a new OSD. I tried the patch in
http://tracker.ceph.com/issues/6101. It worked, but only as long as I ran the
patched binary; when I went back to a stock binary, it started crashing
again. It also spammed the logs with warnings instead of crashing.

The problem PG was in my RGW .$zone.log pool. It's small, so I pulled all of
the objects out of the pool, recreated the pool, and uploaded the objects
again (a rough sketch of what I mean is at the end of this message). That
messed up my replication state, so I'm still sorting that out.

It appears that the code fix in Firefly (http://tracker.ceph.com/issues/7595)
will prevent the problem from happening, but it won't repair an already
corrupted store. I dropped all my snapshots, and disabled new ones, until I
can complete the upgrade.

Rebuilding on Firefly should solve your problem.

> So now, I'm more interested in figuring out exactly why data corruption
> happened in the first place than repairing the cluster.

I'm not entirely sure from reading http://tracker.ceph.com/issues/7595, but
it looks like creating a snapshot occasionally fails to record the correct
snapshot metadata. Later, when the snapshot is removed, the OSD gets confused
and hits an assert.

> Comments in-line.
>
> > This is a problem. It's not necessarily a deadlock. The warning is
> > printed if the XFS memory allocator has to retry more than 100 times
> > when it's trying to allocate memory. It either indicates extremely low
> > memory, or extremely fragmented memory. Either way, your OSDs are
> > sitting there trying to allocate memory instead of doing something
> > useful.
>
> Do you mean that this particular error doesn't imply data corruption but
> only bad OSD performance?

That was my experience. That cluster was pretty much unusable, but I was able
to access all of my data once I got the cluster healthy.

> > By any chance, does your ceph.conf have:
> >     osd mkfs options xfs = -n size=64k
> >
> > If so, you should start planning to remove that arg, and reformat every
> > OSD. Here's a thread where I discuss my (mis)adventures with XFS
> > allocation deadlocks:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html
>
> Yes! Thanks for the details, I'm actually using the puppet-ceph module
> from enovance which indeed uses [1] the '-n size=64k' option when
> formatting a new disk.

I would avoid that option when you rebuild your cluster (see the ceph.conf
fragment at the end of this message). There is a fix for the allocation
deadlock in the 3.14 kernels, but the option isn't really necessary anyway.
It increases the XFS directory block size, which should make directories with
millions of files in them a bit faster. None of my PGs have more than 10
files in a directory: every time a directory gets more than a few files in
it, Ceph creates some subdirectories and splits the contents up.
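
For what it's worth, here is roughly what I mean by pulling the objects out
of the pool and recreating it. This is only a sketch, not the exact commands
I ran; the pool name, pg count, and scratch directory are placeholders, and
deleting a pool throws away its settings along with the bad snapshot state,
so double-check everything before trying it on your cluster:

    # placeholders: substitute your own RGW log pool name and scratch dir
    pool=.us-east.log
    backup=/tmp/log-pool-backup
    mkdir -p "$backup"

    # pull every object out of the problem pool
    rados -p "$pool" ls | while read obj; do
        rados -p "$pool" get "$obj" "$backup/$obj"
    done

    # drop and recreate the pool (this destroys it, bad snapshot state included)
    ceph osd pool delete "$pool" "$pool" --yes-i-really-really-mean-it
    ceph osd pool create "$pool" 8

    # upload the objects again
    ls "$backup" | while read obj; do
        rados -p "$pool" put "$obj" "$backup/$obj"
    done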
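
And this is the ceph.conf fragment in question. The [osd] section shown here
is just illustrative, since in your case the option comes from the
puppet-ceph templates rather than a hand-edited ceph.conf:

    [osd]
        # the option to avoid: 64k XFS directory blocks
        osd mkfs options xfs = -n size=64k

When you rebuild, just drop that line (or the corresponding puppet
parameter) so mkfs.xfs formats the new OSDs with its default directory block
size. Keep in mind that mkfs options only apply when an OSD is formatted, so
changing the config does nothing for OSDs that already exist; that's why the
thread I linked above talks about reformatting every OSD.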