Hi Craig,

I'm planning to completely re-install this cluster with firefly because I
started to see other OSDs crash with the same trim_object error... So now,
I'm more interested in figuring out exactly why data corruption happened in
the first place than in repairing the cluster.

Comments in-line.

On 16. 09. 14 23:53, Craig Lewis wrote:
> On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz
> <francois at ctrlaltdel.ch <mailto:francois at ctrlaltdel.ch>> wrote:
>
>     XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>
>     All logs from before the disaster are still there, do you have any
>     advice on what would be relevant?
>
> This is a problem. It's not necessarily a deadlock. The warning is
> printed if the XFS memory allocator has to retry more than 100 times
> when it's trying to allocate memory. It either indicates extremely low
> memory, or extremely fragmented memory. Either way, your OSDs are
> sitting there trying to allocate memory instead of doing something useful.

Do you mean that this particular error doesn't imply data corruption but
only bad OSD performance?

> By any chance, does your ceph.conf have:
> osd mkfs options xfs = -n size=64k
>
> If so, you should start planning to remove that arg, and reformat every
> OSD. Here's a thread where I discuss my (mis)adventures with XFS
> allocation deadlocks:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html

Yes! Thanks for the details. I'm actually using the puppet-ceph module
from enovance which indeed uses [1] the '-n size=64k' option when
formatting a new disk.

François

[1] https://github.com/enovance/puppet-ceph/blob/master/manifests/osd/device.pp#L44
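
P.S. For what it's worth, here is roughly what I plan to do differently
when re-installing. This is just a sketch following your suggestion; the
exact device path is a placeholder and I still need to check how the
puppet module passes the mkfs options:

    # ceph.conf -- drop the custom XFS directory block size so mkfs.xfs
    # falls back to its 4k default instead of 64k
    [osd]
    # osd mkfs options xfs = -n size=64k    <- remove / don't set this
    osd mkfs options xfs = -f

    # With that, a new OSD disk should end up being formatted with
    # something like:
    #     mkfs.xfs -f /dev/sdX1
    # instead of:
    #     mkfs.xfs -f -n size=64k /dev/sdX1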