Hi Craig,

I'm planning to completely re-install this cluster with firefly because I
started to see other OSDs crash with the same trim_object error... So now,
I'm more interested in figuring out exactly why data corruption happened in
the first place than in repairing the cluster.

Comments in-line.

On 16. 09. 14 23:53, Craig Lewis wrote:
> On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz
> <francois at ctrlaltdel.ch <mailto:francois at ctrlaltdel.ch>> wrote:
>
>     XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>
>     All logs from before the disaster are still there, do you have any
>     advice on what would be relevant?
>
> This is a problem. It's not necessarily a deadlock. The warning is
> printed if the XFS memory allocator has to retry more than 100 times
> when it's trying to allocate memory. It either indicates extremely low
> memory, or extremely fragmented memory. Either way, your OSDs are
> sitting there trying to allocate memory instead of doing something useful.

Do you mean that this particular error doesn't imply data corruption but
only bad OSD performance?

> By any chance, does your ceph.conf have:
> osd mkfs options xfs = -n size=64k
>
> If so, you should start planning to remove that arg, and reformat every
> OSD. Here's a thread where I discuss my (mis)adventures with XFS
> allocation deadlocks:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html

Yes! Thanks for the details. I'm actually using the puppet-ceph module
from enovance which indeed uses [1] the '-n size=64k' option when
formatting a new disk.

François

[1] https://github.com/enovance/puppet-ceph/blob/master/manifests/osd/device.pp#L44
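
P.S. For what it's worth, here is roughly what I plan to do differently
when re-installing. This is just a sketch following your suggestion; the
exact device path is a placeholder and I still need to check how the
puppet module passes the mkfs options:

    # ceph.conf -- drop the custom XFS directory block size so mkfs.xfs
    # falls back to its 4k default instead of 64k
    [osd]
    # osd mkfs options xfs = -n size=64k    <- remove / don't set this
    osd mkfs options xfs = -f

    # With that, a new OSD disk should end up being formatted with
    # something like:
    #     mkfs.xfs -f /dev/sdX1
    # instead of:
    #     mkfs.xfs -f -n size=64k /dev/sdX1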