On Fri, Sep 19, 2014 at 2:35 AM, Francois Deppierraz <francois at ctrlaltdel.ch> wrote:

> Hi Craig,
>
> I'm planning to completely re-install this cluster with firefly because
> I started to see other OSDs crash with the same trim_object error...

I did lose data because of this, but it was unrelated to the XFS issues.
Luckily, it was only RGW replication state, and not something more important.

I was having issues with OSDs crashing. I'd mark them out, and the problem
would move to a new OSD. I tried the patch in
http://tracker.ceph.com/issues/6101. It worked, but only as long as I ran the
patched binary; when I went back to a stock binary, it started crashing
again. It also spammed the logs with warnings instead of crashing.

The problem PG was in my RGW .$zone.log pool. It's small, so I pulled all of
the objects out of the pool, recreated the pool, and uploaded the objects
again (a rough sketch of what I mean is at the end of this message). That
messed up my replication state, so I'm still sorting that out.

It appears that the code fix in Firefly (http://tracker.ceph.com/issues/7595)
will prevent the problem from happening, but it won't repair an already
corrupted store. I dropped all my snapshots, and disabled new ones, until I
can complete the upgrade.

Rebuilding on Firefly should solve your problem.

> So now, I'm more interested in figuring out exactly why data corruption
> happened in the first place than repairing the cluster.

I'm not entirely sure from reading http://tracker.ceph.com/issues/7595, but
it looks like creating a snapshot occasionally fails to record the correct
snapshot metadata. Later, when the snapshot is removed, the OSD gets confused
and hits an assert.

> Comments in-line.
>
> > This is a problem. It's not necessarily a deadlock. The warning is
> > printed if the XFS memory allocator has to retry more than 100 times
> > when it's trying to allocate memory. It either indicates extremely low
> > memory, or extremely fragmented memory. Either way, your OSDs are
> > sitting there trying to allocate memory instead of doing something
> > useful.
>
> Do you mean that this particular error doesn't imply data corruption but
> only bad OSD performance?

That was my experience. That cluster was pretty much unusable, but I was able
to access all of my data once I got the cluster healthy.

> > By any chance, does your ceph.conf have:
> >     osd mkfs options xfs = -n size=64k
> >
> > If so, you should start planning to remove that arg, and reformat every
> > OSD. Here's a thread where I discuss my (mis)adventures with XFS
> > allocation deadlocks:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041336.html
>
> Yes! Thanks for the details, I'm actually using the puppet-ceph module
> from enovance which indeed uses [1] the '-n size=64k' option when
> formatting a new disk.

I would avoid that option when you rebuild your cluster (see the ceph.conf
fragment at the end of this message). There is a fix for the allocation
deadlock in the 3.14 kernels, but the option isn't really necessary anyway.
It increases the XFS directory block size, which should make directories with
millions of files in them a bit faster. None of my PGs have more than 10
files in a directory: every time a directory gets more than a few files in
it, Ceph creates some subdirectories and splits the contents up.
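
For what it's worth, here is roughly what I mean by pulling the objects out
of the pool and recreating it. This is only a sketch, not the exact commands
I ran; the pool name, pg count, and scratch directory are placeholders, and
deleting a pool throws away its settings along with the bad snapshot state,
so double-check everything before trying it on your cluster:

    # placeholders: substitute your own RGW log pool name and scratch dir
    pool=.us-east.log
    backup=/tmp/log-pool-backup
    mkdir -p "$backup"

    # pull every object out of the problem pool
    rados -p "$pool" ls | while read obj; do
        rados -p "$pool" get "$obj" "$backup/$obj"
    done

    # drop and recreate the pool (this destroys it, bad snapshot state included)
    ceph osd pool delete "$pool" "$pool" --yes-i-really-really-mean-it
    ceph osd pool create "$pool" 8

    # upload the objects again
    ls "$backup" | while read obj; do
        rados -p "$pool" put "$obj" "$backup/$obj"
    done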
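
And this is the ceph.conf fragment in question. The [osd] section shown here
is just illustrative, since in your case the option comes from the
puppet-ceph templates rather than a hand-edited ceph.conf:

    [osd]
        # the option to avoid: 64k XFS directory blocks
        osd mkfs options xfs = -n size=64k

When you rebuild, just drop that line (or the corresponding puppet
parameter) so mkfs.xfs formats the new OSDs with its default directory block
size. Keep in mind that mkfs options only apply when an OSD is formatted, so
changing the config does nothing for OSDs that already exist; that's why the
thread I linked above talks about reformatting every OSD.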