osd crash: trim_objectcould not find coid

On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz
<francois at ctrlaltdel.ch> wrote:
> Hi,
>
> This issue is on a small 2 servers (44 osds) ceph cluster running 0.72.2
> under Ubuntu 12.04. The cluster was filling up (a few osds near full)
> and I tried to increase the number of pg per pool to 1024 for each of
> the 14 pools to improve storage space balancing. This increase triggered
> high memory usage on the servers which were unfortunately
> under-provisioned (16 GB RAM for 22 osds) and started to swap and crash.
>
> After installing memory into the servers, the result is a broken cluster
> with unfound objects and two osds (osd.6 and osd.43) crashing at startup.
>
> $ ceph health
> HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs
> backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs
> peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive;
> 1856 pgs stuck unclean; 164 requests are blocked > 32 sec; recovery
> 517735/15915673 objects degraded (3.253%); 1241/7910367 unfound
> (0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set
>
> osd.6 is crashing due to an assertion ("trim_objectcould not find coid")
> which matches a resolved bug report that unfortunately doesn't give
> any advice on how to repair the osd.
>
> http://tracker.ceph.com/issues/5473
>
> It is much less obvious why osd.43 is crashing, please have a look at
> the following osd logs:
>
> http://paste.ubuntu.com/8288607/
> http://paste.ubuntu.com/8288609/
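
[Editor's note: the pg_num increase described above would have been done roughly as follows. This is a sketch; "rbd" is an illustrative pool name, repeated for each of the 14 pools, and on a dumpling/emperor-era cluster pgp_num has to be raised separately after pg_num before data actually rebalances.]

```shell
# Split the pool's placement groups to 1024 (pool name is illustrative)
ceph osd pool set rbd pg_num 1024

# pgp_num must be raised as well, or no rebalancing takes place
ceph osd pool set rbd pgp_num 1024
```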

The first one is not caused by the same thing as the ticket you
reference (it was fixed well before emperor), so it appears to be some
kind of disk corruption.
The second one is definitely corruption of some kind as it's missing
an OSDMap it thinks it should have. It's possible that you're running
into bugs in emperor that were fixed after we stopped doing regular
support releases of it, but I'm more concerned that you've got disk
corruption in the stores. What kind of crashes did you see previously;
are there any relevant messages in dmesg, etc?

Given these issues, you might be best off identifying exactly which
PGs are missing, carefully copying them to working OSDs (use the osd
store tool), and killing these OSDs. Do lots of backups at each
stage...
-Greg
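
[Editor's note: the recovery Greg outlines can be sketched roughly as below. It assumes the "osd store tool" is ceph-objectstore-tool (shipped under older names such as ceph_filestore_dump in early releases, so the exact binary on 0.72 may differ); the paths, OSD ids, and pgid 2.1f are placeholders, and both source and destination OSDs must be stopped.]

```shell
# 1. Identify the PGs with unfound/missing objects
ceph health detail | grep unfound

# 2. With the broken OSD stopped, export an affected PG from its store
#    (data path, journal path, and pgid are placeholders)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
    --journal-path /var/lib/ceph/osd/ceph-6/journal \
    --op export --pgid 2.1f --file /backup/pg.2.1f.export

# 3. Import it into a healthy (also stopped) OSD, then restart that OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --journal-path /var/lib/ceph/osd/ceph-12/journal \
    --op import --file /backup/pg.2.1f.export
```

As Greg says, take backups of the OSD stores before each step; the export files are also worth keeping until the cluster is healthy again.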
