osd crash: trim_objectcould not find coid

Hi,

This issue is on a small two-server (44 osds) Ceph cluster running 0.72.2
under Ubuntu 12.04. The cluster was filling up (a few osds near full),
so I tried to increase the number of pgs per pool to 1024 for each of
the 14 pools to improve storage space balancing. The increase triggered
high memory usage on the servers, which were unfortunately
under-provisioned (16 GB of RAM for 22 osds each), and they started to
swap and crash.
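
For reference, the resize was done with the usual pool commands, roughly
along these lines (<pool> standing for each of the 14 pool names):

$ ceph osd pool set <pool> pg_num 1024
$ ceph osd pool set <pool> pgp_num 1024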

After adding more memory to the servers, the result is a broken cluster
with unfound objects and two osds (osd.6 and osd.43) that crash at startup.

$ ceph health
HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs
backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs
peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive;
1856 pgs stuck unclean; 164 requests are blocked > 32 sec; recovery
517735/15915673 objects degraded (3.253%); 1241/7910367 unfound
(0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set
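
In case it is useful for drilling down, the per-pg detail and the unfound
objects can be listed with something like this (<pgid> being one of the
affected placement groups):

$ ceph health detail | grep unfound
$ ceph pg <pgid> list_missing
$ ceph pg <pgid> query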

osd.6 is crashing on an assertion ("trim_objectcould not find coid")
that matches a resolved bug report, which unfortunately doesn't give
any advice on how to repair the osd:

http://tracker.ceph.com/issues/5473

It is much less obvious why osd.43 is crashing; please have a look at
the following osd logs:

http://paste.ubuntu.com/8288607/
http://paste.ubuntu.com/8288609/
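
If more verbose output would help, I can re-run osd.43 in the foreground
with higher debug levels, for example something like:

$ ceph-osd -i 43 -f --debug-osd 20 --debug-ms 1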

Any advice on how to repair both osds and recover the unfound objects
would be more than welcome.

Thanks!

François


