HEALTH_ERR on OSD full

Hi,

I have a Ceph cluster running 0.80.1 with 80 OSDs.

I've had a fairly uneven distribution of data and have been keeping things
ticking along with "ceph osd reweight XX 0.x" commands on a few OSDs while
I try to increase the pg count of the pools to hopefully balance the data
better.
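
For reference, this is roughly what I've been running (the OSD ID and pool
name here are just placeholders):

    # nudge data off an overly full OSD (weight between 0 and 1)
    ceph osd reweight 12 0.85

    # raise the placement group count on a pool to spread data more evenly
    # (pgp_num needs to follow pg_num before rebalancing actually happens)
    ceph osd pool set rbd pg_num 1024
    ceph osd pool set rbd pgp_num 1024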

Tonight, one of the OSDs filled up to 95% and so was marked as "full".

This caused the whole cluster to be flagged as "full", and the server mapping
the RBDs hit a loadavg of over 800.  That server was rebooted, but I was then
unable to map any RBDs.
I've tweaked the reweight of the "full" OSD down, and it is now only "near
full".
As soon as that OSD changed state to "near full", the cluster status changed
to HEALTH_WARN and I was able to map RBDs again.
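
If it matters, I believe the thresholds involved are the mon full/nearfull
ratios (defaults of 0.95 and 0.85, as far as I know). Something like:

    # show which OSDs are flagged full / near full
    ceph health detail

    # overall cluster and per-pool usage
    ceph df

    # temporarily raise the full threshold for some breathing room
    # (I haven't actually done this; just noting it as an option)
    ceph pg set_full_ratio 0.97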

I was under the impression that a full OSD would just prevent data from being
written to that OSD, not cause the near-catastrophic cluster unavailability
I've experienced.

The cluster is only around 65% full overall, so there is plenty of space
on the other OSDs.

Can anyone please clarify whether this behaviour is expected?

Regards

J