On Tue, Dec 23, 2014 at 3:34 AM, Max Power <maillists@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
I understand that the status "osd full" should never be reached. As I am new to
Ceph, I want to be prepared for this case. I tried two different scenarios and
here are my experiences:
For a real cluster, you should be monitoring disk usage and taking immediate action as soon as any OSD reaches the nearfull state. Waiting until OSDs are toofull is too late.
For a test cluster, it's a great learning experience. :-)
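If it helps, a quick way to keep an eye on this (assuming you run it from a node with the admin keyring) is to watch `ceph df` and `ceph health detail`; the warnings are driven by mon_osd_nearfull_ratio (0.85 by default) and mon_osd_full_ratio (0.95 by default):

  # overall cluster and per-pool usage
  ceph df
  # lists any OSDs that are currently nearfull or full
  ceph health detail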
The first one is to completely fill the storage (for me: writing files to a
rados block device). I discovered that the writing client (dd, for example) gets
completely stuck then, and this prevents me from stopping the process (SIGTERM,
SIGKILL). At the moment I restart the whole computer to prevent writing to the
cluster. Then I unmap the rbd device and set the full ratio a bit higher (0.95
to 0.97). I do a mount on my admin node and delete files until everything is okay
again.
Is this the best practice?
It is a design feature of Ceph that all cluster reads and writes stop until the toofull situation is resolved.
The route you took is one of the two ways to recover; you found the other route in your replica test.
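For reference, the unmap-and-raise-the-ratio route you describe looks roughly like this (/dev/rbd0 stands in for whatever device you mapped, and the full-ratio syntax may differ on your release, so treat this as a sketch rather than a recipe):

  # unmap the image on the client so nothing can keep writing
  rbd unmap /dev/rbd0
  # temporarily raise the full ratio
  ceph pg set_full_ratio 0.97
  # mount the image on the admin node, delete data, then put the ratio back
  ceph pg set_full_ratio 0.95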
Is it possible to prevent the system from running into
an "osd full" state? I could make the block devices smaller than the cluster can
store. But it's hard to calculate this exactly.
If you continue to add data to the cluster after it's nearfull, then you're going to hit toofull.
Once you hit nearfull, you need to delete existing data, or add more OSDs.
You've probably noticed that some OSDs are using more space than others. You can try to even them out with `ceph osd reweight` or `ceph osd crush reweight`, but that's a delaying tactic. When I hit nearfull, I place an order for new hardware, then use `ceph osd reweight` until it arrives.
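As a sketch (osd.7 and the weight values are only placeholders; pick the real ones from your `ceph osd tree` output):

  # override weight between 0.0 and 1.0; CRUSH sends the OSD proportionally less data
  ceph osd reweight 7 0.9
  # or change the CRUSH weight itself (normally the disk size in TB); this moves more data around
  ceph osd crush reweight osd.7 1.6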
The next scenario is to change a pool size from, say, 2 to 3 replicas. While the
cluster copies the objects, it gets stuck as an OSD reaches its limit. Normally
the OSD process quits then and I cannot restart it (even after setting the
replicas back). The only possibility is to manually delete complete PG folders
after exploring them with 'pg dump'. Is this the only way to get it back working
again?
There are some other configs that might have come into play here. You might have run into osd_failsafe_nearfull_ratio or osd_failsafe_full_ratio. You could try bumping those up a bit, and see if that lets the process stay up long enough to start reducing replicas.
Since osd_failsafe_full_ratio is already 0.97, I wouldn't take it any higher than 0.98. Ceph triggers on "greater-than" percentages, so 0.99 will let you fill a disk to 100% full. If you get a disk to 100% full, the only way to clean up is to start deleting PG directories.
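If you do want to try bumping the failsafe ratios, something like this should do it; take it as an assumption on my part rather than a tested recipe:

  # runtime change, only reaches OSDs that are still up
  ceph tell osd.* injectargs '--osd_failsafe_nearfull_ratio 0.92 --osd_failsafe_full_ratio 0.98'
  # for an OSD that refuses to start, set it in ceph.conf under [osd] and restart:
  #   osd failsafe full ratio = 0.98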