locked up cluster while recovering OSD

Hi,

we have a Ceph cluster with:
- 12 OSDs on 6 physical nodes (2 OSDs per node), 64 GB RAM
- each OSD has a 6 TB spinning disk and a 10 GB journal in RAM (tmpfs) [1]
- 3 redundant copies
- 25% space usage so far
- Ceph 0.94.2 (Hammer).
- data is stored via radosgw, using sharded bucket indexes (64 shards).
- 500 PGs per node (we plan to scale out by adding nodes, without adding more pools in the future); see the quick arithmetic below.
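
To put that in per-OSD terms, here is a small Python back-of-the-envelope sketch; the assumption that "500 PGs per node" means roughly 500 PG replicas hosted on each node, spread over its 2 OSDs, is mine:

# Rough PG math for our layout (assumption: "500 PGs per node" means
# ~500 PG replicas hosted on each node, spread over its 2 OSDs).
nodes = 6
osds_per_node = 2
pg_replicas_per_node = 500
replication = 3

pg_replicas_total = nodes * pg_replicas_per_node     # ~3000 PG replicas cluster-wide
pgs_total = pg_replicas_total // replication         # ~1000 distinct PGs
pgs_per_osd = pg_replicas_per_node // osds_per_node  # ~250 PG replicas per OSD

print(pgs_total, pgs_per_osd)                        # -> 1000 250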

We currently have a constant write load: about 60 PUTs per second of small objects, usually a few KB each, occasionally up to a few MB.
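
For context, the load is roughly the kind of thing this hypothetical boto3 sketch would generate against the radosgw S3 endpoint; the endpoint, credentials and bucket name are placeholders, not our real ones:

import os
import time
import boto3

# Hypothetical reproduction of our load: many small-object PUTs per second
# against radosgw's S3 API. Endpoint, credentials and bucket are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

i = 0
while True:
    body = os.urandom(4096)                                # "a few KB" per object
    s3.put_object(Bucket="testbucket", Key="obj-%d" % i, Body=body)
    i += 1
    time.sleep(1.0 / 60)                                   # crude pacing towards ~60 PUTs/s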

If I restart an OSD, most operations seem to get stuck for up to several minutes until the OSD has finished recovering.
(noout is set, but I understand it makes no difference here, because the OSD is down for less than 5 minutes anyway.)

Most of the "slow request" messages gave one of the following reasons:
- currently waiting for rw locks
- currently waiting for missing object
- currently waiting for degraded object

The operations themselves were:
- [call rgw.bucket_prepare_op] ... ondisk+write+known_if_redirected
- [call rgw.bucket_complete_op] ... ondisk+write+known_if_redirected
 
operating mostly on the bucket index shard objects.
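
In case it is useful, this is roughly how I tally the reasons: a small Python sketch that scans an OSD log for the "currently waiting for ..." part of the slow request lines (it assumes the usual "slow request ... currently waiting for ..." wording; the log path is passed on the command line):

import re
import sys
from collections import Counter

# Tally the "currently waiting for ..." reasons from slow request lines
# in a ceph-osd log file (path given as the first argument).
pattern = re.compile(r"slow request .*currently (waiting for [a-z ]+)")
reasons = Counter()

with open(sys.argv[1]) as log:
    for line in log:
        m = pattern.search(line)
        if m:
            reasons[m.group(1)] += 1

for reason, count in reasons.most_common():
    print("%6d  %s" % (count, reason))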

The monitors and gateways look completely idle.
The OSDs, on the other hand, look very busy: disk I/O is intense, with an average write completion time around 300 ms and disk I/O utilization around 50%.
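
For completeness, this is roughly how those numbers can be derived: a Python sketch that samples /proc/diskstats twice ("sdb" is a placeholder for the OSD data disk):

import time

DEV = "sdb"          # placeholder: the OSD's data disk
INTERVAL = 10.0      # seconds between the two samples

def sample(dev):
    # /proc/diskstats fields (0-based after split):
    # [7] writes completed, [10] ms spent writing, [12] ms spent doing I/O
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[7]), int(fields[10]), int(fields[12])
    raise RuntimeError("device %s not found" % dev)

w1, wt1, io1 = sample(DEV)
time.sleep(INTERVAL)
w2, wt2, io2 = sample(DEV)

writes = w2 - w1
print("avg write completion: %.1f ms" % (float(wt2 - wt1) / max(writes, 1)))
print("utilization: %.1f %%" % (100.0 * (io2 - io1) / (INTERVAL * 1000)))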

It looks to me like the storage layer needs to be improved (a RAID controller with a big write-back cache, maybe?).
However, I do not understand exactly what is going wrong here.
I would expect operations to keep being served as before, writing either to the primary PG or to the replica, with the PGs recovering in the background.
Do you have any ideas?
What path would you follow to understand what the problem is?
I am happy to provide more logs if that helps.

Thanks in advance for any help,
Ludovico

[1] We had to disable filestore_fadvise; otherwise two threads per OSD would get stuck at 100% CPU moving pages from RAM (presumably the journal) to swap.
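
(If anyone wants to double-check the same setting on their own OSDs, a small sketch that reads it back through the admin socket; it assumes the option name filestore_fadvise and the usual JSON output of "config get":)

import json
import subprocess

# Read the filestore_fadvise setting back from a running OSD via its
# admin socket; assumes "ceph daemon osd.N config get" returns JSON.
out = subprocess.check_output(
    ["ceph", "daemon", "osd.0", "config", "get", "filestore_fadvise"]
)
print(json.loads(out))   # e.g. {"filestore_fadvise": "false"}
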
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
