Slow requests when deleting rbd snapshots

Eino Tuominen <eino@xxxxxx> · Sat, 04 Jul 2015 08:18:50 +0000

Hello,

We are running 0.80.5 on our production cluster, and we are seeing slow requests when deleting rbd snapshots. We have now reduced the snapshot count to 4 weeklies, but the snapshot count does not seem to be a factor in this problem. The cluster becomes practically unresponsive for so long that clients time out.

Here are the top ten slowest requests per OSD from last night (times in seconds):

1	/var/log/ceph/ceph-osd.46.log	1920
2	/var/log/ceph/ceph-osd.42.log	1455
3	/var/log/ceph/ceph-osd.74.log	1292
4	/var/log/ceph/ceph-osd.77.log	1170
5	/var/log/ceph/ceph-osd.48.log	1083
6	/var/log/ceph/ceph-osd.0.log	960
7	/var/log/ceph/ceph-osd.40.log	960
8	/var/log/ceph/ceph-osd.57.log	960
9	/var/log/ceph/ceph-osd.61.log	960
10	/var/log/ceph/ceph-osd.76.log	960

Some OSDs don't report slow requests at all; the slow requests are not evenly distributed across the cluster.
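
For anyone who wants to produce a similar ranking, here is a rough sketch in Python. It assumes the OSD logs contain warnings of the form "slow request <N> seconds old", which may differ between Ceph versions. The function names slowest_request and rank_logs are made up for this illustration, and this is not necessarily how the list above was generated.

```python
import glob
import re

# Assumption: slow-request warnings look like
#   "... slow request 30.268432 seconds old, received at ..."
SLOW_RE = re.compile(r"slow request (\d+(?:\.\d+)?) seconds old")


def slowest_request(lines):
    """Return the largest slow-request age (seconds) in lines, or None."""
    worst = None
    for line in lines:
        match = SLOW_RE.search(line)
        if match:
            age = float(match.group(1))
            if worst is None or age > worst:
                worst = age
    return worst


def rank_logs(worst_by_log, n=10):
    """Sort logs by their worst age, slowest first, skipping logs with none."""
    found = [(name, age) for name, age in worst_by_log.items() if age is not None]
    return sorted(found, key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    worst_by_log = {}
    # Stream each log so large files are not loaded into memory at once.
    for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):
        with open(path, errors="replace") as handle:
            worst_by_log[path] = slowest_request(handle)
    for rank, (path, age) in enumerate(rank_logs(worst_by_log), start=1):
        print(f"{rank}\t{path}\t{age:.0f}")
```

OSDs whose logs contain no matching warning are left out of the ranking, which is how the uneven distribution shows up.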

Currently we run the journals on the OSD SATA drives, but we are considering upgrading to SSD journals. However, we do not have any performance problems other than when deleting snapshots.

Is there any way to mitigate the problem other than investing in SSD journals?

-- 
  Eino Tuominen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com