On Sun, Dec 8, 2013 at 7:16 AM, Oliver Schulz <oschulz@xxxxxxxxxx> wrote:
> Hello Ceph-Gurus,
>
> a short while ago I reported some trouble we had with our cluster
> suddenly going into a state of "blocked requests".
>
> We did a few tests, and we can reproduce the problem:
> During / after deleting of a substantial chunk of data on
> CephFS (a few TB), ceph health shows blocked requests like
>
> HEALTH_WARN 222 requests are blocked > 32 sec
>
> This goes on for a couple of minutes, during which the cluster is
> pretty much unusable. The number of blocked requests jumps around
> (but seems to go down on average), until finally (after about 15
> minutes in my last test) health is back to OK.
>
> I upgraded the cluster to Ceph emperor (0.72.1) and repeated the
> test, but the problem persists.
>
> Is this normal - and if not, what might be the reason? Obviously,
> having the cluster go on strike for a while after data deletion
> is a bit of a problem, especially with a mixed application load.
> The VM's running on RBDs aren't too happy about it, for example. ;-)

Nobody's reported it before, but I think the CephFS MDS is sending out
too many delete requests. When you delete something in CephFS, it's
just marked as deleted, and the MDS is supposed to remove the underlying
objects asynchronously in the background, but I'm not sure if there are
any throttles on how quickly it does so. If you remove several
terabytes' worth of data, and the MDS is sending out RADOS object
deletes for each 4MB object as fast as it can, that's a lot of
unthrottled traffic on the OSDs.

That's all speculation on my part though; can you go sample the slow
requests and see what their makeup looks like? Do you have logs from
the MDS or OSDs during that time period?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
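
To put a rough number on "a lot of traffic": the sketch below estimates how many RADOS object deletes a large CephFS removal could generate. It assumes the 4MB object size mentioned above and that all of the deleted data sits in full-size objects, which will not hold for sparse or small files. The function names and the 15-minute drain window (borrowed from the recovery time in the report) are illustrative assumptions, not measurements from this cluster.

```python
# Back-of-the-envelope estimate of the RADOS object deletes produced by
# removing a large amount of CephFS data. Assumptions (not measured):
#   - 4 MiB per RADOS object, as mentioned in the reply above
#   - every deleted byte lives in a full-size object

OBJECT_SIZE_BYTES = 4 * 2**20  # 4 MiB per object (assumed layout)


def estimated_object_deletes(deleted_bytes, object_size=OBJECT_SIZE_BYTES):
    """Number of objects needed to hold deleted_bytes, rounded up."""
    return -(-deleted_bytes // object_size)  # ceiling division


def deletes_per_second(deleted_bytes, drain_seconds):
    """Average delete rate if the whole backlog drained in drain_seconds.

    drain_seconds is a hypothetical window; the real MDS may spread the
    deletes very differently.
    """
    return estimated_object_deletes(deleted_bytes) / drain_seconds


if __name__ == "__main__":
    for tib in (1, 3, 5):
        count = estimated_object_deletes(tib * 2**40)
        rate = deletes_per_second(tib * 2**40, 15 * 60)
        print(f"{tib} TiB -> {count} object deletes, "
              f"~{rate:.0f}/s if drained over 15 minutes")
```

Under these assumptions, each TiB removed corresponds to 262144 object deletes, so a few TB quickly reaches the high hundreds of thousands of delete operations hitting the OSDs.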