Hello,
I am running the following:
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
ubuntu 14.04 with kernel 3.19.0-49-generic #55~14.04.1-Ubuntu SMP
For this use case I am mapping and mounting an RBD image using the kernel client and exporting the ext4 filesystem on it via NFS to a number of clients.
Once or twice a week we've seen disk I/O get "stuck" or "blocked" on the rbd device. When this happens, iostat shows avgqu-sz holding at a constant value with utilization at 100%. All I/O operations via NFS block, though I am able to traverse the filesystem locally on the NFS server and read/write data. If I wait long enough the device eventually recovers and avgqu-sz drops to zero.
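For anyone wanting to watch for this condition without staring at iostat, here is a minimal sketch that reads the in-flight I/O count straight from sysfs. It assumes the 11-field layout of /sys/block/&lt;dev&gt;/stat on a 3.x kernel, where in_flight is the 9th field; the device name rbd0 is just an example.

```python
def in_flight(stat_line):
    """Return the number of I/Os currently in flight for a block device,
    given the contents of /sys/block/<dev>/stat (11-field layout,
    in_flight at index 8)."""
    fields = stat_line.split()
    return int(fields[8])

# Example usage -- poll the rbd device; a value that stays large and
# constant for minutes matches the stuck-queue symptom described above:
# with open('/sys/block/rbd0/stat') as f:
#     print(in_flight(f.read()))
```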
The only issue I could find that looked similar is http://tracker.ceph.com/issues/8818 - however, I am not seeing the error messages described there, and I am running a more recent kernel that should contain the fix from that issue, so I assume this is likely a different problem.
The Ceph cluster reported healthy the entire time: all PGs were up and in, no scrubbing was in progress, and there were no OSD failures or anything like that.
I ran echo t > /proc/sysrq-trigger and the output is here: https://gist.github.com/anonymous/89c305443080149e9f45
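In case it helps others reading the gist: when digging through sysrq-t output for this kind of hang, the interesting tasks are the ones in uninterruptible sleep (state "D"). A rough filter, run here against a hypothetical two-line excerpt (the task names, addresses, and column layout are illustrative, not taken from my actual dump):

```shell
# Hypothetical excerpt of sysrq-t task-dump lines from dmesg; the second
# column is the task state. "D" means uninterruptible sleep (blocked I/O).
sample='nfsd            D ffff880443d07b08     0  2458      2 0x00000000
kworker/3:2     R ffff880443d07b08     0  1234      2 0x00000000'

# Keep only the names of tasks blocked in D state.
blocked=$(printf '%s\n' "$sample" | awk '$2 == "D" { print $1 }')
echo "$blocked"
```

The same awk filter can be piped from `dmesg` directly after triggering sysrq-t.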
Any ideas on what could be going on here? Any additional information I can provide?
Thanks,
Randy Orr
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com