Hello,
I am running the following:
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
ubuntu 14.04 with kernel 3.19.0-49-generic #55~14.04.1-Ubuntu SMP
For this use case I am mapping and mounting an RBD image using the kernel client and exporting the ext4 filesystem on it via NFS to a number of clients.
Once or twice a week we've seen disk I/O get "stuck" or "blocked" on the rbd device. When this happens, iostat shows avgqu-sz holding at a constant value with utilization at 100%. All I/O operations via NFS block, though I am able to traverse the filesystem locally on the NFS server and read/write data. If I wait long enough the device eventually recovers and avgqu-sz drops to zero.
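For anyone wanting to watch for this condition without staring at iostat, here is a minimal sketch that reads the in-flight I/O count straight from sysfs. It assumes the 11-field layout of /sys/block/&lt;dev&gt;/stat on a 3.x kernel, where in_flight is the 9th field; the device name rbd0 is just an example.

```python
def in_flight(stat_line):
    """Return the number of I/Os currently in flight for a block device,
    given the contents of /sys/block/<dev>/stat (11-field layout,
    in_flight at index 8)."""
    fields = stat_line.split()
    return int(fields[8])

# Example usage -- poll the rbd device; a value that stays large and
# constant for minutes matches the stuck-queue symptom described above:
# with open('/sys/block/rbd0/stat') as f:
#     print(in_flight(f.read()))
```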
The only issue I could find that looked similar is http://tracker.ceph.com/issues/8818 - however, I am not seeing the error messages described there, and I am running a more recent kernel that should contain the fix from that issue, so I assume this is likely a different problem.
The Ceph cluster reported healthy the entire time: all PGs were up and in, no scrubbing was in progress, and there were no OSD failures or anything like that.
I ran echo t > /proc/sysrq-trigger and the output is here: https://gist.github.com/anonymous/89c305443080149e9f45
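In case it helps others reading the gist: when digging through sysrq-t output for this kind of hang, the interesting tasks are the ones in uninterruptible sleep (state "D"). A rough filter, run here against a hypothetical two-line excerpt (the task names, addresses, and column layout are illustrative, not taken from my actual dump):

```shell
# Hypothetical excerpt of sysrq-t task-dump lines from dmesg; the second
# column is the task state. "D" means uninterruptible sleep (blocked I/O).
sample='nfsd            D ffff880443d07b08     0  2458      2 0x00000000
kworker/3:2     R ffff880443d07b08     0  1234      2 0x00000000'

# Keep only the names of tasks blocked in D state.
blocked=$(printf '%s\n' "$sample" | awk '$2 == "D" { print $1 }')
echo "$blocked"
```

The same awk filter can be piped from `dmesg` directly after triggering sysrq-t.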
Any ideas on what could be going on here? Any additional information I can provide?
Thanks,
Randy Orr
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com