Re: blocked i/o on rbd device

Randy Orr <randy.orr@xxxxxxxxxx> · Wed, 2 Mar 2016 06:58:38 -0600

Ilya,
That's great, thank you. I will certainly try the updated kernel when available. Do you have pointers to the two bugs in question? 

Jan,

We have tried nfs export as both sync and async and have seen the issue using both options. I have seen this on hosts with 24G of mem and 128G of mem.

Thanks,
Randy

On Wednesday, March 2, 2016, Jan Schermer <jan@xxxxxxxxxxx> wrote:
Are you exporting (or mounting) the NFS as async or sync?

How much memory does the server have?

Jan

> On 02 Mar 2016, at 12:54, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:

>

> Ilya,

>

>> We've recently fixed two major long-standing bugs in this area.

>

> If you could elaborate more, it would be reasonable for the community.

> Is there any pointer?

>

> Cheers,

> Shinobu

>

> ----- Original Message -----

> From: "Ilya Dryomov" <idryomov@xxxxxxxxx>

> To: "Randy Orr" <randy.orr@xxxxxxxxxx>

> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>

> Sent: Wednesday, March 2, 2016 8:40:42 PM

> Subject: Re:  blocked i/o on rbd device

>

> On Tue, Mar 1, 2016 at 10:57 PM, Randy Orr <randy.orr@xxxxxxxxxx> wrote:

>> Hello,

>>

>> I am running the following:

>>

>> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

>> ubuntu 14.04 with kernel 3.19.0-49-generic #55~14.04.1-Ubuntu SMP

>>

>> For this use case I am mapping and mounting an rbd using the kernel client

>> and exporting the ext4 filesystem via NFS to a number of clients.

>>

>> Once or twice a week we've seen disk io "stuck" or "blocked" on the rbd

>> device. When this happens iostat shows avgqu-sz at a constant number with

>> utilization at 100%. All i/o operations via NFS blocks, though I am able to

>> traverse the filesystem locally on the nfs server and read/write data. If I

>> wait long enough the device will eventually recover and avgqu-sz goes to

>> zero.

>>

>> The only issue I could find that was similar to this is:

>> http://tracker.ceph.com/issues/8818 - However, I am not seeing the error

>> messages described and I am running a more recent version of the kernel that

>> should contain the fix from that issue. So, I assume this is likely a

>> different problem.

>>

>> The ceph cluster reports as healthy the entire time, all pgs up and in,

>> there was no scrubbing going on, no osd failures or anything like that.

>>

>> I ran echo t > /proc/sysrq-trigger and the output is here:

>> https://gist.github.com/anonymous/89c305443080149e9f45

>>

>> Any ideas on what could be going on here? Any additional information I can

>> provide?

>

> Hi Randy,

>

> We've recently fixed two major long-standing bugs in this area.

> Currently, the only kernel that has fixes for both is 4.5-rc6, but

> backports are on their way - both patches will be 4.4.4.  I'll make

> sure those patches are queued for the ubuntu 3.19 kernel as well, but

> it'll take some time for them to land.

>

> Could you try either 4.5-rc6 or 4.4.4 after it comes out?  It's likely

> that your problem is fixed.

>

> Thanks,

>

>                Ilya

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com