Re: disk timeouts in libvirt/qemu VMs...

The exclusive-lock feature should only require grabbing the lock on
the very first IO, so if this is an issue that pops up after extended
use, it is most likely either unrelated to exclusive-lock or the
result of a client<->OSD link hiccup. In the latter case, you will see
a message like "image watch failed" in your client logs.
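
If you want to check whether the watch on a suspect image is still
intact, something like the following should show it (the pool/image
name is a placeholder, and the grep assumes client-side logging is
enabled and written to the usual ceph client log location):

  rbd status mypool/myimage     # lists the active watchers on the image
  grep -i "image watch failed" /var/log/ceph/ceph-client.*.log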

Since this isn't something that we have run into during our regular
testing, I would greatly appreciate it if someone could capture a
"gcore" dump from a running but stuck process and use "ceph-post-file"
to provide us with the dump (along with the versions of the installed
RPMs/DEBs so we can configure the proper debug symbols).
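
Roughly, that capture would look like the following (the qemu PID and
the file paths are just placeholders):

  gcore -o /tmp/qemu-stuck 12345               # writes /tmp/qemu-stuck.12345
  dpkg -l | grep ceph > /tmp/ceph-pkgs.txt     # or: rpm -qa | grep ceph
  ceph-post-file /tmp/qemu-stuck.12345 /tmp/ceph-pkgs.txt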

On Thu, Mar 30, 2017 at 7:18 AM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/28/17 17:28, Brian Andrus wrote:
>> Just adding some anecdotal input; it likely won't be ultimately
>> helpful beyond a +1.
>>
>> Seemingly, we have also had the same issue since enabling
>> exclusive-lock on images. We saw it at large scale a few weeks ago,
>> when a CRUSH map change left many, many VMs showing the blocked-task
>> kernel messages and requiring reboots.
>>
>> We've since disabled the feature on all images we can, but there are
>> still jewel-era instances that cannot have it disabled. Since
>> disabling the feature, I have not observed any blocked tasks, but
>> given the limited timeframe so far I'd consider that anecdotal.
>>
>>
>
> Why do you need it enabled on jewel-era instances? With jewel you can
> change features on the fly, and live migrate the VM to get the client
> to pick up the change.
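>
> For reference, that is roughly (pool, image, and domain names are just
> placeholders; the features have to come off in dependency order):
>
>   rbd feature disable mypool/vm-disk fast-diff
>   rbd feature disable mypool/vm-disk object-map
>   rbd feature disable mypool/vm-disk exclusive-lock
>   virsh migrate --live vm-name qemu+ssh://other-host/system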
>
> I couldn't find any difference except that removing big images is
> faster with object-map (which depends on exclusive-lock), so I can't
> imagine why it would be required.
>
> And how long did you test it? I tested it a few weeks ago for about a
> week, with no hangs; normally the hangs show up after a few days. I
> have had it permanently disabled since the 20th, without any hangs
> since, and I'm gradually adding back the VMs that died when the
> features were enabled, starting with the worst offenders. Even with
> that short window, I'm still quite convinced.
>
> And did you test any other features? I suspected exclusive-lock, so I
> only tested removing that one, which required removing object-map and
> fast-diff too, so I didn't test those two separately.



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


