Re: lvremove kernel BUG at drivers/md/dm-bufio.c:1494!

Nikolay Borisov <n.borisov@xxxxxxxxxxxxxx> · Sat, 12 Dec 2015 11:21:46 +0200

On 11/20/2015 09:46 PM, Mike Snitzer wrote:
> On Thu, Nov 19 2015 at 10:14am -0500,
> vaLentin chernoZemski <valentin@xxxxxxxxxxxxxx> wrote:
> 
>> Hi folks,
>>
>> It seems that there is a bug in the Linux kernel in every release we
>> have tested:
>>
>>  - 2.6.32-573.3.1.el6.x86_64 - crash
>>  - 3.12.49 + msg00123 patch - crash / D state
>>  - 4.1.6 - lv* operations in D state after bug is hit
>>  - 4.1.12 + f11a82caf / b0dc3c8bc15 - lv* operations in D state
>> after bug is hit
>>  - 4.2.5 - lv* operations in D state after bug is hit
>>  - 4.3.0-rc7-vanilla1
>>
>> The bug is described in detail, with stack traces, in Red Hat's
>> Bugzilla under ID 1219634:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1219634
>>
>> For some reason it is marked as private but I guess you have access
>> to this one.
>>
>> The issue is present in the latest RHEL version and in all vanilla
>> kernels I tested with the various patches specified in the bug.
>>
>> Although I cannot provide you with an exact reproducer, it happens
>> often enough on a fleet of machines we have that perform certain
>> tasks. We can easily test new patches or provide you with specific
>> information upon request from the crash dumps we have reliably
>> collected, and are still collecting, from all kernel versions tested.
>>
>> Mike Snitzer advised me to post to dm-devel, so here it is.
>>
>> Let us know if there is anything we can do to assist you further.
> 
> As you know we've already had further exchanges off-list (started prior
> to you having sent this mail to dm-devel).
> 
> But for the benefit of others, here are some additional details not
> covered above:
> - you have a pretty extensive multi-system setup that is seeing these
>   thinp metadata corruptions manifest as a BUG_ON in bufio.c
> - my theory is that even though we've fixed bugs in persistent-data that
>   will likely prevent future corruption on-disk you could easily have
>   on-disk corruption that even the new code cannot cope with.
> - it isn't productive for the persistent-data code to immediately BUG_ON
>   in the face of this corruption
> - because the kernel code just does BUG_ON you're having a hard time
>   identifying which thin-pool is hitting problems across your cluster
> 
> So in summary, we need 2 improvements moving forward:
> 1) the kernel code should bubble errors out to the edges; the error
>    should cause the pool to transition to read-only mode (w/ needs_check
>    flag set) -- a side-effect of this is we'll get logging of which
>    thin-pool metadata device(s) saw the corruption
> 
> 2) we need lvm2 to simplify direct access to the pool's metadata volume
>    to assist with more advanced troubleshooting (e.g. creating a
>    compressed copy of the thin-pool metadata device that we can analyze)
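
Until point 1) is in place, the pools that have already dropped to
read-only or have needs_check set can be picked out from `dmsetup status`
output. The following is only a rough sketch: the field layout is assumed
from the kernel's thin-provisioning documentation (mode and needs_check
appear as separate tokens, and a failed pool reports just "Fail"), the
function name is hypothetical, and the sample lines are invented for
illustration rather than taken from any real system.

```python
def flag_unhealthy_pools(status_text):
    """Return [(device_name, reasons)] for thin-pool lines that look unhealthy.

    Assumes `dmsetup status` lines of the form
        <name>: <start> <length> thin-pool <args...>
    where <args...> contains the token "ro" for a read-only pool and
    "needs_check" when the metadata needs checking (assumed layout).
    """
    flagged = []
    for line in status_text.splitlines():
        name, sep, rest = line.partition(": ")
        fields = rest.split()
        # Skip anything that is not a thin-pool target line.
        if not sep or len(fields) < 3 or fields[2] != "thin-pool":
            continue
        args = fields[3:]
        reasons = []
        if args == ["Fail"]:
            # A pool in failed mode reports only "Fail".
            reasons.append("failed")
        else:
            if "ro" in args:
                reasons.append("read-only")
            if "needs_check" in args:
                reasons.append("needs_check")
        if reasons:
            flagged.append((name, reasons))
    return flagged


# Invented sample input, for illustration only.
SAMPLE = """\
vg0-pool0-tpool: 0 20971520 thin-pool 12 158/4096 1024/163840 - rw discard_passdown queue_if_no_space -
vg0-pool1-tpool: 0 20971520 thin-pool 7 201/4096 2048/163840 - ro discard_passdown queue_if_no_space needs_check
vg0-thin1: 0 2097152 thin 1024 2097151
"""

if __name__ == "__main__":
    for name, reasons in flag_unhealthy_pools(SAMPLE):
        print(name, ", ".join(reasons))
```

Run across a fleet, the output of something like this would at least
narrow down which hosts and pools to look at first.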

Hello Mike,

Sorry for taking so long to get back to you. I have tested our in-house
reproducer with
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.4&id=ed8b45a3679eb49069b094c0711b30833f27c734

applied and can confirm that with this patch the kernel no longer
crashes, whereas without it, it does. So the aforementioned patch does
indeed fix the issue. You can add

Tested-by: Nikolay Borisov <kernel@xxxxxxxx>

On a different note, are you still interested in acquiring the image we
used to reproduce the issue? If so maybe we should liaise off-list to
get it to you?

Regards,
Nikolay

> 
> Mike
> 
