On 11/20/2015 09:46 PM, Mike Snitzer wrote:
> On Thu, Nov 19 2015 at 10:14am -0500,
> vaLentin chernoZemski <valentin@xxxxxxxxxxxxxx> wrote:
>
>> Hi folks,
>>
>> It seems that there is a bug in the linux kernel in any release from
>>
>> - 2.6.32-573.3.1.el6.x86_64 - crash
>> - 3.12.49 + msg00123 patch - crash / D state
>> - 4.1.6 - lv* operations in D state after bug is hit
>> - 4.1.12 + f11a82caf / b0dc3c8bc15 - lv* operations in D state
>>   after bug is hit
>> - 4.2.5 - lv* operations in D state after bug is hit
>> - 4.3.0-rc7-vanilla1
>>
>> The bug is described in details and stack traces in RedHat's
>> bugzilla under id 1219634:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1219634
>>
>> For some reason it is marked as private but I guess you have access
>> to this one.
>>
>> Issue is present in current latest RHEL version and all vanilla
>> kernels I tested with multiple patches specified in the bug.
>>
>> Even I can not provide you with exact reproducer it happens often
>> enough on a fleet of machines we have that perform certain tasks and
>> we can easily test new patches or provide you with specific
>> information upon request from all crash dumps we reliably collected
>> and still collecting from all kernel versions tested.
>>
>> I got advised by Mike Snitzer to dm-devel so here it is.
>>
>> Let us know if there is anything we can do to assist you further.
>
> As you know we've already had further exchanges off-list (started prior
> to you having sent this mail to dm-devel).
>
> But for the benefit of others; here are some additional details not
> covered above:
> - you have a pretty extensive multi-system setup that is seeing these
>   thinp metadata corruptions manifest as a BUG_ON in bufio.c
> - my theory is that even though we've fixed bugs in persistent-data that
>   will likely prevent future corruption on-disk you could easily have
>   on-disk corruption that even the new code cannot cope with.
> - it isn't productive for the persistent-data code to immediately BUG_ON
>   in the face of this corruption
> - because the kernel code just does BUG_ON you're having a hard time
>   identifying which thin-pool is hitting problems across your cluster
>
> So in summary, we need 2 improvements moving forward:
> 1) the kernel code should bubble errors out to the edges; the error
>    should cause the pool to transition to read-only mode (w/ needs_check
>    flag set) -- a side-effect of this is we'll get logging of which
>    thin-pool metadata device(s) saw the corruption
>
> 2) we need lvm2 to simplify direct access to the pool's metadata volume
>    to assist with more advanced troubleshooting (e.g. creating a
>    compressed copy of the thin-pool metadata device that we can analyze)

Hello Mike,

Sorry for taking so long to get back to you. I have tested our in-house
reproducer with

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.4&id=ed8b45a3679eb49069b094c0711b30833f27c734

applied and can confirm that with this patch the kernel no longer
crashes, whereas without it the kernel does crash. So the aforementioned
patch does fix the issue. You can add

Tested-by: Nikolay Borisov <kernel@xxxxxxxx>

On a different note, are you still interested in acquiring the image we
used to reproduce the issue? If so, maybe we should liaise off-list to
get it to you?

Regards,
Nikolay

>
> Mike
>

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel