https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #14 from John Hughes <john@xxxxxxxxx> ---

On 28/09/15 19:06, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@xxxxxxx> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right? Can I assume that
> your silence indicates that you haven't seen a problem to date?

I haven't seen the problem, but unfortunately I'm running 3.18.19 at the
moment (I screwed up on the last boot and let it boot the default kernel),
and I haven't had time to reboot. So I'd like to give it a bit more time.

> If so, then it really does seem that it might be an interaction between
> LVM/MD and KVM.
>
> So if that's the case, then the next thing to ask is to try to figure
> out what might be the triggering cause. A couple of things come to mind:
>
> 1) Some failure to properly handle a flush cache command being sent to
> the MD device. This, combined with either a power failure or a crash of
> the guest OS (depending on how KVM is configured), might explain a block
> update getting lost. The fact that the block bitmap is out of sync with
> the block group descriptor is consistent with this failure. However, if
> you were seeing failures once every four days, that would imply that the
> guest OS and/or host OS would be crashing at or about that level of
> frequency, and you haven't reported that.

I haven't had any host or guest crashes.

> 2) Some kind of race between a 4k write and a RAID1 resync leading to a
> block write getting lost. Again, this reported data corruption is
> consistent with this theory --- but this also requires the guest OS
> crashing due to some kind of kernel crash or KVM/qemu shutdown and/or
> host OS crash / power failure, as in (1) above. If you weren't seeing
> these failures once every four days or so, then this isn't a likely
> explanation.
No crashes.

> 3) Some kind of corruption caused by the TRIM command being sent to the
> RAID/MD device, possibly racing with a block bitmap update. This could
> be caused either by the file system being mounted with the -o discard
> mount option, or by fstrim getting run out of cron, or by e2fsck
> explicitly being asked to discard unused blocks (with the "-E discard"
> option).

I'm not using "-o discard" or fstrim, and I've never used the "-E discard"
option to fsck.

> 4) Some kind of bug which happens rarely either in qemu, the host kernel
> or the guest kernel, depending on how it communicates with the virtual
> disk (i.e., virtio, scsi, ide, etc.). Virtio is the most likely use
> case, and so trying to change to use scsi emulation might be
> interesting. (OTOH, if the problem is specific to the MD layer, then
> this possibility is less likely.)
>
> So as far as #3 is concerned, can you check to see if you had fstrim
> enabled, or are mounting the file system with -o discard?

I'm a bit overwhelmed with work at the moment, so I haven't had time to
read this message with the care it deserves. I'll get back to you with
more detail next week.
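For anyone following along, the two checks Ted asks for in point #3 can be sketched roughly like this. The paths and patterns are assumptions for a Debian-style system of that era (cron-driven fstrim rather than a systemd timer), not something stated in the thread:

```shell
#!/bin/sh
# Sketch: check whether TRIM could be in play. Paths are assumptions
# for a Debian-style system; adjust for your distribution.

# (a) Is any ext4 filesystem currently mounted with the discard option?
if grep 'ext4' /proc/mounts | grep -q 'discard'; then
    echo "at least one ext4 mount uses -o discard"
else
    echo "no ext4 mount uses -o discard"
fi

# (b) Is fstrim being run from cron?
if grep -rqs 'fstrim' /etc/crontab /etc/cron.d /etc/cron.daily /etc/cron.weekly; then
    echo "found an fstrim cron entry"
else
    echo "no fstrim cron entry found"
fi
```

If both checks come back negative and e2fsck was never run with "-E discard", TRIM can reasonably be ruled out as the trigger.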