https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #14 from John Hughes <john@xxxxxxxxx> ---

On 28/09/15 19:06, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@xxxxxxx> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right? Can I assume that
> your silence indicates that you haven't seen a problem to date?

I haven't seen the problem, but unfortunately I'm running 3.18.19 at the
moment (I screwed up on the last boot and let it boot the default kernel),
and I haven't had time to reboot. So I'd like to give it a bit more time.

> If so, then it really does seem that it might be an interaction between
> LVM/MD and KVM.
>
> So if that's the case, then the next thing to ask is to try to figure
> out what might be the triggering cause. A couple of things come to mind:
>
> 1) Some failure to properly handle a flush cache command being sent to
> the MD device. This, combined with either a power failure or a crash of
> the guest OS (depending on how KVM is configured), might explain a block
> update getting lost. The fact that the block bitmap is out of sync with
> the block group descriptor is consistent with this failure. However, if
> you were seeing failures once every four days, that would imply that the
> guest OS and/or host OS would be crashing at or about that level of
> frequency, and you haven't reported that.

I haven't had any host or guest crashes.

> 2) Some kind of race between a 4k write and a RAID1 resync leading to a
> block write getting lost. Again, this reported data corruption is
> consistent with this theory --- but this also requires the guest OS
> crashing due to some kind of kernel crash or KVM/qemu shutdown and/or
> host OS crash / power failure, as in (1) above. If you weren't seeing
> these failures once every four days or so, then this isn't a likely
> explanation.
No crashes.

> 3) Some kind of corruption caused by the TRIM command being sent to the
> RAID/MD device, possibly racing with a block bitmap update. This could
> be caused either by the file system being mounted with the -o discard
> mount option, or by fstrim getting run out of cron, or by e2fsck
> explicitly being asked to discard unused blocks (with the "-E discard"
> option).

I'm not using "-o discard" or fstrim, and I've never used the "-E discard"
option to fsck.

> 4) Some kind of bug which happens rarely either in qemu, the host kernel
> or the guest kernel, depending on how it communicates with the virtual
> disk (i.e., virtio, scsi, ide, etc.). Virtio is the most likely use
> case, and so trying to change to use scsi emulation might be
> interesting. (OTOH, if the problem is specific to the MD layer, then
> this possibility is less likely.)
>
> So as far as #3 is concerned, can you check to see if you had fstrim
> enabled, or are mounting the file system with -o discard?

I'm a bit overwhelmed with work at the moment, so I haven't had time to
read this message with the care it deserves. I'll get back to you with
more detail next week.
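For anyone following along, the two checks Ted asks for in point #3 can be sketched roughly like this. The paths and patterns are assumptions for a Debian-style system of that era (cron-driven fstrim rather than a systemd timer), not something stated in the thread:

```shell
#!/bin/sh
# Sketch: check whether TRIM could be in play. Paths are assumptions
# for a Debian-style system; adjust for your distribution.

# (a) Is any ext4 filesystem currently mounted with the discard option?
if grep 'ext4' /proc/mounts | grep -q 'discard'; then
    echo "at least one ext4 mount uses -o discard"
else
    echo "no ext4 mount uses -o discard"
fi

# (b) Is fstrim being run from cron?
if grep -rqs 'fstrim' /etc/crontab /etc/cron.d /etc/cron.daily /etc/cron.weekly; then
    echo "found an fstrim cron entry"
else
    echo "no fstrim cron entry found"
fi
```

If both checks come back negative and e2fsck was never run with "-E discard", TRIM can reasonably be ruled out as the trigger.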