On Thu, Apr 10, 2014 at 10:03:16AM -0400, Theodore Ts'o wrote:
> On Thu, Apr 10, 2014 at 01:04:28AM -0400, Nathaniel W Filardo wrote:
> >[snip]
> What is your workload?  Can you reproduce this easily?  And can you
> try using a local disk to see if the problem goes away, so we can
> start to bisect which software components might be at fault?

We're running an OpenAFS fileserver; the partition that most often causes
us trouble is the one that contains our mirrors (Debian, Fedora, etc.),
which has ~5T of its 10T capacity filled.  Every few days, at least, we
trip over one of these.

I will see what I can do about getting some local storage hooked up to the
VM, but my fear is that this is related to the size of the volume and the
amount of data therein.  If that's true, any amount of local storage I
could muster will not shake this out.

> I'm not aware of any corruption problem with a 3.13 based kernel which
> matches your signature, and the ext4 errors that you are showing
> (minor accounting discrepancies in the number free blocks and number
> of free inodes between the allocation bitmap and the summary
> statistics in the block group descriptors) is very closely matches the
> signature of some part of the storage stack not honoring FLUSH CACHE
> ("barrier") operations, either by ignoring them completely, or
> reordring writes across a barrier / flush cache request.

Shouldn't cache reordering or a failure to flush correctly only matter if
the machine is crashing or otherwise losing power?  I suppose it's possible
there's a bug that would cause the cache to fail to write a block at all,
rather than simply writing it "too late".  But as I said before, we've not
had any crashes or otherwise lost uptime anywhere: host, guest, storage
providers, etc.

That said, we do occasionally get messages from the ATA stack that I don't
know how to decode, though much less often than we get reports of corrupted
metadata (and, naively, they all seem to have been successfully resolved
transients).
One of our VMs, nearly identically configured, though not the one that's
been reporting corruption on its filesystem, spat this out the other day,
for example:

[532625.888251] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[532625.888762] ata1.00: failed command: FLUSH CACHE
[532625.889128] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[532625.889128]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (time out)
[532625.889945] ata1.00: status: { DRDY }
[532630.928064] ata1: link is slow to respond, please be patient (ready=0)
[532635.912178] ata1: device not ready (errno=-16), forcing hardreset
[532635.912220] ata1: soft resetting link
[532636.070087] ata1.00: configured for MWDMA2
[532636.070701] ata1.01: configured for MWDMA2
[532636.070705] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[532651.068208] ata1.00: qc timeout (cmd 0xe7)
[532651.068216] ata1.00: FLUSH failed Emask 0x4
[532651.236146] ata1: soft resetting link
[532651.393918] ata1.00: configured for MWDMA2
[532651.394533] ata1.01: configured for MWDMA2
[532651.394537] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[532651.395550] ata1.00: device reported invalid CHS sector 0
[532651.395564] ata1: EH complete

This appears to have been during a stall in Ceph's ability to write data to
the OSDs.  I don't know what caused that, and the machine has been happy
ever since (though maybe it's all about to go south?).

I'll keep my eyes open, especially for any reports of problems between
fscking and the blockmap corruption being detected.

Would I be better off attaching the RBDs via virtio rather than AHCI?  Do I
need to do anything special on the guest to make that work?

Thanks very much, again,
--nwf;