Re: Ext4 corruption with VM images as 3 > drop_caches

Jan Kara <jack@xxxxxxx> · Fri, 20 Mar 2020 12:49:40 +0100

On Fri 20-03-20 11:04:50, Ritesh Harjani wrote:
> On 3/19/20 6:54 PM, Ritesh Harjani wrote:
> > On 3/18/20 9:17 AM, Aneesh Kumar K.V wrote:
> > > Hi,
> > > 
> > > With a new VM install I am seeing corruption of the VM image if I
> > > follow up the install with echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > The file system reports the error below.
> > > 
> > > Begin: Running /scripts/local-bottom ... done.
> > > Begin: Running /scripts/init-bottom ...
> > > [    4.916017] EXT4-fs error (device vda2): ext4_lookup:1700: inode
> > > #787185: comm sh: iget: checksum invalid
> > > done.
> > > [    5.244312] EXT4-fs error (device vda2): ext4_lookup:1700: inode
> > > #917954: comm init: iget: checksum invalid
> > > [    5.257246] EXT4-fs error (device vda2): ext4_lookup:1700: inode
> > > #917954: comm init: iget: checksum invalid
> > > /sbin/init: error while loading shared libraries: libc.so.6: cannot
> > > open shared object file: Error 74
> > > [    5.271207] Kernel panic - not syncing: Attempted to kill init!
> > > exitcode=0x00007f00
> > > 
> > > And debugfs reports
> > > 
> > > debugfs:  stat <917954>
> > > Inode: 917954   Type: bad type    Mode:  0000   Flags: 0x0
> > > Generation: 0    Version: 0x00000000
> > > User:     0   Group:     0   Size: 0
> > > File ACL: 0
> > > Links: 0   Blockcount: 0
> > > Fragment:  Address: 0    Number: 0    Size: 0
> > > ctime: 0x00000000 -- Wed Dec 31 18:00:00 1969
> > > atime: 0x00000000 -- Wed Dec 31 18:00:00 1969
> > > mtime: 0x00000000 -- Wed Dec 31 18:00:00 1969
> > > Size of extra inode fields: 0
> > > Inode checksum: 0x00000000
> > > BLOCKS:
> > > debugfs:
> > > 
> > > Bisecting this finds
> > > Commit 244adf6426ee31a83f397b700d964cff12a247d3("ext4: make
> > > dioread_nolock the default")
> > > as bad. If I revert the same on top of linus
> > > upstream(fb33c6510d5595144d585aa194d377cf74d31911)
> > > I don't hit the corruption anymore.
> > 
> > I tried replicating this and could easily reproduce it on a Power box.
> > I tried to reproduce this on x86 too, but could not.
> > Now one difference on Power could be that pagesize is 64K and fs
> > blocksize is 4K.
> > 
> > The issue looks like the guest qemu image file is not properly written
> > back, after host does echo 3 > drop_caches. (correct me if this is not
> > the case).
> 
> Ok. So I tried this with the "cache=directsync" parameter passed to the
> drive file. This parameter says it should bypass the host side page
> cache. With this parameter, I don't see this issue on the Power box.

OK, so this likely means that there is something hosed in the writeback
path using unwritten extents when blocksize < pagesize. Maybe we miss some
conversion of unwritten extent to a written one and thus after dropping
caches we effectively lose data?

								Honza

> > I tried replicating via the test below, but could not reproduce it.
> > 
> > Any idea what kind of unit test could be written for this?
> > I am not sure how exactly qemu is writing to its image file.
> > 
> > 
> > 1. Create 2 files. "mmap-file", "mmap-data".
> > 2. "mmap-file" is a 2GB sparse file. Then at some random offsets (tried
> > with both 64KB-aligned and 4KB-aligned offsets), try to write a
> > pagesize/blocksize amount of a known data pattern.
> > 3. These offsets (which are pagesize/blocksize aligned) are recorded into
> > the "mmap-data" file via normal read/write calls.
> > 4. Then after we wrote to both files, we munmap the "mmap-file" and
> > close both of these files.
> > 5. Then we do echo 3 > drop_caches.
> > 6. Then in the verify phase, using the offsets written in the "mmap-data"
> > file, I read the "mmap-file" to verify whether its contents are proper or
> > not.
> > With that I could not reproduce this issue.
> > 
> > 
> > -ritesh
> > 
> > 
> 
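
For reference, the mmap test described in steps 1-6 above could be sketched
roughly as follows in Python. This is only an illustration and not Ritesh's
actual test program: the function names, the 0xA5 fill pattern, the
little-endian 64-bit offset encoding used in "mmap-data", and the fixed random
seed are all assumptions made for the sketch.

```python
import mmap
import random
import struct

PATTERN_BYTE = 0xA5  # assumed "known data pattern"; the original is not given


def write_phase(file_path, offsets_path, file_size, chunk, count, seed=0):
    """Steps 1-4: write `chunk` bytes of pattern at random aligned offsets
    through a shared mmap, and record the offsets with ordinary write()."""
    rng = random.Random(seed)
    # Pick `count` distinct chunk-aligned offsets inside the file.
    offsets = [i * chunk for i in sorted(rng.sample(range(file_size // chunk), count))]
    with open(file_path, "w+b") as f:
        f.truncate(file_size)  # sparse file: no blocks allocated yet
        with mmap.mmap(f.fileno(), file_size) as m:
            for off in offsets:
                m[off:off + chunk] = bytes([PATTERN_BYTE]) * chunk
        # Leaving the inner "with" unmaps; leaving the outer one closes the file.
    with open(offsets_path, "wb") as f:
        for off in offsets:
            f.write(struct.pack("<Q", off))
    return offsets


def drop_caches():
    """Step 5: equivalent of `echo 3 > /proc/sys/vm/drop_caches` (needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def verify_phase(file_path, offsets_path, chunk):
    """Step 6: re-read every recorded offset and return those that mismatch."""
    expected = bytes([PATTERN_BYTE]) * chunk
    with open(offsets_path, "rb") as f:
        offsets = [off for (off,) in struct.iter_unpack("<Q", f.read())]
    bad = []
    with open(file_path, "rb") as f:
        for off in offsets:
            f.seek(off)
            if f.read(chunk) != expected:
                bad.append(off)
    return bad
```

A run resembling the description would be, e.g.,
write_phase("mmap-file", "mmap-data", 2 << 30, 4096, 1000), then drop_caches()
as root, then verify_phase("mmap-file", "mmap-data", 4096); an empty list means
every chunk read back intact. Setting chunk to 65536 covers the 64KB-aligned
variant.
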
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR