Re: File system corruption

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 12 Oct 2012 08:07:41 +1100

On Thu, Oct 11, 2012 at 12:52:58PM -0500, Wayne Walker wrote:
> In short, I am able to:  mkfs...; mount...; cp 1gbfile...; sync; cp
> 1gbfile...; sync  # and now the xfs is corrupt
> 
> I see multiple bugs
> 
> 1. very simple, non-corner-case actions create a corrupted file system
> 2. corrupt data is knowingly written to the file system.
> 3. the file system stays online and writable
> 4. future write operations to the file system return success.
> 
> Details:
.....

Nothing unusual there in the hardware. Seems sane to me.

> The exact commands to create the failure:
> 
> /sbin/mkfs.xfs -f -l logdev=/dev/sda5 -b size=4096 -d su=1024k,sw=4 /dev/sde1
> cat /etc/fstab
> mount -t xfs -o defaults,noatime,logdev=/dev/sda5 /dev/sde1 /dtfs_data/data1
> cp random_data.1G /dtfs_data/data1
> # returns 0
> sync
> # file system reported no failure yet
> cp random_data.1G /dtfs_data/data1
> # returns 0
> sync
> # file system reports stack trace, bad agf, and page discard

OK, I've looked at the stack trace. The AGF block that was read
contained zeros, not valid metadata, which is why the allocation
failed.

Can you remake the filesystem at will? If so, can you run mkfs.xfs
as per above, then run the following command?

# echo 3 > /proc/sys/vm/drop_caches
# for i in `seq 0 4`; do
> xfs_db -l /dev/sda5 -c "sb $i" -c p -c "agf $i" -c p /dev/sde1
> done

So that we can see what mkfs put on disk? Can you then mount the
filesystem, unmount it again, and run the same commands? Then mount
the filesystem, run the copy/sync to trigger the error, then unmount
and run the commands again?
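To keep the three rounds of output apart, the loop above could be wrapped in a small function that writes each run to its own file so the stages can be diffed afterwards. This is only a sketch: the function name `dump_ag_headers` and the output file names are hypothetical, and the xfs_db invocation is exactly the one given above.

```shell
# Hypothetical helper (not part of xfsprogs): run the xfs_db dump of the
# superblock and AGF for AGs 0..4 and save it to one file per test stage.
#   $1 = external log device, $2 = data device, $3 = output file
dump_ag_headers() {
    logdev=$1 datadev=$2 out=$3
    for i in $(seq 0 4); do
        # Same commands as the loop above: print sb $i, then agf $i.
        xfs_db -l "$logdev" -c "sb $i" -c p -c "agf $i" -c p "$datadev"
    done > "$out"
}
```

As in the original commands, run `echo 3 > /proc/sys/vm/drop_caches` as root before each call, e.g. `dump_ag_headers /dev/sda5 /dev/sde1 after-mkfs.txt`, then compare stages with `diff after-mkfs.txt after-failure.txt`.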

What I'm interested in is whether xfs_db sees the AGF (whichever
one it is) as zeroed, or whether only the kernel is seeing that.
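A third data point, independent of both xfs_db and the kernel's XFS code, is to read the AGF sector straight off the device. The sketch below assumes the standard XFS AG header layout (superblock copy in sector 0 of each AG, AGF in sector 1, AGF magic "XAGF") and takes the AG size and sector size from the superblock fields `agblocks * blocksize` and `sectsize`. The function name `agf_state` is hypothetical, and dd still reads through the block device cache, so drop caches first as above.

```shell
# Hypothetical helper (not part of xfsprogs): classify the AGF sector of one
# allocation group by reading it directly from the device with dd.
#   $1 = block device or image file
#   $2 = allocation group number
#   $3 = AG size in bytes (superblock agblocks * blocksize)
#   $4 = sector size in bytes (superblock sectsize), assumed 512 if omitted
agf_state() {
    dev=$1 ag=$2 agbytes=$3 sect=${4:-512}
    # Each AG starts with a superblock copy in sector 0; the AGF is sector 1.
    skip=$(( ag * agbytes / sect + 1 ))
    # Dump that one sector as a single unbroken hex string.
    hex=$(dd if="$dev" bs="$sect" skip="$skip" count=1 2>/dev/null |
          od -An -v -tx1 | tr -d ' \n')
    case $hex in
        "")        echo "unreadable" ;;  # nothing read at that offset
        58414746*) echo "XAGF" ;;        # starts with the AGF magic "XAGF"
        *[!0]*)    echo "garbage" ;;     # non-zero data without the magic
        *)         echo "zeroed" ;;      # every byte in the sector is zero
    esac
}
```

For example, `agf_state /dev/sde1 1 "$agbytes"` reports "zeroed" if AG 1's AGF is all zeros on disk, which is the case the kernel trace points at.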

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs