Re: Filesystem corruption after unreachable storage

As the mail seems to have been trashed somewhere, I'll retry :)

Thanks
Jean-Louis


On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>> There was a short disruption of the SAN, which caused it to be unavailable
>> for 20-25 minutes for the ESXi.
> 20-25 minutes is "short"? I guess it depends on your definition / POV. :-)
Well, more downtime was caused by the recovery (the manual fsck) than by the storage outage itself :)

>> What worries me is that almost all of the VM's (out of 500) were showing the
>> same error.
> So that's a bit surprising...
Indeed, that's where I thought something went wrong!
I've tried to simulate it, and was able to reproduce the same error when the SAN recovers BEFORE the VM is shut down. If I power off the VM and then recover the SAN, it does an automatic fsck without problems.
So it really seems to break the moment the VM can write to the SAN again.

>> And even some (+-10) were completely corrupt.
> What do you mean by "completely corrupt"? Can you send an e2fsck
> transcript of file systems that were "completely corrupt"?
Well, it was moving tons of files to lost+found, etc., so that was really broken.
I'll see if I can recover a backup of one in that broken state.
Anyway, this was only a very small percentage, so it worries me less than the rest :)

>> Is there for example a chance that the filesystem gets corrupted the moment
>> the SAN storage was back accessible?
> Hmm... the one possibility I can think of off the top of my head is
> that in order to mark the file system as containing an error, we need
> to write to the superblock. The head of the linked list of orphan
> inodes is also in the superblock. If that had gotten modified in the
> intervening 20-25 minutes, it's possible that this would result in
> orphaned inodes not on the linked list, causing that error.
>
> It doesn't explain the more severe cases of corruption, though.
If fixing that would leave us with only 10 corrupt disks instead of 500, that would be a big win :)
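
As a side note, for anyone who wants to check a snapshot for this state before running e2fsck: the orphan list head Ted mentions lives in the superblock, so dumpe2fs can show it. A rough example (using the device name from the transcript below; dumpe2fs only prints the orphan field when the list is non-empty):

# dumpe2fs -h /dev/mapper/vg01-root | grep -i orphan

If the list is non-empty this prints a "First orphan inode:" line, and the "Filesystem state:" field in the full header shows whether an error has been recorded.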

>> I also have some snapshot available of a corrupted disk if some additional
>> debugging info is required.
> Before e2fsck was run? Can you send me a copy of the output of
> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
> copy of that snapshot?
Sure:
dumpe2fs -> see attachment

Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 165708 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(863328--863355)
Fix? yes

Free blocks count wrong for group #26 (3485, counted=3513).
Fix? yes

Free blocks count wrong (1151169, counted=1151144).
Fix? yes

Inode bitmap differences:  -4401 -165708
Fix? yes

Free inodes count wrong for group #0 (2489, counted=2490).
Fix? yes

Free inodes count wrong for group #20 (1298, counted=1299).
Fix? yes

Free inodes count wrong (395115, counted=395098).
Fix? yes


/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks


>> It would be great to gather some feedback on how to improve the situation
>> (next to of course have no SAN outage :)).
> Something that you could consider is setting up your system to trigger
> a panic/reboot on a hung task timeout, or when ext4 detects an error
> (see the man page of tune2fs and mke2fs and the -e option for those
> programs).
>
> There are tradeoffs with this, but if you've lost the SAN for 15-30
> minutes, the file systems are going to need to be checked anyway, and
> the machine will certainly not be serving. So forcing a reboot might
> be the best thing to do.
Going to look into that! Thanks for the info.
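
For reference, a minimal sketch of what that could look like (the device path and values here are just examples, not a recommendation):

# tune2fs -e panic /dev/mapper/vg01-root      (panic instead of continuing when ext4 detects an error)
# sysctl -w kernel.hung_task_panic=1          (panic when a task is blocked longer than the hung task timeout)
# sysctl -w kernel.panic=30                   (reboot automatically 30 seconds after a panic)

The sysctl settings would of course need to go into /etc/sysctl.d/ (or similar) to survive a reboot.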
>> On KVM for example there is an unlimited timeout (afaik) until the storage is
>> back, and the VM just continues running after storage recovery.
> Well, you can adjust the SCSI timeout, if you want to give that a try....
Does that have other disadvantages? Or is it quite safe to increase the SCSI timeout?
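
For completeness, this is roughly how the per-device SCSI command timeout can be raised inside the guest (sdX and 180 are placeholders; the value is in seconds and defaults to 30, and the echo does not persist across reboots, so it would need a udev rule or similar to stick):

# cat /sys/block/sdX/device/timeout
# echo 180 > /sys/block/sdX/device/timeout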

> Cheers,
>
> - Ted




