Re: fsck failing to notice that the block device was pulled out from under it?

Tomas Pospisek <tpo2@xxxxxxxxxxxxx> · Tue, 26 May 2015 08:39:24 +0200

On 26.05.2015 at 03:14, Theodore Ts'o wrote:
> On Mon, May 25, 2015 at 11:59:44PM +0200, Tomas Pospisek wrote:
>> Hello,
>>
>> tl;dr: it seems like fsck fails to notice when the block device
>> disappears from under it.
> 
> When a block device disappears, reads (and writes) using the file
> descriptor open on the block device will return errors, and that is
> how e2fsck "notices".  And as far as fsck is concerned, the only "block
> device" which it is interacting with is the device mapper node which
> is exported by the LUKS encrypted device --- and the problem is that
> the device mapper node is *not* disappearing.
> 
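As a rough illustration of the mechanism described above (this is not
e2fsck's actual code, just a minimal Python sketch): a userspace tool only
learns that a device is gone when a read on it fails. `probe_read` is a
made-up helper name, and any path passed to it (for example a
/dev/mapper/... node) is an assumption about the local setup.

```python
import errno
import os


def probe_read(path, offset=0, length=4096):
    """Attempt one read at `offset`.

    Returns (bytes_read, None) on success, or (0, errno_name) if opening
    or reading fails, e.g. "EIO" for an I/O error or "ENOENT" if the node
    no longer exists.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        return 0, errno.errorcode.get(e.errno, str(e.errno))
    try:
        data = os.pread(fd, length, offset)
        return len(data), None
    except OSError as e:
        return 0, errno.errorcode.get(e.errno, str(e.errno))
    finally:
        os.close(fd)
```

Note that this is a buffered read, so a block that is already cached may
well be returned without the device being touched at all. If the device
mapper node stays in place and keeps answering, nothing in this sketch
would ever see an error, which matches the behaviour described above.
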
>> Nevertheless fsck was happily continuing with its disk check.
>>
>> So I think there are a few parts broken in this chain of layers. The one
>> that I can put a finger on is that fsck should notice or should be
>> notified when the block device under it ceases to exist, as is the case
>> when the LUKS device becomes locked again.
>>
>> I'm not sure why fsck doesn't notice. Doesn't it get the right
>> information from the LUKS block device?
> 
> Apparently not.  I think you need to complain to the LUKS and
> device-mapper developers.  I will note that some device-mapper nodes
> are *designed* to hide the fact that one or more of the underlying
> block devices might have disappeared --- for example, in the case of
> a dm_multipath or dm_raid device, you want the exported device-mapper
> "block device" to survive even if one or more of the underlying
> constituent block devices have disappeared.  That's the whole point of
> those device-mapper nodes.
> 
>> The end result of this is that my backups are lost. The disk can still
>> be read, but LUKS is no longer able to decrypt it.
> 
> So that seems weird.  I don't know why LUKS would be corrupting the
> device just because of a USB disconnect.  As I said, the worst that
> *should* happen is that reads and writes should be returning I/O
> errors.  But this is a LUKS / dm_crypt problem, so you should be
> raising this question with the device mapper folks.

Thanks a lot for your explanation, Ted!

One more question, if I may. You have in principle already answered it,
but I want to be sure. Who is it that is writing this to the kernel log:

    May 25 12:39:51 hier kernel: [79872.773327] Buffer I/O error on device dm-0, logical block 68681774
    May 25 12:39:51 hier kernel: [79872.773328] lost page write due to I/O error on dm-0

Is it the layers *below* the ext4 module that are reporting this?

Again, thanks a lot for your explanation!
*t

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users