Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 14 Oct 2021 21:21:29 -0600

On Oct 13, 2021, at 3:41 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> 
> On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>> 
>> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>> 
>> hostA				hostB
>> 
>> read_mmp_block()		read_mmp_block()
>> - sees EXT4_MMP_SEQ_CLEAN	- sees EXT4_MMP_SEQ_CLEAN
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>>  - all OK, mount
>> 				write_mmp_block()
>> 				wait_time == 0 -> no wait
>> 				read_mmp_block()
>> 				  - all OK, mount
>> 
>> Do I get it right? Actually, if we passed seq we wrote in
>> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
>> notice the conflict on its first invocation but still that would be a bit
>> late because there would be a time window where hostA and hostB would be
>> both using the fs.

It would be enough to have even a short delay between write and read to
detect this case.  I _thought_ there should be a delay in this case,
but maybe it was removed after the patch was originally submitted?
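
As a rough illustration of why a delay helps (a toy model, not the kernel
code), the sketch below replays Jan's timeline against a single shared
block.  Everything in it is an assumption made for illustration: the
`simulate` helper, the use of the host name as the sequence value, and the
abstract time units.  Both hosts are assumed to have already done their
first read and seen the clean marker.

```python
import heapq

SEQ_CLEAN = "CLEAN"  # stand-in for EXT4_MMP_SEQ_CLEAN


def simulate(write_times, wait_time):
    """Replay the write / wait / re-read steps of each host.

    write_times maps a host name to the time its write lands on disk.
    Each host re-reads the block wait_time units after its own write.
    Returns the hosts that saw their own value on re-read and mounted.
    """
    block = SEQ_CLEAN  # both hosts already read this and saw "clean"
    events = []
    for host, write_at in write_times.items():
        # At equal timestamps, writes (0) are ordered before re-reads (1).
        heapq.heappush(events, (write_at, 0, host, "write"))
        heapq.heappush(events, (write_at + wait_time, 1, host, "reread"))

    mounted = []
    while events:
        _, _, host, op = heapq.heappop(events)
        if op == "write":
            block = host  # last writer wins
        elif block == host:
            mounted.append(host)  # re-read matches own write: "all OK, mount"
        # otherwise the re-read shows a foreign value and the host backs off
    return mounted


skew = {"hostA": 0, "hostB": 1}  # hostB writes one time unit after hostA

print(simulate(skew, wait_time=0))    # no wait: both hosts mount
print(simulate(skew, wait_time=0.5))  # wait shorter than the skew: both mount
print(simulate(skew, wait_time=5))    # wait longer than the skew: only hostB
```

In this model the delay only helps when it is longer than the skew between
the two writes, so a short delay narrows the window rather than closing it.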

>> We could reduce the likelihood of this race by always waiting in
>> ext4_multi_mount_protect() between write & read but I guess that is
>> undesirable as it would slow down all clean mounts. Ted?
> 
> I'd like Andreas to comment here.  My understanding is that MMP was
> originally intended as a safety mechanism which would be used as part
> of a primary/backup high availability system, but not as the *primary*
> system where you might try to have two servers simultaneously try to
> mount the file system and use MMP as the "election" mechanism to
> decide which server is going to be the primary system, and which would
> be the backup system.
> 
> The cost of being able to handle this particular race is it would slow
> down the mounts of cleanly unmounted systems.

Ted's understanding is correct - MMP is intended to be a backup mechanism
to prevent filesystem corruption in the case where external HA methods
do the wrong thing.  This has avoided problems countless times on systems
with multi-port access to the same storage, and can also be useful in the
case of shared VM images accessed over the network, and similar.

When MMP was implemented for ZFS, a slightly different mechanism was used.
Rather than having the delay to detect concurrent mounts, it instead writes
to multiple different blocks in a random order, and then reads them all.
If two nodes try to mount the filesystem concurrently, they would pick
different block orders, and the chance of them having the same order (and
one clobbering all of the blocks of the other) would be 1/2^num_blocks.
The drawback is that this would consume more space in the filesystem, but
it wouldn't be a huge deal these days.

> There *are* better systems to implement leader elections[1] than using
> MMP.  Most of these more efficient leader elections assume that you
> have a working IP network, and so if you have a separate storage
> network (including a shared SCSI bus) from your standard IP network,
> then MMP is a useful failsafe in the face of a network partition of
> your IP network.  The question is whether MMP should be useful for
> more than that.  And if it isn't, then we should probably document
> what MMP is and isn't good for, and give advice in the form of an
> application note for how MMP should be used in the context of a larger
> system.

One of the existing failure cases with HA that MMP detects is loss of
network connection, so I wouldn't want MMP to depend on the IP network.

Cheers, Andreas
