Re: [PATCH v2] Add support for new compat feature "super_sparse"

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 16 Jan 2014 13:21:47 -0700

On Jan 14, 2014, at 9:08 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Tue, Jan 14, 2014 at 04:21:52AM -0700, Andreas Dilger wrote:
>> A few comments on this new patch:
>> - I think the name will be confusing to users, especially non-native English speakers. Is it "sparse_super" or "super_sparse" they want?
> 
> Yes, good point.  Maybe sparse_super2?  More generally, I don't think
> we want most users of mke2fs ever needing or wanting to use these
> features.  We can kind of handle this by using "mke2fs -T smr", or
> some such, but this is related to something I've been thinking about
> for a while, which is a way of collapsing the following from dumpe2fs:
> 
> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
> 
> ... into something like this.
> 
> Filesystem features:      ext4_default_set needs_recovery

I'm OK with this in theory, but it would make it harder to know what
features are actually enabled, especially if "ext4_default_set" is
changing over time.  Also, while this might be OK for "dumpe2fs"
output, it shouldn't be used for the debugfs "features" command
output, since that would break the ability to determine what features
are actually implemented.

>> - I would suspect that group #1 is not the best place to put the backup.
>>  For very large filesystems, there is a conflict with the backup group
>>  descriptors in group #0 and #1. It would be better to put the one
>>  backup in group #3 or something.  I don't think this will be a problem
>>  for SMR drives, since they will be so large that this will easily fit inside
>>  (or close to) the flex_bg layout of the inode table.
> 
> I'm not sure what you mean by "conflict with the backup
> descriptors in #0 and #1"?

In 4kB blocksize filesystems with 64-bit group descriptors, there
are 64 group descriptors per block, so for the 32k blocks in group
#0 this means a maximum of 32767 * 64 ~= 2M groups.  At 128MB per
group that is just under 256TB before the group #0 group descriptors
collide with the group #1 superblock and group #1 descriptor backups.
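
To make the arithmetic explicit, here is a rough sketch of the
calculation.  It assumes 4kB blocks, 64-byte descriptors, and 32768
blocks per group, and it ignores reserved GDT blocks and any other
metadata in group #0, so treat the result as an upper bound rather
than an exact limit.  The variable names are only illustrative.

```python
# Assumed geometry: 4 KiB blocks, 64-byte (64-bit) group descriptors.
BLOCK_SIZE = 4096
DESC_SIZE = 64
# One block bitmap block covers 8 * BLOCK_SIZE blocks, i.e. one group.
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE            # 32768

descs_per_block = BLOCK_SIZE // DESC_SIZE    # 64 descriptors per block

# Group #0 has 32768 blocks; one holds the primary superblock, so at
# most 32767 blocks can hold descriptors before reaching group #1.
max_desc_blocks = BLOCKS_PER_GROUP - 1
max_groups = max_desc_blocks * descs_per_block

group_bytes = BLOCKS_PER_GROUP * BLOCK_SIZE  # 128 MiB per group
max_fs_tib = max_groups * group_bytes / 2**40

print(max_groups, "groups,", max_fs_tib, "TiB")
```

With these assumptions the limit comes out at roughly 2M groups and
just under 256 TiB.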

This problem would be avoided by meta_bg, but that also reverts back
to the undesirable behaviour of spreading small metadata chunks all
over the filesystem.  In some respects, meta_bg would be worse than
the normal sparse_super for SMR, since it writes a few blocks every
64 groups, while sparse_super will write a larger number of blocks
together but less often.
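
As a rough illustration of what "a few blocks every 64 groups" looks
like, the sketch below lists the groups that would hold a descriptor
block under meta_bg.  It assumes the usual layout as I understand it
(one descriptor block in the first group of each meta group, with
backups in the second and last group) and 64 descriptors per block.
The function name is made up for this example and is not an
e2fsprogs API.

```python
def meta_bg_desc_groups(num_groups, descs_per_block=64):
    """Groups holding a descriptor block under the assumed meta_bg layout.

    Each meta group spans descs_per_block groups; a descriptor block is
    assumed to live in its first, second, and last group.
    """
    found = set()
    for first in range(0, num_groups, descs_per_block):
        last = min(first + descs_per_block, num_groups) - 1
        for g in (first, first + 1, last):
            # Skip the "second" slot if the meta group is too short.
            if g <= last:
                found.add(g)
    return sorted(found)


print(meta_bg_desc_groups(128))
```

For a 128-group filesystem this gives groups 0, 1, 63, 64, 65 and 127,
i.e. small writes scattered at regular intervals across the device.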

It might make sense to combine meta_bg and flex_bg in this case so
that the superblock and its backups are kept in the same groups as
the bitmaps.  That avoids metadata being spread around the disk.

> One reason why I'm inclined to leave a backup at group #1 is that for
> most file systems, sysadmins are trained to know that there is a
> backup at -b 32768.  If we change it to be something else, it makes it
> a bit harder to find the backup sb, which is a consideration.

I thought that e2fsprogs automatically tries to read each of the
backup superblocks and group descriptors if the primary fails, so
as long as the backup is kept in one of the "known" groups it should
be found automatically?
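
For reference, the "known" groups under sparse_super are group 1 and
the powers of 3, 5 and 7.  A small sketch of that rule, with the
corresponding block numbers for a 4kB blocksize filesystem (32768
blocks per group, first data block 0), is below.  The helper names
are only for illustration; this is not the e2fsprogs code.

```python
def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def sparse_super_backup_groups(num_groups):
    """Groups (other than 0) carrying a backup superblock: 1 and powers of 3, 5, 7."""
    return [g for g in range(1, num_groups)
            if any(is_power_of(g, b) for b in (3, 5, 7))]


def backup_sb_block(group, blocks_per_group=32768, first_data_block=0):
    """Block number to pass to e2fsck -b for the backup in this group."""
    return group * blocks_per_group + first_data_block


groups = sparse_super_backup_groups(50)
print(groups)
print([backup_sb_block(g) for g in groups[:3]])
```

Group 1 maps to the familiar -b 32768, and a backup in group 3 would
be at block 98304 with the same geometry.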

> Yes, bigalloc does change the offset, but that's actually another
> solution I had been looking at for our use case inside google for big
> SMR drives.
> 
> 
>> - To simplify matters, it makes sense that super_sparse supersedes
>>  the sparse_super and meta_bg features. It doesn't make sense
>>  to have both. Should it also require flex_bg?  Without it, it is mostly
>>  useless. 
> 
> Actually, it doesn't supersede meta_bg.  Meta_bg is about where to put
> the block group descriptors to allow for 64-bit online resize, such
> that the bg descriptor blocks are no longer contiguous.  This is
> separate and distinct from the question of which block groups have a
> superblock and the contiguous (aka "old-style") set of block group
> descriptors as backup.
> 
> I agree that for the use case of keeping the data blocks contiguous,
> it only makes sense to use it with flex_bg; but the file system
> options are largely orthogonal, and it doesn't actually simplify
> anything from a code complexity standpoint to require them.  How we
> make it easy for users to request a certain set of features is a
> different question, and that's where I think ultimately mke2fs's -T
> option is going to come in really handy.
> 
> 					- Ted

Cheers, Andreas