Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 22 Sep 2021 23:57:35 -0400

On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices.  We also have no idea what the performance
might be.  It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive.  (i.e.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any
other random variable, including whether "the storage device feels
like it or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes.  And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same.  Users want mke2fs to be fast, especially
during the distro installation process.  That's why we implemented the
lazy inode table initialization feature in the first place.  So
reading each block from the inode table to see if it's zero might
be slow, and so we might be better off just doing the lazy itable init
instead.
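
To make the cost concrete, here is a rough sketch of what "read each
block and see if it's zero" amounts to.  This is illustration only, not
code from mke2fs; the names block_is_zero() and region_is_zero() and the
4096-byte default block size are assumptions made for the example.

```python
def block_is_zero(buf: bytes) -> bool:
    """True if every byte in buf is zero."""
    # bytes.count() scans the buffer in C, so this stays reasonably cheap
    # per block; the expensive part is the read that produced buf.
    return buf.count(0) == len(buf)


def region_is_zero(path: str, offset: int, length: int,
                   block_size: int = 4096) -> bool:
    """True only if every byte in [offset, offset + length) reads back as zero.

    `path` stands in for the device (or image file) holding the inode table.
    """
    with open(path, "rb") as dev:
        dev.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = dev.read(min(block_size, remaining))
            if not chunk:
                # Short read: the region runs past the end of the device,
                # so we cannot claim it is zeroed.
                return False
            if not block_is_zero(chunk):
                return False
            remaining -= len(chunk)
    return True
```

Note that the whole inode table has to be read before the answer is
known to be "yes", which is exactly the I/O that lazy itable init is
meant to keep out of the mke2fs run.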

Hence, I think Sarthak's approach of giving an explicit hint is a good
approach.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum.  Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- it presumes that the inode table block is from an old
instance of the file system, and returns a zero-filled block when
reading that inode table block.  (Right now, e2fsck still offers the
chance to just fix the checksum, a holdover from back when we were
worried there might be bugs in the metadata checksum code.)
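
A toy model of that idea, as a sketch under stated assumptions rather
than a description of e2fsck: each inode record here is a payload
followed by a 4-byte little-endian crc32c, seeded from the file system
UUID and the inode number.  The real ext4 layout differs (the checksum
is split across two fields and also covers the inode generation), and
the names fs_seed(), seal_inode(), inode_ok() and read_itable_block()
are made up for the example.

```python
import struct

CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial
INODE_SIZE = 128          # toy record size: payload plus 4-byte checksum


def _build_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c_raw(crc: int, data: bytes) -> int:
    """crc32c update with no initial or final inversion, so calls can chain."""
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def fs_seed(uuid: bytes) -> int:
    """Per-file-system seed: a new UUID gives a new seed."""
    return crc32c_raw(0xFFFFFFFF, uuid)


def _inode_csum(seed: int, ino: int, payload: bytes) -> int:
    return crc32c_raw(crc32c_raw(seed, struct.pack("<I", ino)), payload)


def seal_inode(seed: int, ino: int, payload: bytes) -> bytes:
    """Append the checksum to a payload of INODE_SIZE - 4 bytes."""
    assert len(payload) == INODE_SIZE - 4
    return payload + struct.pack("<I", _inode_csum(seed, ino, payload))


def inode_ok(seed: int, ino: int, raw: bytes) -> bool:
    (stored,) = struct.unpack("<I", raw[-4:])
    return _inode_csum(seed, ino, raw[:-4]) == stored


def read_itable_block(seed: int, first_ino: int, block: bytes) -> bytes:
    """Return the block, or zeros if no inode in it passes its checksum."""
    count = len(block) // INODE_SIZE
    records = (block[i * INODE_SIZE:(i + 1) * INODE_SIZE] for i in range(count))
    if not any(inode_ok(seed, first_ino + i, rec)
               for i, rec in enumerate(records)):
        # Every checksum failed: presume the block is left over from an
        # old file system instance and treat it as uninitialized.
        return bytes(len(block))
    return block
```

A block written under the old UUID fails every checksum under the new
seed and so reads back as zeros, while a block with at least one valid
inode is passed through for the usual per-inode handling.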

But I don't think the two approaches are mutually exclusive.  The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted