On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage on various
thinly provisioned devices. We also have no idea what the performance
might be. It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive (e.g.,
potentially very S-L-O-W).

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any other
random variable, including whether "the storage device feels like it
or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes. And since we don't know anything about the
performance of write same and what it might do from the perspective
of thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same. Users want mke2fs to be fast, especially
during the distro installation process. That's why we implemented
the lazy inode table initialization feature in the first place. So
reading each block of the inode table to see if it's zero might be
slow, and so we might be better off just doing the lazy itable init
instead. Hence, I think Sarthak's approach of giving an explicit
hint is a good approach.

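To make the cost of the "check before zeroing" idea concrete, here is
a minimal Python sketch (not e2fsprogs code). The function name
zero_blocks_if_needed, the 4 KiB block size, and the in-memory
stand-in for a block device are all assumptions for illustration.
Note that every block is read in full even when nothing ends up being
written, which is the overhead that could make this slower than lazy
itable init.

```python
import io

BLOCK_SIZE = 4096  # assumed block size for this illustration


def zero_blocks_if_needed(dev, first_block, count, block_size=BLOCK_SIZE):
    """Zero a block range, skipping blocks that already read back as zeros.

    `dev` is any seekable binary file object (hypothetical stand-in for
    a block device). Returns the number of blocks actually written.
    """
    zeros = bytes(block_size)
    written = 0
    for blk in range(first_block, first_block + count):
        dev.seek(blk * block_size)
        data = dev.read(block_size)  # one full read per block, always
        if data == zeros:
            continue  # already zero: skip the write entirely
        dev.seek(blk * block_size)
        dev.write(zeros)
        written += 1
    return written


# Usage: a 4-block in-memory "device" where only block 2 holds stale data.
dev = io.BytesIO(
    bytes(BLOCK_SIZE * 2) + b"\xff" * BLOCK_SIZE + bytes(BLOCK_SIZE)
)
written = zero_blocks_if_needed(dev, 0, 4)
```

Here only one block is written, but all four are read.
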
The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID as the seed
for the checksum. Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- it presumes that the inode table block is from an old
instance of the file system, and returns a zero-filled block when
reading that inode table block. (Right now, e2fsck still offers the
chance to just fix the checksum, a holdover from when we were worried
there might be bugs in the metadata checksum code.)

But I don't think the two approaches are mutually exclusive. The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted
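
The checksum idea can be sketched roughly as follows. This is a toy
model, not the ext4 on-disk format: the inode layout, the helper
names, and the use of zlib.crc32 in place of crc32c are assumptions
for illustration. The point is only that an inode table block written
under a different UUID fails every checksum, so the reader can treat
it as zero-filled.

```python
import uuid
import zlib

INODE_SIZE = 256        # assumed inode size
INODES_PER_BLOCK = 16   # assumed: 4096-byte block / 256-byte inodes
CSUM_LEN = 4            # toy layout: last 4 bytes of an inode hold its checksum


def inode_csum(fs_uuid, ino, body):
    """Checksum of one inode, seeded by the file system UUID (toy version)."""
    seed = zlib.crc32(fs_uuid.bytes)
    return zlib.crc32(ino.to_bytes(4, "little") + body, seed)


def make_inode(fs_uuid, ino, payload=b""):
    """Build a toy inode: zero-padded payload followed by its checksum."""
    body = payload.ljust(INODE_SIZE - CSUM_LEN, b"\0")
    return body + inode_csum(fs_uuid, ino, body).to_bytes(CSUM_LEN, "little")


def read_itable_block(fs_uuid, first_ino, block):
    """Return the block, or zeros if every inode checksum in it is wrong."""
    for i in range(INODES_PER_BLOCK):
        raw = block[i * INODE_SIZE:(i + 1) * INODE_SIZE]
        body = raw[:-CSUM_LEN]
        stored = int.from_bytes(raw[-CSUM_LEN:], "little")
        if inode_csum(fs_uuid, first_ino + i, body) == stored:
            return block  # at least one inode belongs to this file system
    # No checksum matched: presume a stale block from an old file system.
    return bytes(len(block))


# Usage: a block written by an "old" file system, read under a new UUID.
old_uuid, new_uuid = uuid.UUID(int=1), uuid.UUID(int=2)
stale = b"".join(
    make_inode(old_uuid, 1 + i, b"stale") for i in range(INODES_PER_BLOCK)
)
seen_by_new = read_itable_block(new_uuid, 1, stale)
seen_by_old = read_itable_block(old_uuid, 1, stale)
```

Under the new UUID the stale block reads back as all zeros, while the
old file system still sees its own inodes unchanged.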