Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Tue, 7 Jan 2025 19:22:50 +0800

On 2025/1/7 0:17, Theodore Ts'o wrote:
> On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote:
>> There's a feature request for something similar on the xfs list, so
>> I guess people are asking for it.
> 
> Yeah, I have folks asking for this on the ext4 side as well.
> 
> The one caution that I've given to them is that there is no guarantee
> what the performance will be for WRITE SAME or equivalent operations,
> since the standards documents state that performance is out of scope
> for the document.  So in some cases, WRITE SAME might be fast (if for
> example it is just adjusting FTL metadata on an SSD, or some similar
> thing on cloud-emulated block devices such as Google's Persistent Disk
> or Amazon's Elastic Block Store --- what Darrick has called "software
> defined storage" for the cloud), but in other hardware deployments,
> WRITE SAME might be as slow as writing zeros to an HDD.
> 
> This is technically not the kernel's problem, since we can also use
> the same mealy-mouthed "performance is out of scope and not the
> kernel's concern", but that just transfers the problem to the
> application programmers.  I could imagine some kind of tunable with
> which we can make the block device pretend that it really doesn't
> support WRITE SAME if the performance characteristics are such that
> it's a Bad Idea to use it, so that there's a single knob that the
> system administrator can reach for, as opposed to PostgreSQL, MySQL,
> Oracle Enterprise Database, etc. each having a different way of
> configuring whether or not to disable WRITE SAME, but that's not
> something we need to decide right away.

Yes, I completely agree. At present, we cannot tell whether a disk can
write zeroes quickly just by checking whether it supports the write
zeroes command. This is especially true for some HDDs, which claim to
support the command but still have to write actual zeroes to the
media, which is very slow.
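
To illustrate the point from userspace: the block layer exposes
write_zeroes_max_bytes under /sys/block/<disk>/queue/, and a non-zero
value only says that an offload exists. The helper names below are
mine, purely for illustration; this is a minimal sketch, not a proposed
interface.

```c
#include <stdio.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file, e.g.
 * /sys/block/sda/queue/write_zeroes_max_bytes.
 * Returns 0 and stores the value in *out on success, -1 on error.
 */
int read_sysfs_u64(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the device advertises a write zeroes offload, 0 if it
 * does not, and -1 if the attribute could not be read. A result of 1
 * says nothing about how fast the operation actually is.
 */
int advertises_write_zeroes(const char *attr_path)
{
	unsigned long long max_bytes;

	if (read_sysfs_u64(attr_path, &max_bytes) != 0)
		return -1;
	return max_bytes > 0;
}
```

An SSD that only updates mapping metadata and an HDD that writes every
sector can both return 1 here, which is exactly the ambiguity.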

Therefore, I propose that we add a new feature flag, such as
BLK_FEAT_FAST_WRITE_ZERO, to queue->limits.features. This flag should
be set by each disk driver if the attached disk supports fast write
zeros. For instance, the NVMe SSD driver should set this flag if the
given namespace supports NVME_NS_DEAC. Additionally, we can add a
sysfs entry that allows the user to enable or disable this feature
manually, for corner cases where the driver does not know whether the
disk supports it.
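
A rough userspace model of the idea, just to make the three roles
concrete (driver sets the flag, admin can override it, consumer tests
it). Everything here is hypothetical: the bit value, struct
model_limits and the model_*() helpers are stand-ins I made up, not
kernel code, and refusing to enable the flag on a device without any
write zeroes offload is my own assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value; the real one would be picked in the kernel. */
#define BLK_FEAT_FAST_WRITE_ZERO (1u << 0)

/* Userspace stand-in for the relevant fields of the queue limits. */
struct model_limits {
	uint32_t features;               /* feature bits set by the driver */
	uint64_t max_write_zeroes_bytes; /* 0: no write zeroes offload */
};

/* Driver side: set the flag only when the device is known to be fast. */
void model_driver_probe(struct model_limits *lim, uint64_t max_wz_bytes,
			bool known_fast)
{
	lim->features = 0;
	lim->max_write_zeroes_bytes = max_wz_bytes;
	if (max_wz_bytes > 0 && known_fast)
		lim->features |= BLK_FEAT_FAST_WRITE_ZERO;
}

/*
 * Admin side: model of the sysfs entry. Returns 0 on success, -1 if
 * the device has no write zeroes offload that could be marked fast.
 */
int model_sysfs_store(struct model_limits *lim, bool enable)
{
	if (!enable) {
		lim->features &= ~BLK_FEAT_FAST_WRITE_ZERO;
		return 0;
	}
	if (lim->max_write_zeroes_bytes == 0)
		return -1;
	lim->features |= BLK_FEAT_FAST_WRITE_ZERO;
	return 0;
}

/* Consumer side: what a filesystem or the block layer would test. */
bool model_fast_write_zero(const struct model_limits *lim)
{
	return (lim->features & BLK_FEAT_FAST_WRITE_ZERO) != 0;
}
```

In this model the NVMe case maps to calling model_driver_probe() with
known_fast set when the namespace has NVME_NS_DEAC, and an HDD that
merely accepts the command would be probed with known_fast clear.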

> 
>> That being said this really should not be a modifier but a separate
>> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
>> similar to how plain prealloc, hole punch and zero range are different
>> operations despite all of them resulting in reads of zeroes from the
>> range.
> 
> Yes.  And we might decide that it should be done using some kind of
> ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
> since it really isn't a filesystem metadata operation, just as
> BLKDISCARD isn't.  The other side of the argument is that ioctls are
> ugly, and maybe all new such operations should be plumbed through via
> fallocate as opposed to adding a new ioctl.  I don't have strong
> feelings on this, although I *do* believe that whatever interface we
> use, whether it be fallocate or ioctl, it should be supported by block
> devices and files in a file system, to make life easier for those
> databases that want to support running on a raw block device (for
> full-page advertisements on the back cover of the Businessweek
> magazine) or on files (which is how 99.9% of all real-world users
> actually run enterprise databases.  :-)
> 

For this part, I still think it would be better to use fallocate.
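
For reference, this is roughly what an application has to do today to
zero a range on either a file or a raw block device using only
existing interfaces: FALLOC_FL_ZERO_RANGE for files and the BLKZEROOUT
ioctl for block devices. The zero_range() helper and its fallback to
plain writes are my own sketch, not part of the patch, and none of it
uses the proposed new flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>
#include <linux/fs.h>

/*
 * Zero [off, off + len) on fd. Returns 0 on success, -1 with errno
 * set on failure.
 */
int zero_range(int fd, off_t off, off_t len)
{
	struct stat st;
	char buf[4096];

	if (fstat(fd, &st) != 0)
		return -1;

	/* Raw block device: use the existing zero-out ioctl. */
	if (S_ISBLK(st.st_mode)) {
		uint64_t range[2] = { (uint64_t)off, (uint64_t)len };

		return ioctl(fd, BLKZEROOUT, range);
	}

	/* File: ask the filesystem to zero the range. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP)
		return -1;

	/* No ZERO_RANGE support: fall back to writing zeroes by hand. */
	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(buf) ?
			       (size_t)len : sizeof(buf);
		ssize_t n = pwrite(fd, buf, chunk, off);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0) {
			if (n == 0)
				errno = EIO;
			return -1;
		}
		off += n;
		len -= n;
	}
	return 0;
}
```

As far as I know, reasonably recent kernels also accept fallocate()
with FALLOC_FL_ZERO_RANGE directly on block devices, so the S_ISBLK
branch could probably be dropped there. That is part of why a single
fallocate-based interface for both cases looks attractive to me.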

Thanks,
Yi.