Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate

On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote:
> There's a feature request for something similar on the xfs list, so
> I guess people are asking for it.

Yeah, I have folks asking for this on the ext4 side as well.

The one caution that I've given to them is that there is no guarantee
about what the performance will be for WRITE SAME or equivalent
operations, since the standards documents declare performance out of
scope.  So in some cases, WRITE SAME might be fast (if, for example,
it just adjusts FTL metadata on an SSD, or does something similar on
cloud-emulated block devices such as Google's Persistent Disk or
Amazon's Elastic Block Store --- what Darrick has called "software
defined storage" for the cloud), but in other hardware deployments,
WRITE SAME might be as slow as writing zeros to an HDD.
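What an application can check today is whether the kernel advertises a zeroing offload at all.  The sketch below assumes the Linux sysfs layout under /sys/block and a kernel new enough to expose queue/write_zeroes_max_bytes.  A nonzero value only means the device accepts the command; it says nothing about how fast the command runs, which is exactly the caution above.  The function names are illustrative.

```python
import os


def write_zeroes_max_bytes(dev, sysfs_root="/sys/block"):
    """Return the write_zeroes_max_bytes limit the kernel reports for `dev`.

    0 means the block layer will not issue a zeroing offload command
    (e.g. SCSI WRITE SAME or NVMe Write Zeroes) to this device and falls
    back to writing zero-filled buffers.  A nonzero value is not a
    performance promise.
    """
    path = os.path.join(sysfs_root, dev, "queue", "write_zeroes_max_bytes")
    with open(path) as f:
        return int(f.read().strip())


def advertises_zeroing_offload(dev, sysfs_root="/sys/block"):
    """True if the device claims to support a zeroing offload."""
    try:
        return write_zeroes_max_bytes(dev, sysfs_root) > 0
    except FileNotFoundError:
        # Unknown device or a kernel without this attribute: assume no offload.
        return False
```

Usage: `for dev in sorted(os.listdir("/sys/block")): print(dev, advertises_zeroing_offload(dev))`.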

This is technically not the kernel's problem, since we can also use
the same mealy-mouthed "performance is out of scope and not the
kernel's concern" line, but that just transfers the problem to the
application programmers.  I could imagine some kind of tunable that
makes the block device pretend that it doesn't support WRITE SAME
when its performance characteristics make using it a Bad Idea.  That
way there's a single knob the system administrator can reach for, as
opposed to PostgreSQL, MySQL, Oracle Enterprise Database, etc. each
having their own way of configuring whether or not to disable WRITE
SAME.  But that's not something we need to decide right away.
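A toy model of the tunable described above: the device's reported limit is capped by an administrator-set value, and every application makes one check against the effective limit.  The class, its fields and the strategy names are hypothetical.  Linux's writable discard_max_bytes, which caps discard_max_hw_bytes, is a loose precedent for this shape, but no equivalent zeroing knob is assumed to exist.

```python
class ZeroingLimits:
    """Toy model of a per-device zeroing limit with an admin override.

    Hypothetical: hw_max_bytes stands for what the device reports, and
    admin_max_bytes for a sysfs-style knob the administrator may lower.
    """

    def __init__(self, hw_max_bytes):
        self.hw_max_bytes = hw_max_bytes
        self.admin_max_bytes = None  # None means "no administrator cap"

    def set_admin_limit(self, nbytes):
        if nbytes < 0:
            raise ValueError("limit must be non-negative")
        self.admin_max_bytes = nbytes

    @property
    def effective_max_bytes(self):
        # The knob can only hide or shrink the offload, never invent one.
        if self.admin_max_bytes is None:
            return self.hw_max_bytes
        return min(self.hw_max_bytes, self.admin_max_bytes)


def zeroing_strategy(limits):
    """The single check every database makes; no per-application setting."""
    return "offload" if limits.effective_max_bytes > 0 else "write-zeros"
```

For example, `ZeroingLimits(32 << 20)` yields "offload" until the administrator calls `set_admin_limit(0)`, after which every caller gets "write-zeros".  Because the cap sits below the applications, none of the databases needs zeroing-specific configuration of its own.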

> That being said this really should not be a modifier but a separate
> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
> similar to how plain prealloc, hole punch and zero range are different
> operations despite all of them resulting in reads of zeroes from the
> range.
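The distinction Christoph draws is visible in the existing fallocate(2) modes.  Preallocation, hole punching and zero range are separate operations with different effects on block allocation, even though each can leave a range reading as zeros.  The sketch below calls libc's fallocate through ctypes.  It assumes 64-bit Linux and a filesystem such as ext4 or XFS that supports all three modes; tmpfs, for instance, rejects FALLOC_FL_ZERO_RANGE.  The `demo` helper is illustrative only.

```python
import ctypes
import os

# Mode flags from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def fallocate(fd, mode, offset, length):
    """Thin wrapper that raises OSError on failure."""
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


PREALLOCATE = 0                                          # allocate blocks; new ranges read as zeros
PUNCH_HOLE = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE  # free blocks; KEEP_SIZE is mandatory
ZERO_RANGE = FALLOC_FL_ZERO_RANGE                        # reads as zeros, blocks stay allocated


def demo(path, size=1 << 20):
    """Exercise each mode on a scratch file; True if both ranges read back as zeros."""
    with open(path, "wb+") as f:
        fd = f.fileno()
        fallocate(fd, PREALLOCATE, 0, size)     # file is now `size` bytes of zeros
        os.pwrite(fd, b"x" * 4096, 0)
        fallocate(fd, PUNCH_HOLE, 0, 4096)      # first 4 KiB reads as zeros again
        os.pwrite(fd, b"y" * 4096, 4096)
        fallocate(fd, ZERO_RANGE, 4096, 4096)   # second 4 KiB zeroed in place
        return os.pread(fd, 8192, 0) == bytes(8192)
```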

Yes.  And we might decide that it should be done using some kind of
ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
since it really isn't a filesystem metadata operation, just as
BLKDISCARD isn't.  The other side of the argument is that ioctls are
ugly, and maybe all new such operations should be plumbed through
fallocate as opposed to adding a new ioctl.  I don't have strong
feelings on this, although I *do* believe that whatever interface we
use, whether it be fallocate or ioctl, it should be supported by both
block devices and files in a file system, to make life easier for
those databases that want to support running on a raw block device
(for full-page advertisements on the back cover of Businessweek
magazine) or on files (which is how 99.9% of all real-world users
actually run enterprise databases.  :-)
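To make the "block devices and files" point concrete, here is a sketch of the branching a database ends up doing when the two cases use different interfaces.  It uses the BLKZEROOUT ioctl (value `_IO(0x12, 127)` from <linux/fs.h>) for block devices and fallocate(FALLOC_FL_ZERO_RANGE) for regular files, and assumes 64-bit Linux.  The function names are illustrative.  Current kernels also accept FALLOC_FL_ZERO_RANGE on block device nodes, so the ioctl branch is shown only to illustrate the ioctl style.

```python
import ctypes
import fcntl
import os
import stat
import struct

BLKZEROOUT = 0x127F          # _IO(0x12, 127) in <linux/fs.h>
FALLOC_FL_ZERO_RANGE = 0x10  # <linux/falloc.h>

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def zero_range_plan(st_mode):
    """Pick the interface for a file type; pure, so it can be tested anywhere."""
    if stat.S_ISBLK(st_mode):
        return "BLKZEROOUT"
    if stat.S_ISREG(st_mode):
        return "FALLOC_FL_ZERO_RANGE"
    raise ValueError("only block devices and regular files can be zeroed")


def zero_range(fd, offset, length):
    """Zero [offset, offset + length) on a block device or a regular file."""
    plan = zero_range_plan(os.fstat(fd).st_mode)
    if plan == "BLKZEROOUT":
        # The ioctl argument is a uint64_t[2] of {start, length} in bytes,
        # aligned to the device's logical block size; fd must be writable.
        fcntl.ioctl(fd, BLKZEROOUT, struct.pack("=QQ", offset, length))
    elif _libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The `if` on file type is the per-application special-casing that a single interface, supported for both block devices and files, would let callers drop.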

						- Ted



