Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 4 Jan 2021 10:19:58 -0800

On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
> 
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
> 
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small, writes) is quite remarkable. Even
> on quite fast devices:
> 
>     andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
>     /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
> 
>     andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
> 
>     andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     z262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
> 
> 
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
> 
> 
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
> 
> 
> Which brings me to $subject:
> 
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
> myself, but ...
> 
> Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
> sense, as that'd work reasonably efficiently to initialize newly
> allocated space as well as for zeroing out previously used file space.
> 
> 
> As blkdev_issue_zeroout() already has a fallback path it seems this
> should be doable without too much concern for which devices have write
> zeroes, and which do not?

Question: do you want the kernel to write zeroes even for devices that
don't support accelerated zeroing?  Since I assume that if the fallocate
fails you'll fall back to writing zeroes from userspace anyway...

Second question: Would it help to have a FALLOC_FL_DRY_RUN flag that
could be used to probe if a file supports fallocate without actually
changing anything?  I'm (separately) pursuing a fix for the loop device
not being able to figure out if a file actually supports a particular
fallocate mode.

--D

> Greetings,
> 
> Andres Freund