On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
>
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
>
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small, writes) is quite remarkable. Even
> on quite fast devices:
>
> andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
> /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
>
> andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
>
> andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
>
>
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
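For illustration only, the manual workaround described in the quoted
text might look roughly like the sketch below. It is not code from
either poster. The helper name prezero_file and the 1 MiB chunk size
are assumptions made for the example, and it uses posix_fallocate()
plus ordinary buffered pwrite() calls rather than O_DIRECT, to keep
the sketch short.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)  /* assumed write size; tune per device */

/*
 * Hypothetical helper: preallocate `size` bytes, then overwrite them
 * with zeroes using large writes, so that later small O_DIRECT | O_DSYNC
 * writes land on already-written extents.
 * Returns 0 on success, -1 on error with errno set.
 */
int prezero_file(int fd, off_t size)
{
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        errno = err;  /* posix_fallocate returns the error number directly */
        return -1;
    }

    char *buf = calloc(1, CHUNK_SIZE);  /* zero-filled buffer */
    if (buf == NULL)
        return -1;

    off_t off = 0;
    while (off < size) {
        size_t want = CHUNK_SIZE;
        if ((off_t)want > size - off)
            want = (size_t)(size - off);

        ssize_t n = pwrite(fd, buf, want, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted before writing anything; retry */
            int saved = errno;
            free(buf);
            errno = saved;
            return -1;
        }
        off += n;  /* short writes simply advance by what was written */
    }
    free(buf);

    /* Push the zeroes (and the resulting extent state) to stable storage. */
    return fdatasync(fd);
}
```

A caller would run this once when creating a journal segment, then
reopen or reuse the file with O_DIRECT | O_DSYNC and recycle it
instead of allocating a fresh one. This is the userspace I/O that the
rest of the quoted message would like the kernel to take over.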
>
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
>
>
> Which brings me to $subject:
>
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> myself, but ...

We have explicit requests from users (think initialising large VM
images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
zeroes manually, because those users want us to guarantee that
FALLOC_FL_ZERO_RANGE is *always* going to be faster than writing a
large range of zeroes. They also want FALLOC_FL_ZERO_RANGE to fail
if it can't zero the range by metadata manipulation and would need
to write zeroes, because then they can make the choice on how to
initialise the device (e.g. at runtime, via on-demand ZERO_RANGE
calls, by writing zeroes to pad partial blocks, etc.). That bird has
already flown, so we can't really do that retrospectively, but we
really don't want to make life worse for these users.

IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
zeroes, we have users who explicitly don't want it to do this.

Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
filesystem to convert an unwritten range of zeroes to a written range
by manually writing zeroes. i.e. you do FALLOC_FL_ZERO_RANGE to zero
the range and fill holes using metadata manipulation, followed by
FALLOC_FL_CONVERT_RANGE to then convert the "metadata zeroes" to real
written zeroes.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx