Hi John, On Mon, Dec 04, 2023 at 02:44:36PM +0000, John Garry wrote: > On 04/12/2023 13:38, Ojaswin Mujoo wrote: > > > So are we supposed to be doing atomic writes on unwritten ranges only in the > > > file to get the aligned allocations? > > If we do an atomic write on a hole, ext4 will give us an aligned extent > > provided the hole is big enough to accomodate it. > > > > However, if we do an atomic write on a existing extent (written or > > unwritten) ext4 would check if it satisfies the alignment and length > > requirement and returns an error if it doesn't. > > This seems a rather big drawback. So if I'm not wrong, we force the extent alignment as well as the size of the extent in xfs right. We didn't want to overly restrict the users of atomic writes by forcing the extents to be of a certain alignment/size irrespective of the size of write. The design in this patchset provides this flexibility at the cost of some added precautions that the user should take (eg not doing an atomic write on a pre existing unaligned extent etc). However, I don't think it should be too hard to provide an optional forced alignment feature on top of this if there's interest in that. > > > Since we don't have cow > > like functionality afaik the only way we could let this kind of write go > > through is by punching the pre-existing extent which is not something we > > can directly do in the same write operation, hence we return error. > > Well, as you prob saw, for XFS we were relying on forcing extent alignment, > and not CoW (yet). That's right. > > > > > > I actually tried that, and I got a WARN triggered: > > > > > > # mkfs.ext4 /dev/sda > > > mke2fs 1.46.5 (30-Dec-2021) > > > Creating filesystem with 358400 1k blocks and 89760 inodes > > > Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 > > > Superblock backups stored on blocks: > > > 8193, 24577, 40961, 57345, 73729, 204801, 221185 > > > > > > Allocating group tables: done > > > Writing inode tables: done > > > Creating journal (8192 blocks): done > > > Writing superblocks and filesystem accounting information: done > > > > > > [ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left > > > # mount /dev/sda mnt > > > [ 12.798804] EXT4-fs (sda): mounted filesystem > > > 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota > > > mode: none. > > > # touch mnt/file > > > # > > > # /test-statx -a /root/mnt/file > > > statx(/root/mnt/file) = 0 > > > dump_statx results=5fff > > > Size: 0 Blocks: 0 IO Block: 1024 regular file > > > Device: 08:00 Inode: 12 Links: 1 > > > Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 > > > Access: 2023-12-04 10:27:40.002848720+0000 > > > Modify: 2023-12-04 10:27:40.002848720+0000 > > > Change: 2023-12-04 10:27:40.002848720+0000 > > > Birth: 2023-12-04 10:27:40.002848720+0000 > > > stx_attributes_mask=0x703874 > > > STATX_ATTR_WRITE_ATOMIC set > > > unit min: 1024 > > > uunit max: 524288 > > > Attributes: 0000000000400000 (........ ........ ........ ........ > > > ........ .?--.... ..---... .---.-..) > > > # > > > > > > > > > > > > looks ok so far, then write 4KB at offset 0: > > > > > > # /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file > > > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24 > > ... > > > > Please note that I tested on my own dev branch, which contains changes over > > > [1], but I expect it would not make a difference for this test. > > Hmm this should not ideally happen, can you please share your test > > script with me if possible? > > It's doing nothing special, just RWF_ATOMIC flag is set for DIO write: > > https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c Thanks for the script, will try to replicate this today and get back to you. > > > > > > > > > Script to test using pwritev2() can be found here: > > > > https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9 > > > Please note that the posix_memalign() call in the program should PAGE align. > > Why do you say that? direct IO seems to be working when the userspace > > buffer is 512 byte aligned, am I missing something? > > ah, sorry, if you just use 1x IOV vector then no special alignment are > required, so ignore this. Indeed, I need to improve kernel checks for > alignment anyway. > > Thanks, > John > Regards, ojaswin