Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled

Jan Kara <jack@xxxxxxx> writes:

[...]

>> I've re-run the same code with kernel 6.5.0, and indeed the behavior
>> has changed: an actual NVMe flush command now seems to be issued (the
>> flags passed to nvme_setup_cmd match the ones I see when I write to
>> the raw block device). So that part seems fixed. I thought I had tried
>> with 6.5.1 when I posted this issue, and that the problem was present
>> there too, but maybe I was mistaken.
>> 
>> I'm not entirely sure what you mean about flushing the inode buffer
>> properly. As far as I can tell, the current behavior matches what I'd
>> expect.
>
> The thing is: fallocate(2) does preallocate blocks for the file, so block
> allocation is not needed when writing to those blocks. However, that does
> not mean metadata changes are not needed when writing to them. In the
> case of ext4 (and of other filesystems such as xfs or btrfs), blocks
> allocated using fallocate(2) are tracked as unwritten in the inode's
> metadata (so reads from them return 0 instead of random garbage). After
> the block contents are written with an io_uring write, we need to convert
> the state of the blocks from unwritten to written, and that needs
> metadata modifications. So the first write to the file after the
> fallocate(2) call still needs to make metadata modifications, and these
> need to be persisted as part of fdatasync(2). And by mistake this did not
> happen in nojournal mode.

Thanks for the explanation.
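
For context, this is roughly the pattern my benchmark follows (a
simplified sketch: the file name and sizes are made up, and the real code
submits the write through io_uring with WRITE_FIXED rather than
pwritev2(), but the fallocate() + O_DIRECT + RWF_DSYNC sequence is the
same):

/* Simplified reproducer sketch; most error handling omitted. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEG_SIZE (8 << 20)      /* preallocated segment, illustrative */
#define BLK_SIZE 4096           /* one direct I/O write */

int main(void)
{
        int fd = open("segment-1", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        struct iovec iov;
        void *buf;

        /* Preallocate: blocks are reserved but tracked as unwritten. */
        fallocate(fd, 0, 0, SEG_SIZE);

        if (posix_memalign(&buf, BLK_SIZE, BLK_SIZE))
                return 1;
        memset(buf, 0xab, BLK_SIZE);
        iov.iov_base = buf;
        iov.iov_len = BLK_SIZE;

        /*
         * First write into the preallocated range: besides the data,
         * the unwritten->written extent conversion has to be persisted
         * for RWF_DSYNC to actually mean anything.
         */
        pwritev2(fd, &iov, 1, 0, RWF_DSYNC);

        close(fd);
        return 0;
}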

Since I want to achieve the lowest possible latency for these writes, is
there a way not only to preallocate blocks with fallocate() but also to
avoid the extra write for the metadata modifications that you mention?

I mean, I could of course take the brute-force approach of performing
dumb "pre" writes of those blocks (along the lines of the sketch below),
but I'm wondering if there's a better way that doesn't require writing
those blocks twice (which might end up defeating the purpose of lowering
overall latency).
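
By dumb "pre" writes I mean something like the following one-time setup
pass (again only a sketch; the helper name is mine, and the single
fdatasync() at the end is just to illustrate paying the conversion cost
once), so that the later latency-critical writes land on blocks that are
already marked written:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Pay the unwritten->written conversion up front, outside the hot path. */
int prewrite_segment(int fd, off_t size, size_t blk)
{
        void *buf;
        off_t off;

        if (posix_memalign(&buf, blk, blk))
                return -1;
        memset(buf, 0, blk);

        for (off = 0; off < size; off += blk) {
                if (pwrite(fd, buf, blk, off) != (ssize_t)blk) {
                        free(buf);
                        return -1;
                }
        }
        free(buf);

        /* One fdatasync() persists both the data and the extent metadata. */
        return fdatasync(fd);
}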

>
>> For reference, I'm attaching below the trace of the same user code,
>> this time run on kernel 6.5.0, which is the one currently shipping in
>> Debian/testing. Note that noticeably fewer trace lines are emitted by
>> the ext4 subsystem; I'm not sure if that's related/relevant.
>> 
>>   raft-benchmark-35708   [000] .....  9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
>>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>>   raft-benchmark-35708   [000] .....  9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
>>   raft-benchmark-35708   [000] .....  9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
>>   raft-benchmark-35708   [000] .....  9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-35708   [000] .....  9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-35708   [000] .....  9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>>   raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
>>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271124: block_plug: [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
>>   raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] d.h..  9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
>>   raft-benchmark-35708   [000] d.h..  9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
>>   raft-benchmark-35708   [000] d.h..  9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
>>   raft-benchmark-35708   [000] dNh..  9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
>>   raft-benchmark-35708   [000] ...1.  9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
>>   raft-benchmark-35708   [000] .....  9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1
>
> So in this case the file blocks seem to have already been written.

Do you mean that they had already been written at some point after the
file system was created?

I guess it doesn't matter whether they were written as part of the new
file that is being written now (the one created with open()/fallocate()),
or earlier as part of some other file that has since been deleted.

> In this
> case we don't need to do any block conversions (thus no metadata changes),
> and we also submit the write, including the flush, in a single block
> request. In the trace:
>
> raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>
> this line shows the request submission. The type of request is 'WFS',
> which means 'write' with 'flush' (this makes sure the write does not just
> end up in the device's cache) and 'sync' (which just says somebody is
> waiting for the completion of this IO, so it should be prioritized by IO
> scheduling algorithms).

Thanks, that's helpful.
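
For my own notes, this is how I now read the letters in those block
trace lines. It's a toy sketch modeled on my reading of the kernel's
blk_fill_rwbs(); the bit values below are made up and only the letter
mapping matters, so please treat it as an assumption rather than a
reference:

#include <stdio.h>

/* Illustrative bit positions only, not the kernel's REQ_* values. */
#define OPF_PREFLUSH (1u << 0)  /* flush the cache before the write */
#define OPF_WRITE    (1u << 1)
#define OPF_FUA      (1u << 2)  /* don't stop at the device cache */
#define OPF_SYNC     (1u << 3)  /* somebody is waiting on this IO */

static void fill_rwbs(char *rwbs, unsigned int opf)
{
        int i = 0;

        if (opf & OPF_PREFLUSH)
                rwbs[i++] = 'F';
        if (opf & OPF_WRITE)
                rwbs[i++] = 'W';
        if (opf & OPF_FUA)
                rwbs[i++] = 'F';
        if (opf & OPF_SYNC)
                rwbs[i++] = 'S';
        rwbs[i] = '\0';
}

int main(void)
{
        char rwbs[8];

        fill_rwbs(rwbs, OPF_WRITE | OPF_FUA | OPF_SYNC);
        printf("%s\n", rwbs);   /* prints "WFS", as in the trace above */
        return 0;
}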
