Re: [PATCH] ext4: defer updating i_disksize until endio

On 2023/3/29 11:36, Chung-Chiang Cheng wrote:
> On Mon, Mar 27, 2023 at 7:17 PM Zhang Yi <yi.zhang@xxxxxxxxxx> wrote:
>>
>> On 2023/3/27 18:28, Chung-Chiang Cheng wrote:
>>> It's a pity that this issue also occurs with data=ordered due to delayed
>>> allocation being enabled by default. If delayed allocation were disabled,
>>> it would not be as easy to reproduce.
>>>
>>> This is because if data is written to the end of a file and the block is
>>> allocated, the new i_disksize will be immediately committed to the journal
>>> at ext4_da_write_end(), but the writeback procedure is not yet triggered.
>>> By default, ext4 commits the journal every 5 seconds, but a dirty page may
>>> not be written back until 30 seconds later. This is not a short time window,
>>> and any improper shutdown during this time may lead to the issue :(
>>>
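For concreteness, here is a minimal reproduction sketch of that window,
assuming an ext4 mount at /mnt with the default commit interval and
dirty-page expiry (the path, sizes, and timings are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char base[1024];
            int fd = open("/mnt/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            memset(base, 'a', sizeof(base));

            /* Fill part of a 4k block, force allocation and writeback. */
            write(fd, base, sizeof(base));
            fsync(fd);

            /*
             * Intra-block append: the block is already allocated, so the
             * new i_disksize is journaled at ext4_da_write_end().
             */
            write(fd, "appended", 8);
            close(fd);      /* no fsync() for the append */

            /*
             * The journal commits the new i_disksize within ~5s, but the
             * dirty page may stay unwritten for up to ~30s.  A power cut
             * in that window leaves the extended size on disk while the
             * appended bytes read back as zeroes.
             */
            sleep(10);
            return 0;
    }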
> 
> Thank you and Jan for the explanation. I agree that it is not the
> responsibility of ext4 to provide application consistency, but this case is
> not even crash consistent, although no sensitive data is revealed after a
> crash.
> 
>> It seems that the case you've mentioned is an intra-block append write (no?).
>> The current data=ordered mount option doesn't help in this case because
>> ext4_map_blocks() doesn't attach the inode to the t_inode_list of the running
>> transaction. If delayed allocation were disabled, the data-loss window would
>> still be there, because ext4_write_end()->ext4_update_inode_size() also
>> updates i_disksize before writing data back. At least this guarantees that
>> no stale data is exposed. We had discussed this in [1].
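
A rough sketch of closing that gap, assuming the existing
ext4_jbd2_inode_add_write() helper; the wrapper name below is hypothetical,
not the actual patch:

    /* Sketch only: fs/ext4 context, needs "ext4_jbd2.h". */
    static int order_intra_block_append(handle_t *handle, struct inode *inode,
                                        loff_t pos, loff_t len)
    {
            /*
             * Queue the inode on the running transaction's t_inode_list so
             * that jbd2 writes the dirty data range back before the
             * transaction carrying the enlarged i_disksize commits.
             */
            return ext4_jbd2_inode_add_write(handle, inode, pos, len);
    }

With the inode on the ordered list, a commit can no longer make the size
update durable ahead of the data it covers.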
> 
> Yes, you're right. I've reconfirmed my experiment and determined that this
> issue can be reproduced with or without delayed allocation.
> 
> I've tried your previous solution of adding the required inode to the current
> transaction's ordered data list. It seems to work perfectly for me and
> cleanly solves the issue, but the journal handling needs to be added back to
> the delayed allocation write path. Does it really have a noticeable
> performance impact?
> 

It depends on the write pattern (the proportion of partial-block writes). I
tested fio sequential writes with bs=1k [1] on a file system with the default
4k block size, so every write was a partial-block write. The test cases were
limited, but I hope the numbers are helpful to you: there was about 30%
degradation at that time. I haven't tested it recently; the degradation on the
delayed allocation write path could be even larger now, since we have removed
the journal handling there. Or you could test it on your own products.

Thanks,
Yi.

[1] fio --name=foo --size=5G --bs=1k --numjobs=24 --iodepth=1 --rw=write \
        --norandommap --group_reporting --runtime=100 --time_based \
        --nrfiles=3 --directory=/mnt/ --fallocate=none --fsync=1


