Re: [PATCH -next] ext4: ensure data forced to disk when rename

Zhang Yi <yi.zhang@xxxxxxxxxx> · Fri, 2 Sep 2022 11:31:25 +0800

On 2022/9/1 22:45, Theodore Ts'o wrote:
> On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote:
>>
>> Our case is modifing a no-empty file names "data.conf" through vim, it will
>> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp"
>> back to "data.conf" and fsync after modifying. If we power down between rename
>> and fsync, we may lose all data in the original "data.conf" and get a whole
>> zero file. Before we enable dioread_nolock by defalut, the newly appending data
>> may lost, but at least we will not lose the original data in the default
>> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the
>> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename
>> in vim ?
> > What I normally recommend is to write the new contents of "data.conf"
> to "data.conf.new".  Then fsync the file descriptor corresponding to
> data.conf.new.  If you want to backup data.conf, you can create a hard
> link of data.conf to data.conf.bak, then rename data.conf.new on top
> of data.conf.
> 
> The auto_da_alloc mode is a best effort attempt to protect against bad
> application programs, and it never was a guarantee that it would
> *always* protect against data loss if you had a overwriting rename
> racing against a power failure.  Without auto_da_alloc mode, the
> window where the user might lose data if they application failed to do
> the fsync() was 30 seconds.  With auto_da_alloc (and if dioread_nolock
> was disabled), that window was reduced to say, a few hundred

Thanks for the explanation and suggestions, I totally agree with you that
applications SHOULD be calling fsync() before an overwriting rename() and
the filesystem just do best efforts.

For the above "with auto_da_alloc and no dioread_nolock" case, I'm not sure
I understand right, please correct me if I say wrong. With auto_da_alloc,
data=ordered and no dioread_nolock, it separates into two cases: 1) the new
written contents was appended to an allocated block; 2) the new contents
was written to a new block.

For case 1, we have a few hundred milliseconds window as you said,
because the inode will not be added to transaction->t_inode_list, so
the transaction committing progress does not wait data blocks to
write to complete. For case 2, the inode will be added to
transaction->t_inode_list, we can guarantee that data write to disk
before the rename() completes, so there is no window for this case.

Thanks,
Yi.

> milliseconds.  With dioread_nolock, that window was increased to
> somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the
> time for the write to complete (a few hundred milliseconds).
> 
> But note that your proposed patch doesn't actually really improve
> things all that much.  That's because calling filemap_datawrite() only
> waits for the write request to complete.  But you still need to update
> the metadata before the blocks become visible after a crash.  That
> requires forcing a journal commit, and waiting for that commit to
> complete, which is what ext4_sync_file() does, and which is of course,
> quite expensive.
> 
> We could add something to the end of ext4_convert_unwritten_io_end_vec()
> where if the inode is da_alloc eligible, we trigger an asynchronous
> journal commit; that is, once we converted all of the unwritten extents,
> if the file has been closed after being opened with O_TRUNC, or the file has
> been renamed on top of a pre-existing file and there were dirty blocks
> that were flushed out when we called ext4_alloc_da_blocks(), that
> ext4_convert_unwritten_io_end_vec() would then request that a journal
> commit be started.   But we'd want to see what sort of performance regression
> this adds, since nothing is for free, and the goal here is to paper over
> buggy applications without imposing too much of a performance cost.
> 
> But it should be clear that this is *buggy* applications, and
> applications SHOULD be calling fsync() before an overwriting rename()
> if they care about data not being lost if there is a power failure at
> an inconvenient point.   All we are trying to do here is to reduce the
> probability of data loss caused by buggy applications.  There was never
> any guarantees.
> 
> Cheers,
> 
> 						- Ted
> 
> .
>