On 2022/9/1 22:45, Theodore Ts'o wrote: > On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote: >> >> Our case is modifing a no-empty file names "data.conf" through vim, it will >> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp" >> back to "data.conf" and fsync after modifying. If we power down between rename >> and fsync, we may lose all data in the original "data.conf" and get a whole >> zero file. Before we enable dioread_nolock by defalut, the newly appending data >> may lost, but at least we will not lose the original data in the default >> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the >> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename >> in vim ? > > What I normally recommend is to write the new contents of "data.conf" > to "data.conf.new". Then fsync the file descriptor corresponding to > data.conf.new. If you want to backup data.conf, you can create a hard > link of data.conf to data.conf.bak, then rename data.conf.new on top > of data.conf. > > The auto_da_alloc mode is a best effort attempt to protect against bad > application programs, and it never was a guarantee that it would > *always* protect against data loss if you had a overwriting rename > racing against a power failure. Without auto_da_alloc mode, the > window where the user might lose data if they application failed to do > the fsync() was 30 seconds. With auto_da_alloc (and if dioread_nolock > was disabled), that window was reduced to say, a few hundred Thanks for the explanation and suggestions, I totally agree with you that applications SHOULD be calling fsync() before an overwriting rename() and the filesystem just do best efforts. For the above "with auto_da_alloc and no dioread_nolock" case, I'm not sure I understand right, please correct me if I say wrong. With auto_da_alloc, data=ordered and no dioread_nolock, it separates into two cases: 1) the new written contents was appended to an allocated block; 2) the new contents was written to a new block. For case 1, we have a few hundred milliseconds window as you said, because the inode will not be added to transaction->t_inode_list, so the transaction committing progress does not wait data blocks to write to complete. For case 2, the inode will be added to transaction->t_inode_list, we can guarantee that data write to disk before the rename() completes, so there is no window for this case. Thanks, Yi. > milliseconds. With dioread_nolock, that window was increased to > somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the > time for the write to complete (a few hundred milliseconds). > > But note that your proposed patch doesn't actually really improve > things all that much. That's because calling filemap_datawrite() only > waits for the write request to complete. But you still need to update > the metadata before the blocks become visible after a crash. That > requires forcing a journal commit, and waiting for that commit to > complete, which is what ext4_sync_file() does, and which is of course, > quite expensive. > > We could add something to the end of ext4_convert_unwritten_io_end_vec() > where if the inode is da_alloc eligible, we trigger an asynchronous > journal commit; that is, once we converted all of the unwritten extents, > if the file has been closed after being opened with O_TRUNC, or the file has > been renamed on top of a pre-existing file and there were dirty blocks > that were flushed out when we called ext4_alloc_da_blocks(), that > ext4_convert_unwritten_io_end_vec() would then request that a journal > commit be started. But we'd want to see what sort of performance regression > this adds, since nothing is for free, and the goal here is to paper over > buggy applications without imposing too much of a performance cost. > > But it should be clear that this is *buggy* applications, and > applications SHOULD be calling fsync() before an overwriting rename() > if they care about data not being lost if there is a power failure at > an inconvenient point. All we are trying to do here is to reduce the > probability of data loss caused by buggy applications. There was never > any guarantees. > > Cheers, > > - Ted > > . >