Hi Matthew!

[added ext4 mailing list to CC, maybe others have more ideas]

On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> Just had a general question in regards to some recent filesystem (ext4)
> behaviour I've recently observed, which kind of made my eyebrows raise a
> little and I wanted to understand why this was happening.
>
> We have an application (single threaded process) that basically performs
> the following sequence of filesystem operations using buffered I/O:
>
> ---
> fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> ...
> write(fd, buf, sizeof(buf));
> ...
> rename("dir/tmp/filename.new", "dir/new/filename");
> ---
>
> At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> and the result of the write reported as being successful. This is the part
> I cannot come up with a valid explanation for (yet).

So by "file size reporting 0 bytes" do you mean that
stat("dir/new/filename") from a concurrent process returns file size 0
sometimes? Or do you refer to a situation after an unclean filesystem
shutdown?

> Understandably,
> there's no fsync being currently performed post calling write, which I
> think needs to be corrected, but I also can't see how not using fsync post
> write would result in the file size for "dir/new/filename" being reported
> as 0 bytes? One of the things that crossed my mind was that the rename
> operation was possibly being committed prior to the dirty pages from the
> pagecache being flushed, but regardless I don't see how a rename would
> result in the data blocks associated to the write not ever being committed
> for the same underlying inode?
>
> What are your thoughts? Any plausible explanation why I might be seeing
> this odd behaviour?

Ext4 uses delayed allocation. That means that write(2) just stores data in
the page cache but no blocks are allocated yet.
So indeed rename(2) can be fully committed in the journal before any of the
data gets to persistent storage. That being said ext4 has a workaround for
buggy applications (can be disabled with "noauto_da_alloc" mount option)
that starts data writeback before rename is done so at least in
data=ordered mode you should not see 0 length files after a crash with the
above scheme.

WRT concurrent process seeing 0 file length, I would not have a great
explanation for that because once data is written to the inode,
inode->i_size is set to the final inode size which is what stat(2) reports.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR