Re: General Filesystem Question - Interesting Unexplainable Observation

Jan Kara <jack@xxxxxxx> · Wed, 2 Nov 2022 15:22:20 +0100

Hello Matt!

On Wed 02-11-22 03:07:55, Matt Bobrowski wrote:
> On Mon, Oct 31, 2022 at 12:22:37PM +0100, Jan Kara wrote:
> > Hi Matthew!
> > 
> > [added ext4 mailing list to CC, maybe others have more ideas]
> > 
> > On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> > > Just had a general question in regards to some recent filesystem (ext4)
> > > behaviour I've recently observed, which kind of made my eyebrows raise a
> > > little and I wanted to understand why this was happening.
> > > 
> > > We have an application (single threaded process) that basically performs
> > > the following sequence of filesystem operations using buffered I/O:
> > > 
> > > ---
> > > fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> > > ...
> > > write(fd, buf, sizeof(buf));
> > > ...
> > > rename("dir/tmp/filename.new", "dir/new/filename");
> > > ---
> > > 
> > > At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> > > sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> > > and the result of the write reported as being successful. This is the part
> > > I cannot come up with a valid explanation for (yet).
> > 
> > So by "file size reporting 0 bytes" do you mean that
> > stat("dir/new/filename") from a concurrent process returns file size 0
> > sometimes?
> 
> Not quite, meaning that stat("dir/new/filename") is reporting 0 bytes
> long after the write(2) operation had occurred. IOW, I'm seeing 0 byte
> files lying around when they well and truly should have had bytes
> written out to them (before a write(2) is issued we check to make sure
> that the supplied buffer actually has something in it) i.e. manually
> stat'ing them in a shell.

I see. So the inode got written to disk with size 0.

> > Or do you refer to a situation after an unclean filesystem
> > shutdown?
> 
> It could very well be from an unclean shutdown, but it's really hard
> to say whether this is the culprit or not.

I see, ok.

> > > Understandably,
> > > there's no fsync being currently performed post calling write, which I
> > > think needs to be corrected, but I also can't see how not using fsync post
> > > write would result in the file size for "dir/new/filename" being reported
> > > as 0 bytes? One of the things that crossed my mind was that the rename
> > > operation was possibly being committed prior to the dirty pages from the
> > > pagecache being flushed, but regardless I don't see how a rename would
> > > result in the data blocks associated to the write not ever being committed
> > > for the same underlying inode?
> > > 
> > > What are your thoughts? Any plausible explanation why I might be seeing
> > > this odd behaviour?
> > 
> > Ext4 uses delayed allocation. That means that write(2) just stores data in
> > the page cache but no blocks are allocated yet. So indeed rename(2) can be
> > fully committed in the journal before any of the data gets to persistent
> > storage. That being said ext4 has a workaround for buggy applications (can
> > be disabled with "noauto_da_alloc" mount option) that starts data writeback
> > before rename is done so at least in data=ordered mode you should not see 0
> > length files after a crash with the above scheme.
> 
> Right, we are using buffered I/O after all... However, even if the
> rename(2) operation took place and was fully committed to the journal
> before the dirty pages associated to the prior write(2) had been
> written back, I wouldn't expect the data to be missing? IOW, the
> write(2) and rename(2) operations are taking effect on the same
> backing inode, no?

No. Because inode size changes as well as block allocation changes get
added to the journal only once the writeback happens. So until writeback
starts, rename(2) and write(2) can be arbitrarily reordered (or you can
even see only part of the write being completed).
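
For illustration, here is a minimal sketch of the usual way to close that
window: fsync(2) the temporary file before the rename(2), then fsync(2) the
destination directory so the new directory entry is persistent too. The
helper names (write_all(), replace_file_durably()) and their parameters are
made up for this example and are not taken from your application. It assumes
both paths are on the same filesystem and it leaves the temporary file behind
on error.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write(2) until all of buf is written. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: write buf to tmp_path, force it to stable storage,
 * rename it to final_path, then sync the directory holding final_path.
 * Returns 0 on success, -1 on error with errno set.
 */
int replace_file_durably(const char *tmp_path, const char *final_path,
			 const char *final_dir, const void *buf, size_t len)
{
	int fd, dfd, ret;

	/* Same flags and mode as the sequence quoted earlier in the thread. */
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0400);
	if (fd < 0)
		return -1;

	/* fsync() forces writeback, so blocks and i_size reach the disk
	 * before the rename can be committed. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	if (close(fd) < 0)
		return -1;

	if (rename(tmp_path, final_path) < 0)
		return -1;

	/* Make the new directory entry itself persistent. */
	dfd = open(final_dir, O_RDONLY);
	if (dfd < 0)
		return -1;
	ret = fsync(dfd);
	close(dfd);
	return ret;
}
```

With the paths from your example, the call would look something like
replace_file_durably("dir/tmp/filename.new", "dir/new/filename", "dir/new",
buf, sizeof(buf)).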

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR