On Mon, Jan 13, 2025 at 07:41:53AM +0000, Artem S. Tashkinov wrote:
> Let's say you chmod two files sequentially:
>
> What's happening:
>
> 1) the kernel looks up an inode
> 2) the kernel then updates the metadata (at the very least one logical
> block is being written, that's 4K bytes)
>
> ditto for the second file.
>
> Now let's say we are updating 10000 files.
>
> Does this mean that at least 40MB of data will be written, when probably
> less than 500KB needs to be written to disk?

No, for pretty much all file systems, we don't force the data to be
written when you do a chmod.  This is true for both journalled and
non-journalled file systems.

For a non-journalled file system, we modify the in-memory inode
structure, and then wait for the buffer cache writeback (typically 30
seconds after the block was first dirtied) before the metadata block is
written back.  So if you modify the same block 32 times for 32 inodes,
it won't get written until 30 seconds go by.  As long as you complete
the chmod -R operation within those 30 seconds, only the final state
gets written out.

For a journalled file system, the disk writes won't take place until
the transaction closes, which happens every 5 seconds (by default).

> == Issue number two ==
>
> At least when you write data to the disk, the kernel doesn't flush it
> immediately and your system remains responsive due to the use of dirty
> buffers.
>
> For operations involving metadata updates, the kernel may not have this
> luxury, because the system must be in a consistent state even if it's
> accidentally or intentionally powered off.
>
> So, metadata updates must be carried out immediately, and they can bring
> the system to a halt while flushing the above 40MB of data, as opposed
> to the 500KB that needs to be updated in terms of what is actually being
> updated on disk.

Nope; POSIX does not require this.
As described above, there will be a certain amount of file system
updates that won't be completed if someone kicks the power plug out of
the wall and the system has an unclean shutdown.

> So, the feature I'm looking for would be to say to the kernel: hey I'm
> about to batch 10000 operations, please be considerate, do your thing in
> one fell swoop while optimizing intermediate operations or writes to the
> disk, and there's no rush, so you may as well postpone the whole thing
> as much as you want.

This is what the kernel *always* does.  It's what all systems have done
for decades, including legacy Unix systems, and it's what people have
always expected.  Programmers who want something stronger (e.g.,
databases providing ACID guarantees) will use fsync(2) to explicitly
tell the kernel that reliability is more important than performance or
flash durability.  (All of the extra writes mean extra write cycles,
decreasing the lifespan of SSD's and of newer HDD's with HAMR or MAMR,
where the lasers or masers built into the HDD heads can wear out ---
this is why newer HDD's have write limits in their warranties.
Fortunately, this is what storage specialists at hyperscaler cloud
companies worry about, so you don't have to.)

You can force metadata to be written out by using the sync(2), fsync(2)
or syncfs(2) system calls, but we don't optimize for the uncommon case
where someone might yank the power cord or the kernel crashes
unexpectedly.

Cheers,

					- Ted