On Mon, Jan 13, 2025 at 07:41:53AM +0000, Artem S. Tashkinov wrote:
> Let's say you chmod two files sequentially:
>
> What's happening:
>
> 1) the kernel looks up an inode
> 2) the kernel then updates the metadata (at the very least one logical
> block is being written, that's 4K bytes)
>
> ditto for the second file.
>
> Now let's say we are updating 10000 files.
>
> Does this mean that at least 40MB of data will be written, when probably
> less than 500KB needs to be written to disk?

No, for pretty much all file systems, we don't force the data to be
written when you do a chmod.  This is true for both journalled and
non-journalled file systems.

For a non-journalled file system, we modify the in-memory inode
structure, and then wait for the buffer cache writeback (typically 30
seconds after the block was first dirtied) before the metadata block is
written back.  So if you modify the same block 32 times for 32 inodes,
it won't get written until 30 seconds go by.  As long as you complete
the chmod -R operation within those 30 seconds, only the final state
gets written out.

For a journalled file system, the disk writes won't take place until
the transaction closes, which happens every 5 seconds (by default).

> == Issue number two ==
>
> At least when you write data to the disk, the kernel doesn't flush it
> immediately and your system remains responsive due to the use of dirty
> buffers.
>
> For operations involving metadata updates, the kernel may not have this
> luxury, because the system must be in a consistent state even if it's
> accidentally or intentionally powered off.
>
> So, metadata updates must be carried out immediately, and they can bring
> the system to a halt while flushing the above 40MB of data, as opposed
> to the 500KB that needs to be updated in terms of what is actually being
> updated on disk.

Nope; POSIX does not require this.
As described above, there will be a certain amount of file system
updates that won't be completed if someone kicks the power plug out of
the wall and the system has an unclean shutdown.

> So, the feature I'm looking for would be to say to the kernel: hey I'm
> about to batch 10000 operations, please be considerate, do your thing in
> one fell swoop while optimizing intermediate operations or writes to the
> disk, and there's no rush, so you may as well postpone the whole thing
> as much as you want.

This is what the kernel *always* does.  It's what all systems have done
for decades, including legacy Unix systems, and it's what people have
always expected.  Programmers who want something stronger (e.g.,
databases providing ACID guarantees) will use fsync(2) to explicitly
tell the kernel that reliability is more important than performance or
flash durability.  (All of the extra writes mean extra write cycles,
decreasing the lifespan of SSD's and of newer HDD's with HAMR or MAMR,
where the lasers or masers built into the HDD heads can wear out ---
this is why newer HDD's have write limits in their warranties.
Fortunately, this is what storage specialists at hyperscaler cloud
companies worry about, so you don't have to.)

You can force metadata to be written out by using the sync(2), fsync(2)
or syncfs(2) system calls, but we don't optimize for the uncommon case
where someone might yank the power cord or the kernel crashes
unexpectedly.

Cheers,

					- Ted