Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

"Theodore Ts'o" <tytso@xxxxxxx> · Sat, 24 Feb 2024 16:42:45 -0500

On Sat, Feb 24, 2024 at 11:11:28AM -0800, Linus Torvalds wrote:
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

Well, actually there's a relatively common workload where we do this
exact same thing --- and that's when we run mkfs.ext[234] / mke2fs.
We issue a huge number of buffered writes (at least, if the device
doesn't support a zeroing discard operation) to zero out the inode
table.  We rely on the mm subsystem putting mke2fs "into the penalty
box", or else some process (usually mke2fs) will get OOM-killed.

I don't consider it a "penalty" --- in fact, when write throttling
doesn't work, I've complained that it's an mm bug.  (Sometimes this
has broken when the mke2fs process runs out of physical memory, and
sometimes it has broken when mke2fs runs into the memory cgroup
limit; it's one of those things that seems to break every 3-5
years.)  But still, it's something which *must* work, because it's
really not reasonable for userspace to know what is a reasonable rate
at which to self-throttle buffered writes --- it's something the kernel
should do for the userspace process.

Because this is something that has broken more than once, we have two
workarounds in mke2fs; one is that we can call fsync(2) every N block
groups' worth of inode tables, which is kind of a hack, and the other
is that we can use Direct I/O.  But using DIO has a worse user
experience (well, unless the alternative is mke2fs getting OOM-killed;
admittedly that's worse) than just using buffered I/O, since we
generally don't need to synchronously wait for the write requests to
complete.  Neither is enabled by default, because in my view, this is
something the mm should just get right, darn it.
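
For illustration, the fsync(2) workaround can be sketched roughly as
below.  This is not the actual mke2fs code: the function name, the
64 KiB chunk size, and the parameters are all made up for the example.
It only shows the shape of the hack, which is to write zeros through
the page cache and force writeback every N block groups so that dirty
pages can't pile up without bound.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from e2fsprogs): write `groups` zeroed
 * regions of `bytes_per_group` bytes each using buffered writes, and
 * call fsync(2) after every `sync_every` groups.  A sync_every of 0
 * disables the periodic fsync.  Returns the number of fsync calls made,
 * or -1 on error.
 */
static int zero_tables_with_periodic_fsync(int fd, unsigned groups,
                                           size_t bytes_per_group,
                                           unsigned sync_every)
{
    enum { CHUNK = 64 * 1024 };        /* assumed write size */
    char *zeros = calloc(1, CHUNK);    /* zero-filled buffer */
    int syncs = 0;

    if (zeros == NULL)
        return -1;

    for (unsigned g = 0; g < groups; g++) {
        size_t left = bytes_per_group;

        while (left > 0) {
            size_t want = left < CHUNK ? left : CHUNK;
            ssize_t n = write(fd, zeros, want);

            if (n <= 0) {              /* treat short-circuit as error */
                free(zeros);
                return -1;
            }
            left -= (size_t)n;
        }

        /* Bound the amount of dirty data by flushing every N groups. */
        if (sync_every != 0 && (g + 1) % sync_every == 0) {
            if (fsync(fd) != 0) {
                free(zeros);
                return -1;
            }
            syncs++;
        }
    }

    free(zeros);
    return syncs;
}
```

The point of the sketch is that userspace has to guess N, which is
exactly the kind of tuning it is poorly placed to do.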

In any case, I definitely don't consider write throttling to be a
performance "problem" --- it's actually a far worse problem when the
throttling doesn't happen, because it generally means someone is
getting OOM-killed.

						- Ted