Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

"Theodore Ts'o" <tytso@xxxxxxx> · Tue, 21 May 2019 15:10:33 -0400

On Tue, May 21, 2019 at 02:19:53PM -0400, Josef Bacik wrote:
> Chris is adding a REQ_ROOT (or something) flag that means don't throttle me now,
> but the blkcg attached to the bio is the one that is responsible for this
> IO.  Then for io.latency we'll let the io go through unmolested but it gets
> counted to the right cgroup, and if then we're exceeding latency guarantees we
> have the ability to schedule throttling for that cgroup in a safer place.  This
> would eliminate the data=ordered issue for ext4, you guys keep doing what you
> are doing and we'll handle throttling elsewhere, just so long as the bios are
> tagged with the correct source then all is well.  Thanks,

Great, it sounds like Chris also came up with the entangled-writes
flag idea (although with probably a better name than I did :-).  So
now all we need to do is to plumb a flag through the writeback code so
that file systems (or the VFS layer) implementing syncfs(2) or
fsync(2) can arrange to have that flag set if necessary.
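
To make the accounting idea concrete, here is a rough userspace model
of the behaviour Josef describes: a flagged request is never blocked,
but it is still charged to the cgroup that owns it, and any throttling
is deferred to a later, safer point.  This is only a sketch, not kernel
code.  Every name in it (model_cgroup, model_bio, MODEL_NO_THROTTLE,
model_submit) is made up for illustration, and a simple byte budget
stands in for io.latency's real latency targets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cgroup state; this is NOT the kernel's struct blkcg. */
struct model_cgroup {
	uint64_t budget;           /* bytes allowed per accounting window */
	uint64_t charged;          /* bytes charged so far in this window */
	bool     throttle_pending; /* throttle later, at a safe point */
};

/* Hypothetical stand-in for the proposed "don't throttle me now" flag. */
#define MODEL_NO_THROTTLE 0x1u

/* Hypothetical request, tagged with the cgroup responsible for it. */
struct model_bio {
	struct model_cgroup *owner;
	uint64_t bytes;
	unsigned int flags;
};

/*
 * Returns true if the I/O may be issued now, false if the submitter
 * would have to wait.  A flagged request always goes through, but it
 * is still charged to its owner; exceeding the budget only marks the
 * cgroup for throttling later.
 */
static bool model_submit(struct model_bio *bio)
{
	struct model_cgroup *cg = bio->owner;

	if (bio->flags & MODEL_NO_THROTTLE) {
		cg->charged += bio->bytes;
		if (cg->charged > cg->budget)
			cg->throttle_pending = true;
		return true;
	}

	/* Ordinary requests pay for earlier overshoot by waiting here. */
	if (cg->throttle_pending || cg->charged + bio->bytes > cg->budget)
		return false;

	cg->charged += bio->bytes;
	return true;
}
```

The point of the model is that the flagged path never returns false,
so a journal commit cannot get stuck behind a throttled cgroup, while
the owner's ordinary I/O still ends up paying for the overshoot.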

Speaking of syncfs(2), something which we considered doing at Google
many years ago (but never did) was to implement a hack so that a
syncfs(2) or sync(2) call made by a non-root user would be a no-op.
The reason for this was that on heavily loaded machines, an SRE logged
in as a non-root user might absent-mindedly type "sync", and that
would cause a storm of I/O traffic that would really mess up the
machine.  The jobs that were in the low-latency bucket would be
protected (since we didn't run with journalling), but those that were
in the best-efforts bucket would be really unhappy.

If we have a "don't throttle me now" REQ_ROOT flag combined with
journalling, then someone running "sync", even if it's by accident,
could really ruin a low-latency job's day, and in a container
environment, there really is no reason for a non-root user to be
wanting to request a syncfs(2) or sync(2).  So maybe we should have a
way to make it be a no-op (or return an error, but that might surprise
some applications) for non-privileged users.  Maybe as a per-mount
flag/option, or via some other tunable?
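
As a strawman for what such a tunable might decide, here is a small
userspace sketch of the policy.  Nothing like this exists today; the
policy names and model_sync_decide() are hypothetical, and a real
implementation would presumably use a capability check rather than a
plain boolean.  Note that sync(2) returns void, so the "return an
error" variant could only be visible to callers of syncfs(2).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-mount (or global) policy values. */
enum model_sync_policy {
	MODEL_SYNC_ALLOW, /* today's behaviour: anyone may sync */
	MODEL_SYNC_NOOP,  /* unprivileged callers silently do nothing */
	MODEL_SYNC_EPERM  /* unprivileged callers get an error */
};

enum model_sync_action { MODEL_DO_SYNC, MODEL_SKIP_SYNC };

struct model_sync_result {
	enum model_sync_action action; /* whether writeback is started */
	int ret;                       /* value a syncfs(2) caller would see */
};

/* Decide what a sync request should do under the given policy. */
static struct model_sync_result
model_sync_decide(enum model_sync_policy policy, bool privileged)
{
	struct model_sync_result res = { MODEL_DO_SYNC, 0 };

	/* Privileged callers are never restricted. */
	if (privileged || policy == MODEL_SYNC_ALLOW)
		return res;

	res.action = MODEL_SKIP_SYNC;
	if (policy == MODEL_SYNC_EPERM)
		res.ret = -EPERM;
	return res;
}
```

The no-op variant keeps existing applications working at the cost of
silently weakening what sync means for them, which is why it would
need to be opt-in.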

						- Ted