On 6/8/22 8:53 PM, Dave Chinner wrote:
> On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote:
>> iomap_do_writepage() sends pages past i_size through
>> folio_redirty_for_writepage(), which normally isn't a problem because
>> truncate and friends clean them very quickly.
>> When the system has cgroups configured, we can end up in situations
>> where one cgroup has almost no dirty pages at all, and other cgroups
>> consume the entire background dirty limit. This is especially common in
>> our XFS workloads in production because they have cgroups using O_DIRECT
>> for almost all of the IO mixed in with cgroups that do more traditional
>> buffered IO work.
>> We've hit storms where the redirty path hits millions of times in a few
>> seconds, all on a single file that's only ~40 pages long. This leads to
>> long tail latencies for file writes because the pdflush workers are
>> hogging the CPU from some kworkers bound to the same CPU.
>> Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new
>> eof page on truncate...") ends up writing/waiting on most of these dirty
>> pages before truncate gets a chance to wait on them.
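(As a simplified illustration of the mechanism described above: writeback considers a folio wholly past EOF when its first byte sits at or beyond i_size, and the pre-patch behavior was to redirty such folios so truncate could clean them up. The sketch below is a userspace model under that assumption; `folio_wholly_past_eof` and `writeback_pass` are hypothetical names, not the kernel's code.)

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical model: a folio is wholly past EOF when its first byte
 * is at or beyond i_size. Writeback cannot safely write such a folio,
 * so the pre-patch code redirtied it and moved on, expecting truncate
 * to clean it shortly. */
static bool folio_wholly_past_eof(uint64_t folio_pos, uint64_t i_size)
{
    return folio_pos >= i_size;
}

/* Model one flusher pass over a file's dirty pages. If i_size has
 * already been cut (an in-flight truncate) but the pages past it are
 * still dirty, every such page is redirtied rather than written, so
 * the next pass sees the same dirty pages again -- the redirty storm.
 * Returns the number of folios redirtied instead of written. */
static unsigned int writeback_pass(uint64_t i_size, unsigned int nr_dirty)
{
    unsigned int redirtied = 0;

    for (unsigned int i = 0; i < nr_dirty; i++) {
        uint64_t pos = (uint64_t)i * PAGE_SIZE;

        if (folio_wholly_past_eof(pos, i_size))
            redirtied++;        /* stays dirty; revisited next pass */
        /* else: submit the folio for real writeback */
    }
    return redirtied;
}
```

With ~40 dirty pages and i_size already truncated down, nearly every pass redirties nearly every page, which is why the path can fire millions of times in seconds on such a small file.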
>
> That commit went into 5.10, so this would mean it's not easily
> reproducible on kernels released since then?

Yes, our main two prod kernels right now are v5.6 and v5.12, but we
don't have enough of this database tier on 5.12 to have any meaningful
data from production. For my repro, I didn't spend much time on 5.12,
but it was hard to trigger there as well.
[...]
> Regardless, the change looks fine.
>
> Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>

Thanks! Johannes and I are both going on vacation, but I'll get an
experiment rolled out to enough hosts to see whether the long tails get
shorter. We're unlikely to come back with results before July.
-chris