On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > >
> > > > In certain situations the background CIL push can be indefinitely
> > > > delayed. While we have workarounds for the obvious cases now, they
> > > > don't solve the underlying issue: there is no upper limit on the CIL
> > > > where we will either force or wait for a background push to start,
> > > > hence allowing the CIL to grow without bound until it consumes all
> > > > log space.
> > > >
> > > > To fix this, add a new wait queue to the CIL which allows background
> > > > pushes to wait for the CIL context to be switched out. This happens
> > > > when the push starts, so it will allow us to block incoming
> > > > transaction commit completion until the push has started. This will
> > > > only affect processes that are running modifications, and only when
> > > > the CIL threshold has been significantly overrun.
> > > >
> > > > This has no apparent impact on performance, and doesn't even trigger
> > > > until over 45 million inodes have been created in a 16-way fsmark
> > > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > > the active CIL size is only about 3% of the total log in that case.
> > > > The concurrent removal of those files did not trigger the background
> > > > sleep at all.
> > > >
> > >
> > > Have you done similar testing for small/minimum sized logs?
> > 
> > Yes. I've had the tracepoint active during xfstests runs on test
> > filesystems using default log sizes on 5-15GB filesystems. The only
> > test in all of xfstests that has triggered it is generic/017, and it
> > only triggered once.
> > 
> > e.g.
> > 
> > # trace-cmd start -e xfs_log_cil_wait
> > <run xfstests>
> > # trace-cmd show
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 1/1   #P:4
> > #
> > #                              _-----=> irqs-off
> > #                             / _----=> need-resched
> > #                            | / _---=> hardirq/softirq
> > #                            || / _--=> preempt-depth
> > #                            ||| /     delay
> > #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > #              | |       |   ||||       |         |
> >           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> > #
> > 
> > And the timestamp matched the time that generic/017 was running.
> 
> So I've run this on my typical 16-way fsmark workload with different
> size logs. It barely triggers on log sizes larger than 64MB; on 32MB
> logs I can see it capturing all 16 fsmark processes while waiting
> for the CIL context to switch. This will give you an idea of the
> log cycles the capture is occurring on, and the count of processes
> being captured:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n | uniq -c
>      16 251
>      32 475
>      16 494
>      32 870
>      15 1132
>      15 1166
>      14 1221
>       1 1222
>      16 1223
>       7 1307
>       8 1308
>      16 1315
>      16 1738
>      16 1832
>       9 2167
>       7 2168
>      16 2200
>      16 2375
>      16 2383
>      16 2700
>      16 2797
>      16 2798
>      16 2892
> $

Thanks. I assume I'm looking at event counts and cycle numbers here,
i.e. the number of captured processes per log cycle?
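For my own understanding, here's a minimal, self-contained sketch of how
I read the blocking mechanism from the commit log above. All of the
struct, field and function names (cil_sketch, push_wait, blocking_limit,
cil_push_background_sketch, cil_context_switched_sketch) are placeholders
of mine, not the actual XFS code; it's just the shape of the throttle:
committers that find the CIL well past the background threshold sleep on
a wait queue, and the push worker wakes them once it has swapped in a new
context.

/*
 * Rough sketch only, using placeholder names - not the real XFS
 * structures or functions.
 */
#include <linux/spinlock.h>
#include <linux/wait.h>

struct cil_sketch {
	spinlock_t		push_lock;	/* protects space_used */
	unsigned long		space_used;	/* bytes in the current CIL context */
	unsigned long		blocking_limit;	/* "significantly overrun" threshold */
	wait_queue_head_t	push_wait;	/* overrun committers sleep here */
};

/* Commit path: called after a transaction's items are added to the CIL. */
static void cil_push_background_sketch(struct cil_sketch *cil)
{
	bool over_limit;

	spin_lock(&cil->push_lock);
	over_limit = cil->space_used >= cil->blocking_limit;
	spin_unlock(&cil->push_lock);

	if (!over_limit)
		return;

	/*
	 * The CIL has grown well past the background push threshold, so
	 * block until the push worker has switched in a new, empty
	 * context. Only modification (transaction commit) paths get here.
	 */
	wait_event(cil->push_wait, cil->space_used < cil->blocking_limit);
}

/* Push worker: called once the CIL context has been switched out. */
static void cil_context_switched_sketch(struct cil_sketch *cil)
{
	wake_up_all(&cil->push_wait);
}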
> 
> So typically groups of captures are hundreds of log cycles apart
> (100 cycles x 32MB = ~3GB of log writes), then there will be a
> stutter where the CIL dispatch is delayed, and then everything
> continues on. These all show the log is always around 75% full
> (the AIL tail pushing threshold), but the reservation grant wait
> lists are always empty so we're not running out of reservation
> space here.
> 

It's somewhat interesting that we manage to block every thread most of
the time before the CIL push task starts. I wonder a bit whether that
pattern would hold for a system/workload with more CPUs (and if so,
whether there are any odd side effects of stalling and waking hundreds
of tasks at the same time vs. our traditional queuing behavior).

> If I make the log even smaller - 16MB - the log is always full, the
> AIL is always tail pushing, and there is a constant stream of log
> forces (30-40/s) because tail pushing is hitting pinned items
> several thousand times a second. IOWs, the frequency of the log
> forces means the CIL is almost never growing large enough to do a
> background push, let alone overrun the blocking threshold. Same
> trace for the same workload as above:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n | uniq -c
>      16 1400
>      16 5284
>      16 5624
>      16 5778
>      16 6159
>      10 6477
> $
> 
> So when we have lots of concurrency and modification, tiny logs
> appear to be less susceptible to CIL overruns than small logs
> because they are constantly tail pushing and issuing log forces
> that trigger flushes of the CIL before an overrun can occur.
> 

Makes sense, thanks.

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
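P.S. As a quick sanity check on the "100 cycles x 32MB = ~3GB" arithmetic
quoted above: each log cycle is one full pass through the circular log,
so the gap in cycles between capture groups times the log size gives a
rough figure for the log writes between CIL stalls. Trivial throwaway
userspace example, nothing XFS-specific:

#include <stdio.h>

int main(void)
{
	unsigned long log_size_mb = 32;	/* the 32MB log case above */
	unsigned long cycle_gap = 100;	/* ~100 cycles between capture groups */

	/* One cycle == one full pass through the log. */
	printf("~%lu MB (~%.1f GB) of log written between stalls\n",
	       cycle_gap * log_size_mb,
	       cycle_gap * log_size_mb / 1024.0);
	return 0;
}

For the 32MB case that prints ~3200 MB (~3.1 GB), which matches the ~3GB
figure quoted above.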