On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > >
> > > > In certain situations the background CIL push can be indefinitely
> > > > delayed. While we have workarounds for the obvious cases now, they
> > > > don't solve the underlying issue: there is no upper limit on the CIL
> > > > where we will either force or wait for a background push to start,
> > > > hence allowing the CIL to grow without bound until it consumes all
> > > > log space.
> > > >
> > > > To fix this, add a new wait queue to the CIL which allows background
> > > > pushes to wait for the CIL context to be switched out. This happens
> > > > when the push starts, so it will allow us to block incoming
> > > > transaction commit completion until the push has started. This will
> > > > only affect processes that are running modifications, and only when
> > > > the CIL threshold has been significantly overrun.
> > > >
> > > > This has no apparent impact on performance, and doesn't even trigger
> > > > until over 45 million inodes have been created in a 16-way fsmark
> > > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > > the active CIL size is only about 3% of the total log in that case.
> > > > The concurrent removal of those files did not trigger the background
> > > > sleep at all.
> > > >
> > >
> > > Have you done similar testing for small/minimum sized logs?
> > 
> > Yes. I've had the tracepoint active during xfstests runs on test
> > filesystems using default log sizes on 5-15GB filesystems. The only
> > test in all of xfstests that has triggered it is generic/017, and it
> > only triggered once.
> > 
> > e.g.
> > 
> > # trace-cmd start -e xfs_log_cil_wait
> > <run xfstests>
> > # trace-cmd show
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 1/1   #P:4
> > #
> > #                              _-----=> irqs-off
> > #                             / _----=> need-resched
> > #                            | / _---=> hardirq/softirq
> > #                            || / _--=> preempt-depth
> > #                            ||| /     delay
> > #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > #              | |       |   ||||       |         |
> >           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> > #
> > 
> > And the timestamp matched the time that generic/017 was running.
> 
> So I've run this on my typical 16-way fsmark workload with different
> size logs. It barely triggers on log sizes larger than 64MB; on 32MB
> logs I can see it capturing all 16 fsmark processes while waiting
> for the CIL context to switch. This will give you an idea of the
> log cycles the capture is occurring on, and the count of processes
> being captured:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n | uniq -c
>      16 251
>      32 475
>      16 494
>      32 870
>      15 1132
>      15 1166
>      14 1221
>       1 1222
>      16 1223
>       7 1307
>       8 1308
>      16 1315
>      16 1738
>      16 1832
>       9 2167
>       7 2168
>      16 2200
>      16 2375
>      16 2383
>      16 2700
>      16 2797
>      16 2798
>      16 2892
> $

Thanks. I assume I'm looking at event counts and cycle numbers here,
i.e. the number of captured processes per log cycle?
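For my own understanding, here's a minimal, self-contained sketch of how
I read the blocking mechanism from the commit log above. All of the
struct, field and function names (cil_sketch, push_wait, blocking_limit,
cil_push_background_sketch, cil_context_switched_sketch) are placeholders
of mine, not the actual XFS code; it's just the shape of the throttle:
committers that find the CIL well past the background threshold sleep on
a wait queue, and the push worker wakes them once it has swapped in a new
context.

/*
 * Rough sketch only, using placeholder names - not the real XFS
 * structures or functions.
 */
#include <linux/spinlock.h>
#include <linux/wait.h>

struct cil_sketch {
	spinlock_t		push_lock;	/* protects space_used */
	unsigned long		space_used;	/* bytes in the current CIL context */
	unsigned long		blocking_limit;	/* "significantly overrun" threshold */
	wait_queue_head_t	push_wait;	/* overrun committers sleep here */
};

/* Commit path: called after a transaction's items are added to the CIL. */
static void cil_push_background_sketch(struct cil_sketch *cil)
{
	bool over_limit;

	spin_lock(&cil->push_lock);
	over_limit = cil->space_used >= cil->blocking_limit;
	spin_unlock(&cil->push_lock);

	if (!over_limit)
		return;

	/*
	 * The CIL has grown well past the background push threshold, so
	 * block until the push worker has switched in a new, empty
	 * context. Only modification (transaction commit) paths get here.
	 */
	wait_event(cil->push_wait, cil->space_used < cil->blocking_limit);
}

/* Push worker: called once the CIL context has been switched out. */
static void cil_context_switched_sketch(struct cil_sketch *cil)
{
	wake_up_all(&cil->push_wait);
}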
> 
> So typically groups of captures are hundreds of log cycles apart
> (100 cycles x 32MB = ~3GB of log writes), then there will be a
> stutter where the CIL dispatch is delayed, and then everything
> continues on. These all show the log is always around 75% full
> (the AIL tail pushing threshold), but the reservation grant wait
> lists are always empty so we're not running out of reservation
> space here.
> 

It's somewhat interesting that we manage to block every thread most of
the time before the CIL push task starts. I wonder a bit whether that
pattern would hold for a system/workload with more CPUs (and if so,
whether there are any odd side effects of stalling and waking hundreds
of tasks at the same time vs. our traditional queuing behavior).

> If I make the log even smaller - 16MB - the log is always full, the
> AIL is always tail pushing, and there is a constant stream of log
> forces (30-40/s) because tail pushing is hitting pinned items
> several thousand times a second. IOWs, the frequency of the log
> forces means the CIL is almost never growing large enough to do a
> background push, let alone overrun the blocking threshold. Same
> trace for the same workload as above:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n | uniq -c
>      16 1400
>      16 5284
>      16 5624
>      16 5778
>      16 6159
>      10 6477
> $
> 
> So when we have lots of concurrency and modification, tiny logs
> appear to be less susceptible to CIL overruns than small logs
> because they are constantly tail pushing and issuing log forces
> that trigger flushes of the CIL before an overrun can occur.
> 

Makes sense, thanks.

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
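P.S. As a quick sanity check on the "100 cycles x 32MB = ~3GB" arithmetic
quoted above: each log cycle is one full pass through the circular log,
so the gap in cycles between capture groups times the log size gives a
rough figure for the log writes between CIL stalls. Trivial throwaway
userspace example, nothing XFS-specific:

#include <stdio.h>

int main(void)
{
	unsigned long log_size_mb = 32;	/* the 32MB log case above */
	unsigned long cycle_gap = 100;	/* ~100 cycles between capture groups */

	/* One cycle == one full pass through the log. */
	printf("~%lu MB (~%.1f GB) of log written between stalls\n",
	       cycle_gap * log_size_mb,
	       cycle_gap * log_size_mb / 1024.0);
	return 0;
}

For the 32MB case that prints ~3200 MB (~3.1 GB), which matches the ~3GB
figure quoted above.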