Re: Strange behavior with log IO fake-completions

Thank you, Dave, for sharing the details and putting the pieces together.

> So, why don't fake completions work well? It's because the IO delay
> goes away completely and that causes submission/completion
> processing contention in the log code. If we have lots of idle CPUs
> sitting around with a synchronous log IO completion, we get this
> fsync behaviour:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                           <queue to log-wq>
>                         wake CIL waiters                xlog_iodone()
>   <wait for iclogbuf>                                   wake iclogbuf waiters
>
> log force done.
>
> IOWs, we now have the iclogbuf IO completion work racing with the
> completion of the CIL push, finalising the log state after the
> writes, waking CIL waiters and going to sleep waiting for log IO
> completion and state machine transitions.
>
> Basically, it causes cross-CPU contention for the same log cachelines
> and locks because we've triggered a synchronous IO completion of the
> iclogbuf. That contention kills CPU performance because it's now
> having to wait for the cache coherency protocol to fetch data from
> other CPU caches and/or memory

Sorry, but the contention you mention seems to be about
fast-completion-of-log-IO rather than
synchronous-completion-of-log-IO, doesn't it?
The code that executes for a fake completion is exactly the same as
if the storage had completed the log IO blazingly fast.
And I do not see any different/better numbers when I perform the
fake completion asynchronously (i.e. increment b_io_remaining and
queue the bp on a workqueue).

I gathered some more numbers with iozone, and with filebench's varmail
(another sync-heavy benchmark), with 48 and 16 CPUs (reduced via the
nr_cpus kernel parameter).
The numbers indicate that whatever contention applies at 48 CPUs
seems to vanish when running with 16 CPUs. In fact, fake-completion
outperforms base by a good margin with 16 CPUs.

************************ Varmail (nr_cpus=48 vs nr_cpus=16) ************************

Threads    Base (48 / 16)         Fake-completion (48 / 16)    Diff% (48 / 16)
1           53,662 /  55,815        38,322 /  63,919           -28.6% / 14.5%
2           89,382 /  80,196        84,183 /  94,269            -5.8% / 17.5%
4          148,953 / 144,781       141,030 / 163,434            -5.3% / 12.9%
8          230,970 / 226,314       230,410 / 243,300            -0.2% /  7.5%
16         235,077 / 242,282       234,817 / 244,312            -0.1% /  0.8%
32         235,752 / 243,885       233,517 / 245,685            -0.9% /  1.0%
64         234,659 / 240,872       232,855 / 243,884            -0.8% /  1.3%

************************ Iozone (nr_cpus=48 vs nr_cpus=16) ************************

Threads    Base (48 / 16)              Fake-completion (48 / 16)    Diff% (48 / 16)
1             99,691 /   110,804          45,801 /   151,478        -54.1% / 36.7%
2            182,221 /   188,503         145,968 /   227,948        -19.8% / 20.9%
4            336,585 /   333,111         326,697 /   392,058         -2.9% / 17.7%
8            553,251 /   544,356         621,871 /   677,801         12.4% / 24.5%
16           768,023 /   779,373         944,148 /   964,868         22.9% / 23.8%
32           934,050 /   959,441       1,088,719 / 1,094,018         16.6% / 14.0%
64         1,053,738 / 1,015,464       1,086,711 / 1,045,853          3.1% /  3.0%

> Now, your other "unexpected" result - let's use affinity to confine
> everything to one CPU:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         <queue to log>
>                         wake CIL waiters
>                                                         xlog_iodone()
>                                                         wake iclogbuf waiters
> log force done
>
> Note the difference? The log work can't start until the CIL work has
> completed because it's constrained to the same CPU, so it doesn't
> cause any contention with the finalising of the CIL push and waking
> waiters.

The "log work can't start" part does not sound great either. It needed
to be done anyway before the task waiting on fsync is woken.
And even without the affinity, the workqueue infrastructure executes
deferred work on the same CPU where it was queued.
At least on 4.15, I see that if I queue work on CPU X, a kworker
running on CPU X picks it up and processes it.
So I expect that for a single fsync IO, all the workers involved
would have executed on the same CPU, even without affinity.

> > If underlying device/layers provide some sort of differentiated I/O
> > service enabling ultra-low-latency completion for certain IOs (flagged
> > as urgent), and one chooses log IO to take that low-latency path  -
> > won't we see same problem as shown by fake-completion?
>
> Maybe, maybe not. It really depends on what is happening system
> wide, not what is happening to any single thread or IO....
>
> IOWs, I wouldn't be trying to extrapolate how a filesystem
> will react to a specific type of storage from a single threaded,
> synchronous write workload. All you'll succeed in doing is
> micro-optimising behaviour for single threaded, synchronous
> writes on that specific storage device.
>
> Perhaps it would be better for you to describe what you are trying
> to achieve first so we can point you in the right direction rather
> than playing "20 Questions" and still not understanding why the
> filesystem isn't doing what you expect it to be doing....

Single-threaded IO is about latency, and storage systems need to care
about that in addition to multi-threaded throughput scenarios.
In the varmail/iozone data above, XFS seems friendlier to throughput
than to latency. At 8 or more threads, XFS transaction speed becomes
independent of log IO speed (which is a good thing). But below 8
threads, it responds oddly to the log getting faster.
The underlying storage (NVMe) lets me complete certain IOs with very
low latency (around ~3 microseconds for 4K), and I leveraged that
facility to complete log IO with very low latency, expecting that the
latency of single-threaded fsync IOs would improve as a result. I hope
that is not too unnatural an expectation.
To rule out device peculiarities and determine where exactly the
problem lies, I chose to try the whole thing with fake completion.

I'd appreciate any suggestions about what could be changed in XFS to
take advantage of low-latency log IO.

Thanks,
On Wed, Sep 12, 2018 at 6:13 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote:
> > > > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> > > > '-eo' option).
> > > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> > > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> > > >
> > > > And results are :
> > > > ------------------------------------------------
> > > > baseline                log fake-completion
> > > > 109,845                 45,538
> > > > ------------------------------------------------
> > > > I wondered why fake-completion turned out to be ~50% slower!
> > > > May I know if anyone encountered this before, or knows why this can happen?
> > >
> > > You made all log IO submission/completion synchronous.
> > >
> > > https://marc.info/?l=linux-xfs&m=153532933529663&w=2
> > >
> > > > For fake-completion, I just tag all log IOs bufer-pointers (in
> > > > xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> > > > log IOs without any real bio-formation (comment call to
> > > > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
> > >
> > > It'll work, but it's a pretty silly thing to do. See above.
> >
> > Thank you very much, Dave. I feel things are somewhat different here
> > than other email-thread you pointed.
> > Only log IO was fake-completed, not that entire XFS volume was on
> > ramdisk.
>
> The "fake completion" forces log IO to be completed synchronously,
> just like a ramdisk does. It's exactly the same behaviour.
>
> > Underlying disk was NVMe, and I checked that no
> > merging/batching happened for log IO submissions in base case.
> > Completion count were same (as many as submitted) too.
>
> Of course. async or sync, the amount of IO work an isolated single
> threaded, synchronous write workload does is the same because there
> is no other IO that can be aggregated for either the data or journal
> writes....
>
> > Call that I disabled in base code (in xfs_buf_submit, for log IOs) is
> > _xfs_buf_ioapply.
>
> Yes, I assumed you were doing something like that. The problem is,
> you're not looking at what the journal is doing as a whole when
> xfs_log_force_lsn() is called from fsync.
>
> In this case, the initial log state is that the current iclogbuf is
> ACTIVE because we have no journal writes in progress. The typical,
> full IO path through this code is as follows:
>
> fsync() process                 CIL workqueue           log workqueue
>
> xfs_log_force_lsn()
>   xlog_cil_force_lsn()
>     xlog_cil_push_now(seq)
>         <flushes pending push work>
>         <queues required push work>
>     xlog_wait(xc_commit_wait);
>         <blocks waiting for CIL push>
>
>                                 <walks items in CIL>
>                                 <formats into log buffer>
>                                 xlog_write()
>                                   .....
>                                   iclogbuf goes to SYNCING
>                                   xfs_buf_submit(iclogbuf)
>                                     __xfs_buf_submit()
>                                         submit_bio()
>                                 <push work complete>
>                                 <wakes push waiters>
>
> <wakes, returns to __xfs_log_force_lsn()>
>   __xfs_log_force_lsn()
>     <iclogbuf in SYNCING>
>     xlog_wait(ic_force_wait)
>       <blocks waiting for IO completion>
>
>                                 <IO completion on some CPU>
>                                    xfs_buf_endio_async()
>                                      queue_work(log-wq)
>                                                         <wakes>
>                                                         xlog_iodone
>                                                           xlog_state_do_callback
>                                                             wake_up_all(ic_force_wait)
>                                                         <work complete>
> <wakes, returns to __xfs_log_force_lsn()>
> <log force done>
> <returns to fsync context>
>
> In normal cases, the overhead of IO completion queuing the
> xlog_iodone completion work to the log wq can be mostly ignored -
> it's just part of the IO completion delay time. What's important to
> realise is that it runs some time after the CIL push waiters were
> woken and then go back to sleep waiting for the log IO to complete,
> so the processing is completely separated and there's no contention
> between submission and completion.
>
> > So only thing happened differently for log IO
> > submitter thread is that it executed bp-completion-handling-code
> > (xfs_buf_ioend_async). And that anyway pushes the processing to
> > worker. It still remains mostly async, I suppose.
>
> Well, when and how the completion runs depends on a *lot* of things.
> Let's simplify - the above case is:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         wake CIL waiters
>   <wait for iclogbuf>
>                                         <queue to log>
>                                                         xlog_iodone()
>                                                         wake iclogbuf waiters
> log force done.
>
> Essentially, there are 7 context switches in that, but we just don't
> see that overhead because it's tiny compared to the IO time. There's
> also no contention between the fsync proc waiting on iclogbuf io
> completion and iclogbuf completion processing, so that runs as fast
> as possible.
>
> > In original form, it
> > would have executed extra code to form/sent bio (possibility of
> > submission/completion merging, but that did not happen for this
> > workload), and completion would have come after some time say T.
> > I wondered about impact on XFS if this time T can be made very low by
> > underlying storage for certain IOs.
>
> Even when IO times are short as a few tens of microseconds, this is
> sufficient to avoid contention between submission and completion
> processing and turn the context switches into noise. And even with
> shorter IO delays, the async processing still allows the log state
> machine to aggregate concurrent fsyncs - see the recent
> work with AIO fsync to scale out to concurrently writing and
> fsyncing tens of thousands of files per second on high speed nvme
> SSDs.
>
> So, why don't fake completions work well? It's because the IO delay
> goes away completely and that causes submission/completion
> processing contention in the log code. If we have lots of idle CPUs
> sitting around with a synchronous log IO completion, we get this
> fsync behaviour:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                           <queue to log-wq>
>                         wake CIL waiters                xlog_iodone()
>   <wait for iclogbuf>                                   wake iclogbuf waiters
>
> log force done.
>
> IOWs, we now have the iclogbuf IO completion work racing with the
> completion of the CIL push, finalising the log state after the
> writes, waking CIL waiters and going to sleep waiting for log IO
> completion and state machine transitions.
>
> Basically, it causes cross-CPU contention for the same log cachelines
> and locks because we've triggered a synchronous IO completion of the
> iclogbuf. That contention kills CPU performance because it's now
> having to wait for the cache coherency protocol to fetch data from
> other CPU caches and/or memory. This can easily slow critical code
> down by a factor of 3-4 orders of magnitude. Racing/spurious
> sleep/wakeup events have a fair bit of overhead, too.
>
> Now, your other "unexpected" result - let's use affinity to confine
> everything to one CPU:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         <queue to log>
>                         wake CIL waiters
>                                                         xlog_iodone()
>                                                         wake iclogbuf waiters
> log force done
>
> Note the difference? The log work can't start until the CIL work has
> completed because it's constrained to the same CPU, so it doesn't
> cause any contention with the finalising of the CIL push and waking
> waiters.
>
> Further, because the log workqueue is a high priority workqueue and
> we're confined to a single CPU, it gets processed by the workqueue
> infrastructure before it releases the CPU to schedule other tasks.
> This changes the order of operations in the code.
>
> And because it's all on the one CPU, this is effectively a CPU bound
> operation and the cachelines stay hot in the CPU and there's no lock
> contention to invalidate them. The code runs as fast as it is
> possible to run as a result.
>
> Combine those things and the result is that we do less work and the
> CPU executes it faster. Not to mention that only running on one CPU
> core allows modern CPUs to maximise the clock frequency on that one
> core, making the affinity constrained workload look even faster than
> one that is spread across multiple cores. (Reliable performance
> benchmarking is hard!)
>
> > If underlying device/layers provide some sort of differentiated I/O
> > service enabling ultra-low-latency completion for certain IOs (flagged
> > as urgent), and one chooses log IO to take that low-latency path  -
> > won't we see same problem as shown by fake-completion?
>
> Maybe, maybe not. It really depends on what is happening system
> wide, not what is happening to any single thread or IO....
>
> IOWs, I wouldn't be trying to extrapolate how a filesystem
> will react to a specific type of storage from a single threaded,
> synchronous write workload. All you'll succeed in doing is
> micro-optimising behaviour for single threaded, synchronous
> writes on that specific storage device.
>
> Perhaps it would be better for you to describe what you are trying
> to achieve first so we can point you in the right direction rather
> than playing "20 Questions" and still not understanding why the
> filesystem isn't doing what you expect it to be doing....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx



-- 
Joshi


