On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> Thank you Dave, for sharing the details and putting the pieces together.
>
> > So, why don't fake completions work well? It's because the IO delay
> > goes away completely and that causes submission/completion
> > processing contention in the log code. If we have lots of idle CPUs
> > sitting around with a synchronous log IO completion, we get this
> > fsync behaviour:
> >
> > fsync proc             CIL                  IO-complete      log
> > xfs_log_force_lsn
> >   <wait for CIL>
> >                        xlog_write()
> >                        <queue to log-wq>
> >                        wake CIL waiters                      xlog_iodone()
> >   <wait for iclogbuf>                                        wake iclogbuf waiters
> > log force done.
> >
> > IOWs, we now have the iclogbuf IO completion work racing with the
> > completion of the CIL push, finalising the log state after the
> > writes, waking CIL waiters and going to sleep waiting for log IO
> > completion and state machine transitions.
> >
> > Basically, it causes cross-CPU contention for the same log cachelines
> > and locks because we've triggered a synchronous IO completion of the
> > iclogbuf. That contention kills CPU performance because it's now
> > having to wait for the cache coherency protocol to fetch data from
> > other CPU caches and/or memory
>
> Sorry, but the contention that you mentioned seems to be more about
> fast-completion-of-log-IO rather than synchronous-completion-of-log-IO.
> Isn't it?
> The code that gets executed in case of fake-completion would be
> exactly the same had storage completed the log IO blazing fast.

No, no it isn't. Blazing fast storage is not the same as the
zero-delay, IO-bypass completion of your fake-completion hack.
Especially considering that it avoids sending REQ_FLUSH to the
underlying device on each fsync, which typically adds some delay to
the IO it is delivered with.

From that perspective, fake-completion is actually causing out of
order completion of the journal IO vs the rest of the IO that is in
flight. i.e. because it doesn't go through the IO layer, scheduler,
etc, it's not actually being issued and completed with all the other
IO that is in progress. Hence it's basically skipping the pending IO
completion queue at all layers and jumping straight to the head of
the queue. IOWs, it's in no way reflective of how the journal behaves
when there is real IO load present on the underlying device.

And now that I've mentioned REQ_FLUSH, I have to point out that
fake-completion is breaking the integrity model of the filesystem.
Failing to issue REQ_FLUSH on log IOs means that volatile caches
below the filesystem are not flushed before the journal is written
to. This also changes the way the underlying device processes IO, so
it may be operating far faster than it should because it's not
getting regular cache flushes.

But even more important: to provide the correct data integrity
semantics to fsync() callers, we really do need to issue two IOs for
each fsync - both the data and the log for the related metadata need
to be written.

So let's go back to looking at where all our time in fsync is being
spent, and why fake-completion is not telling you what you think it
is. With fast nvme storage (using your numbers of 3uS per IO) and
ignoring all the CPU time between the IOs, we could do ~160k fsyncs
per second from a single thread on such a device due to the 6uS per
fsync IO wait time constraint. If we add in 1uS of CPU time to do the
data write, and another 1uS of CPU time for the fsync (i.e. all the
journal write code) we're now at 8uS per fsync and so down to 125k
operations per second.
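To be concrete about what that accounting is describing, the
write+fsync loop in question is essentially the sketch below. It's a
minimal userspace illustration only - the file path, write size and
iteration count are made up, and it is not IOZone's actual code:

/*
 * Minimal write+fsync loop: each pass appends one block and forces
 * it to stable storage. Every fsync() has to write both the data and
 * a journal update for the new file size/extents.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	off_t off = 0;
	int fd, i;

	memset(buf, 0xa5, sizeof(buf));

	/* illustrative test file - use whatever path suits */
	fd = open("/mnt/test/fsync-loop.dat",
		  O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 100000; i++) {
		/* same data every pass - the "application work" is tiny */
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
			perror("pwrite");
			return 1;
		}
		off += sizeof(buf);

		/* data IO + journal IO (+ cache flushes) complete here */
		if (fsync(fd) < 0) {
			perror("fsync");
			return 1;
		}
	}
	close(fd);
	return 0;
}

Each pass through that loop has to wait for both the data IO and the
journal IO to complete before fsync() returns - that's where the 6uS
of per-fsync IO wait time above comes from.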
And then we've got application time to set up for the next
write+fsync, which if there's real data to be processed is going to
be a few uS too. But let's consider that it's IOZone so it's just
feeding the same data in a tight loop, which gives us roughly another
1uS per loop. IOWs, we're now at 9uS of CPU+IOwait time per
write/fsync loop, and that puts us at ~111k fsyncs per second. That's
right on the number that IOZone is giving you for single threaded
performance.

To make it go faster, we have to either reduce CPU overhead or reduce
IO time. We can't reduce IO time - we have to do two physical IOs.
That leaves us with reducing CPU time processing the IO.

Your fake completion code does not just reduce the CPU time of the
journal processing slightly - it completely omits one of those two
IOs (the journal IO) that are required for fsync. IOWs, instead of
6uS of IOwait time, we now only have 3uS. That means the per-fsync
time is back down to 6uS, which is somewhere around 160k fsyncs per
second - that's roughly 40% faster. And if we look at your single
threaded 16 CPU iozone results we see:

> Threads     Base        Fake-completion    Diff%
> 1           110,804     151,478            36.7%

The change in performance is right on the number I'd expect from
eliding the log IO. The measured improvement hasn't come from
fake-completion simulating lower CPU usage, it's come directly from
not issuing and waiting for journal IO.

And your high CPU count results confirm that the contention problems
between submission and completion I described result in reduced
performance at low thread counts. IOWs, your results are confirming
what I've been saying tends to happen with zero-delay IO completion
setups.

IOWs, we already know that removing the journal IO wait time gives a
performance improvement, that it is visible even with really fast IO
subsystems, and that we cannot make optimisations in this manner.
IOWs, your test doesn't tell us that there's a problem with
asynchronous log IO completion algorithms on asynchronous storage; it
just tells us that:

a) the journal IO is only ~35% of the fsync overhead on your device;
   and
b) synchronous and/or zero-delay IO completions are not handled well,
   but that still only extends to ramdisks and pmem in practice.

> And I do not see any different/better numbers when I perform
> fake-completion in asynchronous way (i.e. increment b_io_remaining,
> and queue bp in a workqueue).

That's because you're not changing CPU usage in any meaningful way -
the change in performance you've measured is caused by eliding the
journal IO and scheduling immediate IO completion, not by the extra
context switch that queuing the log IO completion causes.

And, FWIW, workqueues can queue work onto other CPUs - if the work is
already running, then it will continue to be run on the CPU it is
running on, not get scheduled on the queuing CPU. When the XFS log is
busy, this happens a *lot*.
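For reference, the shape of the deferred completion path being talked
about here is roughly the fragment below. This is only a sketch of
the generic "punt bio completion to a workqueue" pattern - the demo_*
names and structure are invented for illustration and it is not the
actual fs/xfs code:

#include <linux/bio.h>
#include <linux/completion.h>
#include <linux/workqueue.h>

/* Illustrative stand-in for a buffer whose IO completion is deferred. */
struct demo_buf {
	struct work_struct	b_ioend_work;
	struct completion	b_done;
};

static void demo_ioend_work(struct work_struct *work)
{
	struct demo_buf *bp = container_of(work, struct demo_buf,
					   b_ioend_work);

	/*
	 * Process context, possibly on a different CPU to the one the
	 * IO completed on. In the real log code this is where the
	 * journal state machine runs and iclogbuf waiters get woken.
	 */
	complete(&bp->b_done);
}

static void demo_bio_end_io(struct bio *bio)
{
	struct demo_buf *bp = bio->bi_private;

	/*
	 * Interrupt/softirq context: don't do the heavy completion
	 * work here, punt it to a workqueue. If b_ioend_work is
	 * already running it stays on the CPU it is running on,
	 * otherwise it is normally run on the CPU that queued it.
	 */
	queue_work(system_wq, &bp->b_ioend_work);
}

static void demo_buf_init(struct demo_buf *bp)
{
	init_completion(&bp->b_done);
	INIT_WORK(&bp->b_ioend_work, demo_ioend_work);
}

Where that deferred completion work ends up running relative to the
CIL push is what determines whether you see the cross-CPU contention
described earlier.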
[snip]

> > Now, your other "unexpected" result - let's use affinity to confine
> > everything to one CPU:
> >
> > fsync proc             CIL                  IO-complete      log
> > xfs_log_force_lsn
> >   <wait for CIL>
> >                        xlog_write()
> >                        <queue to log>
> >                        wake CIL waiters
> >                                                              xlog_iodone()
> >                                                              wake iclogbuf waiters
> > log force done
> >
> > Note the difference? The log work can't start until the CIL work has
> > completed because it's constrained to the same CPU, so it doesn't
> > cause any contention with the finalising of the CIL push and waking
> > waiters.
>
> "Log work can't start" part does not sound very good either. It needed
> to be done anyway before the task waiting for fsync is woken.

I'm beginning to think you don't really understand how logging in XFS
works. If you want to improve the logging subsystem, it would be a
good idea for you to read and understand this document so you have
some idea of how the data that is written to the log gets there and
what happens to it after it's in the log:

Documentation/filesystems/xfs-delayed-logging-design.txt

> > Perhaps it would be better for you to describe what you are trying
> > to achieve first so we can point you in the right direction rather
> > than playing "20 Questions" and still not understanding why the
> > filesystem isn't doing what you expect it to be doing....
>
> Single threaded IO is about latency. And storage-systems need to care
> about that apart from multi-threaded, throughput scenarios.

Who does single threaded write IO these days? What real world
application problem are you trying to solve?

IMO, if you care about single operation latency in isolation from
everything else, XFS is not the filesystem to use. XFS's algorithms
are CPU intensive because they are designed to scale out to large CPU
counts efficiently. e.g. a single write+fsync on ext4 takes less than
50% of the CPU time that XFS does to execute and complete. So for the
sorts of single threaded fsync workloads you are talking about, ext4
will always have much lower latency for single operations than XFS
because it burns much less CPU between waiting for IOs.

> To rule out the device peculiarities and to determine where exactly
> the problem lies, I chose to try the whole thing with fake completion.
>
> I'd appreciate it if you have suggestions about what could be changed
> in XFS to take advantage of low-latency log IO.

I still don't understand what application you have that requires
optimisation of single threaded fsync latency. Without some context
of the application level problem you are trying to solve, I can't
answer questions like this.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx