On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote:
> > > Test: iozone, single-thread, 1GiB file, 4K record, sync for each
> > > 4K ('-eo' option).
> > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except
> > > log size = 184M
> > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> > >
> > > And results are :
> > > ------------------------------------------------
> > > baseline log            fake-completion
> > > 109,845                 45,538
> > > ------------------------------------------------
> > > I wondered why fake-completion turned out to be ~50% slower!
> > > May I know if anyone encountered this before, or knows why this
> > > can happen?
> >
> > You made all log IO submission/completion synchronous.
> >
> > https://marc.info/?l=linux-xfs&m=153532933529663&w=2
> >
> > > For fake-completion, I just tag all log IOs buffer-pointers (in
> > > xlog_sync). And later in xfs_buf_submit, I just complete those
> > > tagged log IOs without any real bio-formation (comment call to
> > > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
> >
> > It'll work, but it's a pretty silly thing to do. See above.
>
> Thank you very much, Dave. I feel things are somewhat different here
> than other email-thread you pointed.
> Only log IO was fake-completed, not that entire XFS volume was on
> ramdisk.

The "fake completion" forces log IO to be completed synchronously,
just like a ramdisk does. It's exactly the same behaviour.

> Underlying disk was NVMe, and I checked that no
> merging/batching happened for log IO submissions in base case.
> Completion count were same (as many as submitted) too.

Of course. Async or sync, the amount of IO work an isolated single
threaded, synchronous write workload does is the same, because there
is no other IO that can be aggregated for either the data or journal
writes....

> Call that I disabled in base code (in xfs_buf_submit, for log IOs) is
> _xfs_buf_ioapply.

Yes, I assumed you were doing something like that.

The problem is, you're not looking at what the journal is doing as a
whole when xfs_log_force_lsn() is called from fsync. In this case, the
initial log state is that the current iclogbuf is ACTIVE because we
have no journal writes in progress. The typical, full IO path through
this code is as follows:

fsync() process                 CIL workqueue           log workqueue

xfs_log_force_lsn()
  xlog_cil_force_lsn()
    xlog_cil_push_now(seq)
      <flushes pending push work>
      <queues required push work>
    xlog_wait(xc_commit_wait);
    <blocks waiting for CIL push>
                                <walks items in CIL>
                                <formats into log buffer>
                                xlog_write()
                                .....
                                  iclogbuf goes to SYNCING
                                  xfs_buf_submit(iclogbuf)
                                    __xfs_buf_submit()
                                      submit_bio()
                                <push work complete>
                                <wakes push waiters>
    <wakes, returns to __xfs_log_force_lsn()>
  __xfs_log_force_lsn()
    <iclogbuf in SYNCING>
    xlog_wait(ic_force_wait)
    <blocks waiting for IO completion>

      <IO completion on some CPU>
      xfs_buf_ioend_async()
        queue_work(log-wq)
                                                        <wakes>
                                                        xlog_iodone()
                                                          xlog_state_do_callback()
                                                            wake_up_all(ic_force_wait)
                                                        <work complete>
    <wakes, returns to __xfs_log_force_lsn()>
    <log force done>
<returns to fsync context>

In normal cases, the overhead of IO completion queuing the xlog_iodone
completion work to the log wq can be mostly ignored - it's just part
of the IO completion delay time. What's important to realise is that
it runs some time after the CIL push waiters have been woken and have
gone back to sleep waiting for the log IO to complete, so the
processing is completely separated and there's no contention between
submission and completion.
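FWIW, from your description I'm assuming the hack looks something like
the sketch below. XBF_FAKE_CMPL is a flag name I've invented for
illustration - this is a reconstruction of the idea, not your actual
patch:

/* In xlog_sync(): tag the iclog buffer before submission. */
        bp->b_flags |= XBF_FAKE_CMPL;   /* hypothetical flag */

/* In xfs_buf_submit(): skip bio formation for tagged buffers. */
        atomic_set(&bp->b_io_remaining, 1);
        xfs_buf_ioacct_inc(bp);
        if (!(bp->b_flags & XBF_FAKE_CMPL))
                _xfs_buf_ioapply(bp);   /* normal path: build + submit bios */

        /*
         * With _xfs_buf_ioapply() skipped, b_io_remaining drops
         * straight to zero here, so the completion work is queued
         * from the *submitter's* context while the CIL push is still
         * finalising - that's the submission/completion contention
         * discussed below.
         */
        if (atomic_dec_and_test(&bp->b_io_remaining))
                xfs_buf_ioend_async(bp);

IOWs, the iclogbuf now "completes" in the submission path itself
rather than from a bio completion some time later.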
> So only thing happened differently for log IO
> submitter thread is that it executed bp-completion-handling-code
> (xfs_buf_ioend_async). And that anyway pushes the processing to
> worker. It still remains mostly async, I suppose.

Well, when and how the completion runs depends on a *lot* of things.
Let's simplify - the above case is:

fsync proc          CIL                 IO-complete     log

xfs_log_force_lsn
  <wait for CIL>
                    xlog_write()
                    wake CIL waiters
  <wait for iclogbuf>
                                        <queue to log>
                                                        xlog_iodone()
                                                        wake iclogbuf waiters
  log force done.

Essentially, there are 7 context switches in that, but we just don't
see that overhead because it's tiny compared to the IO time. There's
also no contention between the fsync proc waiting on iclogbuf IO
completion and the iclogbuf completion processing, so that runs as
fast as possible.

> In original form, it
> would have executed extra code to form/sent bio (possibility of
> submission/completion merging, but that did not happen for this
> workload), and completion would have come after some time say T.
> I wondered about impact on XFS if this time T can be made very low by
> underlying storage for certain IOs.

Even when IO times are as short as a few tens of microseconds, this is
sufficient to avoid contention between submission and completion
processing and turn the context switches into noise. And even with
shorter IO delays, the async processing still allows the log state
machine to aggregate concurrent fsyncs - see the recent work with AIO
fsync to scale out to concurrently writing and fsyncing tens of
thousands of files per second on high speed nvme SSDs.

So, why don't fake completions work well? It's because the IO delay
goes away completely, and that causes submission/completion processing
contention in the log code. If we have lots of idle CPUs sitting
around with a synchronous log IO completion, we get this fsync
behaviour:

fsync proc          CIL                 IO-complete     log

xfs_log_force_lsn
  <wait for CIL>
                    xlog_write()
                    <queue to log-wq>
                    wake CIL waiters                    xlog_iodone()
  <wait for iclogbuf>                                   wake iclogbuf waiters
  log force done.

IOWs, we now have the iclogbuf IO completion work racing with the
completion of the CIL push - finalising the log state after the
writes, waking CIL waiters and going to sleep waiting for log IO
completion and state machine transitions. Basically, it causes
cross-CPU contention for the same log cachelines and locks because
we've triggered a synchronous IO completion of the iclogbuf.

That contention kills CPU performance because the code is now having
to wait for the cache coherency protocol to fetch data from other CPU
caches and/or memory. This can easily slow critical code down by 3-4
orders of magnitude. Racing/spurious sleep/wakeup events have a fair
bit of overhead, too.

Now, your other "unexpected" result - let's use affinity to confine
everything to one CPU:

fsync proc          CIL                 IO-complete     log

xfs_log_force_lsn
  <wait for CIL>
                    xlog_write()
                    <queue to log>
                    wake CIL waiters
                                                        xlog_iodone()
                                                        wake iclogbuf waiters
  log force done

Note the difference? The log work can't start until the CIL work has
completed because it's constrained to the same CPU, so it doesn't
cause any contention with the finalising of the CIL push and waking of
waiters. Further, because the log workqueue is a high priority
workqueue and we're confined to a single CPU, it gets processed by the
workqueue infrastructure before it releases the CPU to schedule other
tasks. This changes the order of operations in the code.
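As an aside, the high priority behaviour comes from how the log
workqueue is set up. From memory (so verify against your tree), the
4.15-era initialisation in fs/xfs/xfs_super.c looks something like:

        /*
         * WQ_HIGHPRI puts the work on a high priority kworker pool,
         * so the queued xlog_iodone() work gets the CPU ahead of
         * normal priority tasks once the submitter blocks.
         */
        mp->m_log_workqueue = alloc_workqueue("xfs-log/%s",
                        WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
                        mp->m_fsname);

The iclog buffers have b_ioend_wq pointed at that workqueue when the
log is allocated, so xfs_buf_ioend_async() simply does:

        queue_work(bp->b_ioend_wq, &bp->b_ioend_work);

That's why, when everything is pinned to one CPU, the completion work
runs immediately after the submitting context sleeps rather than
racing with it from another CPU.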
And because it's all on the one CPU, this is effectively a CPU bound
operation and the cachelines stay hot in the CPU and there's no lock
contention to invalidate them. The code runs as fast as it is possible
to run as a result.

Combine those things and the result is that we do less work and the
CPU executes it faster. Not to mention that only running on one CPU
core allows modern CPUs to maximise the clock frequency on that one
core, making the affinity constrained workload look even faster than
one that is spread across multiple cores. (Reliable performance
benchmarking is hard!)

> If underlying device/layers provide some sort of differentiated I/O
> service enabling ultra-low-latency completion for certain IOs
> (flagged as urgent), and one chooses log IO to take that low-latency
> path - won't we see same problem as shown by fake-completion?

Maybe, maybe not. It really depends on what is happening system wide,
not what is happening to any single thread or IO....

IOWs, I wouldn't be trying to extrapolate how a filesystem will react
to a specific type of storage from a single threaded, synchronous
write workload. All you'll succeed in doing is micro-optimising
behaviour for single threaded, synchronous writes on that specific
storage device.

Perhaps it would be better for you to describe what you are trying to
achieve first, so we can point you in the right direction rather than
playing "20 Questions" and still not understanding why the filesystem
isn't doing what you expect it to be doing....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx