I added some timing measurement code and obtained data for "dd" sync
requests (dd if=/dev/zero of=/mnt/f1 bs=4k count=1000 oflag=sync).
A rough sketch of the instrumentation is appended at the end of this
mail. Let me define these times:

T1 = xfs_log_force_lsn time (loosely, the overall time to complete one
     fsync request)

Within xfs_log_force_lsn, these were added:

T2 = xlog_cil_force_lsn time
T3 = xlog_wait time on ic_force_wait

And the last one:

T4 = time elapsed between the wake-up being issued on ic_force_wait
     (during completion) and the thread waiting on ic_force_wait ("dd"
     in this case) resuming execution

Basically T1 = T2 + T3 + some more, and T3 = T4 + some more.

Here is the data for one dd sync request (the trend remains mostly the
same for all requests):

Base case: T1 = 37096,  T2 = 11070, T3 = 19130, T4 = 2410
FC case:   T1 = 101206, T2 = 47150, T3 = 45443, T4 = 29483

T4 in the fake-completion case seemed particularly odd to me: "dd" took
that much time to resume even after the wake-up had been issued on
ic_force_wait! It looks like fast log completion is making the XFS wait
queues behave oddly.

I changed all the log wait queues (ic_force_wait, ic_write_wait, and
xc_commit_wait) into completion variables (despite knowing that a
completion is itself implemented on top of a wait queue); a sketch of
that change is also appended below. It brought the value of T4 down.
Overall numbers with iozone/varmail look like this:

Iozone (48 cpus)
Threads    Base        FC
1          149149.2    151444.5
2          251794.5    255033.1

Varmail (48 cpus)
Threads    Base        FC
1          61500.9     62396.4
2          102834.9    103080.1

However, the base case numbers also improved; the earlier base numbers
were comparatively very low with 48 cpus.

Thanks,

On Fri, Sep 21, 2018 at 5:53 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote:
> > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> > > > Now, your other "unexpected" result - let's use affinity to confine
> > > > everything to one CPU:
> > > >
> > > > fsync proc         CIL               IO-complete          log
> > > > xfs_log_force_lsn
> > > >   <wait for CIL>
> > > >                    xlog_write()
> > > >                    <queue to log>
> > > >                    wake CIL waiters
> > > >                                      xlog_iodone()
> > > >                                      wake iclogbug waiters
> > > > log force done
> > > >
> > > > Note the difference? The log work can't start until the CIL work has
> > > > completed because it's constrained to the same CPU, so it doesn't
> > > > cause any contention with the finalising of the CIL push and waking
> > > > waiters.
> > >
> > > The "log work can't start" part does not sound very good either. It
> > > needed to be done anyway before the task waiting for fsync is woken.
> >
> > I'm beginning to think you don't really understand how logging in
> > XFS works. If you want to improve the logging subsystem, it would
> > be a good idea for you to read and understand this document so you
> > have some idea of how the data that is written to the log gets
> > there and what happens to it after it's in the log.
> >
> > Documentation/filesystems/xfs-delayed-logging-design.txt
>
> It occurred to me yesterday that looking at the journal in a
> different way might help explain how it all works.
>
> Instead of looking at it as an IO engine, think of how an
> out-of-order CPU is designed to be made up of multiple stages in a
> pipeline - each stage does a small piece of work that it passes on
> to the next stage to process. An individual operation progresses
> serially through the pipeline, but each stage of the pipeline can be
> operating on a different operation. Hence we can have multiple
> operations in flight at once, and the operations can also be run
> out of order as dynamic stage completion scheduling dictates.
> However, from a high level everything appears to complete in order,
> because the re-ordering stages put everything in order once the
> individual operations have been executed.
>
> Similarly, the XFS journalling subsystem is an out of order,
> multi-stage pipeline with a post-IO re-ordering stage to ensure the
> end result is that individual operations always appear to complete
> in order. Indeed, what ends up on disk in the journal is not in
> order, so one of the things that log recovery does is rebuild the
> state necessary to reorder operations correctly before replay so
> that, again, it appears like everything occurred in the order that
> the transactions were committed to the journal.
>
> So perhaps looking at it as a multi-stage pipeline might also help
> explain why fake-completion changes the behaviour in unpredictable
> ways, i.e. it basically chops out stages of the pipeline, changing
> the length of the pipeline and the order in which stages of the
> pipeline are executed.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
Joshi
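
P.S. For completeness, here is a minimal sketch of what the T1/T2/T3/T4
instrumentation can look like. The lat_start()/lat_end() helpers, struct
lat_sample, and the use of ktime_get()/trace_printk() are illustrative,
not the actual patch; only the call sites being measured (described
above) come from the measurements themselves.

#include <linux/kernel.h>
#include <linux/ktime.h>

/*
 * Hypothetical helper: bracket a region of code with ktime_get() and
 * report the elapsed time via trace_printk().
 */
struct lat_sample {
        ktime_t start;
};

static inline void lat_start(struct lat_sample *s)
{
        s->start = ktime_get();
}

static inline void lat_end(struct lat_sample *s, const char *tag)
{
        s64 ns = ktime_to_ns(ktime_sub(ktime_get(), s->start));

        trace_printk("%s: %lld ns\n", tag, ns);
}

/*
 * Usage (illustrative): in xfs_log_force_lsn(), one sample brackets the
 * whole function (T1), one brackets the xlog_cil_force_lsn() call (T2),
 * and one brackets the xlog_wait() on iclog->ic_force_wait (T3).  One
 * way to get T4 is to stash a ktime_get() value next to the wake-up
 * issued on ic_force_wait on the completion side and compare it with
 * ktime_get() right after xlog_wait() returns in the waiter.
 */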
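And a stand-alone before/after sketch of the wait-queue-to-completion
change. struct my_iclog and the function names are made up; only the
switch from a wait_queue_head_t woken with wake_up_all() to a struct
completion woken with complete_all() corresponds to the actual change
(which touched ic_force_wait, ic_write_wait and xc_commit_wait, as
mentioned above).

#include <linux/completion.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustrative stand-in for the real in-core log buffer. */
struct my_iclog {
        spinlock_t              lock;
        int                     done;
        wait_queue_head_t       force_waitq;    /* before: wait queue */
        struct completion       force_done;     /* after: completion  */
};

static void my_iclog_init(struct my_iclog *ic)
{
        spin_lock_init(&ic->lock);
        ic->done = 0;
        init_waitqueue_head(&ic->force_waitq);
        init_completion(&ic->force_done);
}

/* before: waiter sleeps on the wait queue until the flag is set */
static void wait_for_force_waitqueue(struct my_iclog *ic)
{
        wait_event(ic->force_waitq, ic->done);
}

/* before: completion side sets the flag and wakes all waiters */
static void force_done_waitqueue(struct my_iclog *ic)
{
        spin_lock(&ic->lock);
        ic->done = 1;
        spin_unlock(&ic->lock);
        wake_up_all(&ic->force_waitq);
}

/* after: the same handshake expressed with a completion */
static void wait_for_force_completion(struct my_iclog *ic)
{
        wait_for_completion(&ic->force_done);
}

static void force_done_completion(struct my_iclog *ic)
{
        complete_all(&ic->force_done);
}

Note that because iclogs are reused, a real conversion also has to
re-arm the completion (reinit_completion()) between uses, and whether
complete() or complete_all() is appropriate depends on how many waiters
a given call site is expected to wake; none of that is shown here.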