Re: Performance regression with async inode inactivation

On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote:
> > On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > > Hmmm, I didn't see this at first.  What's the filesystem layout
> > > (xfs_info) and the CLI for the test that you ran? How many CPUs was
> > > the test run on?
> > 
> > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
> > That's why I've run stress-unlink with 48 processes.
> > 
> > xfs_info is:
> > 
> > meta-data=/dev/sdb1              isize=512    agcount=4, agsize=29303104 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=0, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=117212416, imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=57232, version=2
> 
> OK, default log is ~230MB, fs size is ~400GB?

Yep.

<snip log analysis>

> > We can see big differences in the amount of log writes as well as logged
> > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
> > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
> > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
> > of 186 MB of log IO matches well with the observed difference in the amount
> > of writes in block traces.
> > 
> > I'm not sure why the amount of logged blocks differs so much.
> 
> fsync() interactions.
> 
> With the original code, the two unlink transactions are temporally
> adjacent as both are in unlinkat() syscall context. One is directly
> run by the syscall, the other in task_work context at syscall exit
> when processing the last reference of the file being dropped.
> 
> In general, that means the objects modified (inode cluster, AGI,
> inode, etc) are captured by the same CIL context and so aggregate in
> memory as a single change (essentially log dedupe). Then the next
> fsync() from some other context runs, pushing the CIL to disk and
> we only log those objects modified in unlink to the journal once.
> 
> With deferred activation, the two phases of unlink are temporally
> decoupled. We get a bunch of inodes running the first phase in
> unlink() context, but the second phase is run later in a tight loop
> from workqueue context. But this means that fsync()s occur between
> the two phases, and hence the objects modified in the two phases of
> unlink are modified in two separate CIL contexts. Hence they get
> written to the log twice.
> 
> Depending on the way things work out, deferred inactivation also
> results in longer unlinked inode chains, resulting in more objects
> being logged per unlink than without deferred inactivation, as the
> inodes are added to the unlink chain and then immediately removed
> before any others are added. Hence deferred inode inactivation will
> increase the amount written to the log per unlink if the two phases
> of unlink are split across journal checkpoints.
> 
> IOWs, an increase in log writes for open-write-fsync-close-unlink
> workloads is not unexpected. But this workload isn't really a real
> world workload in any way - we generally don't do data integrity
> writes only to immediately throw the data away. :/
> 
> Remove the fsync and we do have a real world workload - temporary
> files for compiles, etc. Then everything should mostly get logged
> in the same CIL context because all transactions are run
> asynchronously and aggregate in memory until the CIL is full and
> checkpoints itself. Hence log writes shouldn't change very much at
> all for such workloads.

OK, that makes sense. Thanks for the explanation. To verify your theory,
I've removed fsync(2) from the test program, so now it is a pure create,
write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
loops of create-unlink to increase runtime (the workload does barely any
writes, so this should not matter wrt the log):

default mkfs params:
	AVG	STDDEV
before	2.0380	0.1597
after	2.7356	0.4712

agcount=96, log size 512M
	AVG	STDDEV
before	1.0610	0.0227
after	1.2508	0.0218

So there is still a notable regression with async inactivation. With default
mkfs params we'd need more runs to get more reliable results (note the
rather high standard deviation), but with the high AG count the results show
a pretty stable 20% regression - so let's have a look at that.

Looking at xfs stats, there are barely any differences between before &
after - 'after' writes a bit more to the log, but it is ~1.5 MB over the
whole benchmark run, altogether spending some 8ms doing IO, so that's not
it. Generally the workload seems to be CPU / memory bound now (it does
barely any IO). Perf shows we are heavily contending on some spinlock in
xfs_cil_commit() - I presume this is the xc_cil_lock. This actually happens
both before and after, but we spend somewhat more time there with async
inactivation. Likely this is related to work being done from worker
threads. Perf stats for comparison:

before
         51,135.08 msec cpu-clock                 #   47.894 CPUs utilized          
             4,699      context-switches          #    0.092 K/sec                  
               382      cpu-migrations            #    0.007 K/sec                  
             1,228      page-faults               #    0.024 K/sec                  
   128,884,972,351      cycles                    #    2.520 GHz                    
    38,517,767,839      instructions              #    0.30  insn per cycle         
     8,337,611,468      branches                  #  163.051 M/sec                  
        39,749,736      branch-misses             #    0.48% of all branches        
        25,225,109      cache-misses                                                

       1.067666170 seconds time elapsed

after
         65,353.43 msec cpu-clock                 #   47.894 CPUs utilized          
            43,737      context-switches          #    0.669 K/sec                  
             1,824      cpu-migrations            #    0.028 K/sec                  
             1,953      page-faults               #    0.030 K/sec                  
   155,144,150,867      cycles                    #    2.374 GHz                    
    45,280,145,337      instructions              #    0.29  insn per cycle         
    10,027,567,384      branches                  #  153.436 M/sec                  
        39,554,691      branch-misses             #    0.39% of all branches        
        30,203,567      cache-misses                                                

       1.364539400 seconds time elapsed

So we can see a huge increase in context switches, a notable increase in
cache misses, and a decrease in cycles/s - so perhaps we are bouncing cache
lines more? Anyway, I guess this is kind of expected given the nature of
async inactivation; I just wanted to highlight that there are regressions
without fsync in the game as well.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR


