On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote:
> > On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > > Hmmm, I didn't see this at first. What's the filesystem layout
> > > (xfs_info) and the CLI for the test that you ran? How many CPUs was
> > > the test run on?
> > 
> > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
> > That's why I've run stress-unlink with 48 processes.
> > 
> > xfs_info is:
> > 
> > meta-data=/dev/sdb1         isize=512    agcount=4, agsize=29303104 blks
> >          =                  sectsz=512   attr=2, projid32bit=1
> >          =                  crc=1        finobt=1, sparse=0, rmapbt=0
> >          =                  reflink=0
> > data     =                  bsize=4096   blocks=117212416, imaxpct=25
> >          =                  sunit=0      swidth=0 blks
> > naming   =version 2         bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log      bsize=4096   blocks=57232, version=2
> 
> OK, default log is ~230MB, fs size is ~400GB?

Yep.

<snip log analysis>

> > We can see big differences in the amount of log writes as well as logged
> > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
> > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
> > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
> > of 186 MB of log IO matches well with the observed difference in the amount
> > of writes in block traces.
> > 
> > I'm not sure why the amount of logged blocks differs so much.
> 
> fsync() interactions.
> 
> On the original code, the two unlink transactions are temporally
> adjacent as both are in unlinkat() syscall context. One is directly
> run by the syscall, the other in task_run context at syscall exit
> when processing the last reference of the file being dropped.
> 
> In general, that means the objects modified (inode cluster, AGI,
> inode, etc) are captured by the same CIL context and so aggregate in
> memory as a single change (essentially log dedupe).
> Then the next
> fsync() from some other context runs, pushing the CIL to disk and
> we only log those objects modified in unlink to the journal once.
> 
> With deferred inactivation, the two phases of unlink are temporally
> decoupled. We get a bunch of inodes running the first phase in
> unlink() context, but the second phase is run later in a tight loop
> from workqueue context. But this means that fsync()s occur between
> the two phases, and hence the objects modified in the two phases of
> unlink are modified in two separate CIL contexts. Hence they get
> written to the log twice.
> 
> Depending on the way things work out, deferred inactivation also
> results in longer unlinked inode chains, resulting in more objects
> being logged per unlink than without deferred inactivation, as the
> inodes are added to the unlink chain and then immediately removed
> before any others are added. Hence deferred inode inactivation will
> increase the amount written to the log per unlink if the two phases
> of unlink are split across journal checkpoints.
> 
> IOWs, an increase in log writes for open-write-fsync-close-unlink
> workloads is not unexpected. But this workload isn't really a real
> world workload in any way - we generally don't do data integrity
> writes only to immediately throw the data away. :/
> 
> Remove the fsync and we do have a real world workload - temporary
> files for compiles, etc. Then everything should mostly get logged
> in the same CIL context because all transactions are run
> asynchronously and aggregate in memory until the CIL is full and
> checkpoints itself. Hence log writes shouldn't change very much at
> all for such workloads.

OK, that makes sense. Thanks for the explanation. So to verify your
theory, I've removed fsync(2) from the test program. So now it is a pure
create, write, unlink workload.
Results of "stress-unlink 48 /mnt", now for 5000 loops of create-unlink
to increase runtime (but the workload does barely any writes so it
should not matter wrt the log):

default mkfs params:
		AVG	STDDEV
	before	2.0380	0.1597
	after	2.7356	0.4712

agcount=96, log size 512M:
		AVG	STDDEV
	before	1.0610	0.0227
	after	1.2508	0.0218

So still a notable regression with async inactivation. With default mkfs
params we'd need more runs to get more reliable results (note the rather
high standard deviation), but with the high AG count the results show a
pretty stable 20% regression - so let's have a look at that.

Looking at xfs stats there are barely any differences between before &
after - 'after' writes a bit more to the log, but it is ~1.5 MB over the
whole benchmark run, altogether spending some 8 ms doing IO, so that's
not it. Generally the workload seems to be CPU / memory bound now (it
does barely any IO). Perf shows we are heavily contending on some
spinlock in xfs_cil_commit() - I presume this is the xc_cil_lock. This
actually happens both before and after, but we seem to spend somewhat
more time there with async inactivation. Likely this is related to work
being done from worker threads.
Perf stats for comparison:

before
	      51,135.08 msec cpu-clock        #   47.894 CPUs utilized
	          4,699 context-switches      #    0.092 K/sec
	            382 cpu-migrations        #    0.007 K/sec
	          1,228 page-faults           #    0.024 K/sec
	128,884,972,351 cycles                #    2.520 GHz
	 38,517,767,839 instructions          #    0.30  insn per cycle
	  8,337,611,468 branches              #  163.051 M/sec
	     39,749,736 branch-misses         #    0.48% of all branches
	     25,225,109 cache-misses

	    1.067666170 seconds time elapsed

after
	      65,353.43 msec cpu-clock        #   47.894 CPUs utilized
	         43,737 context-switches      #    0.669 K/sec
	          1,824 cpu-migrations        #    0.028 K/sec
	          1,953 page-faults           #    0.030 K/sec
	155,144,150,867 cycles                #    2.374 GHz
	 45,280,145,337 instructions          #    0.29  insn per cycle
	 10,027,567,384 branches              #  153.436 M/sec
	     39,554,691 branch-misses         #    0.39% of all branches
	     30,203,567 cache-misses

	    1.364539400 seconds time elapsed

So we can see a huge increase in context-switches, a notable increase in
cache-misses, and a decrease in cycles/s, so perhaps we are bouncing
cache lines more? Anyway, I guess this is kind of expected due to the
nature of async inactivation; I just wanted to highlight that there are
regressions without fsync in the game as well.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR