Re: [PATCH 00/13] IO-less dirty throttling

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 17 Nov 2010 18:25:38 +1100

On Wed, Nov 17, 2010 at 11:58:21AM +0800, Wu Fengguang wrote:
> Andrew,
> 
> This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
> <http://thread.gmane.org/gmane.linux.kernel.mm/52966>.
> 
> The basic idea is to introduce a small region under the bdi dirty threshold.
> The task will be throttled gently when stepping into the bottom of region,
> and get throttled more and more aggressively as bdi dirty+writeback pages
> goes up closer to the top of region. At some point the application will be
> throttled at the right bandwidth that balances with the device write bandwidth.
> (the first patch and documentation has more details)
> 
> Changes from initial RFC:
> 
> - adaptive ratelimiting, to reduce overheads when under throttle threshold
> - prevent overrunning dirty limit on lots of concurrent dirtiers
> - add Documentation/filesystems/writeback-throttling-design.txt
> - lower max pause time from 200ms to 100ms; min pause time from 10ms to 1jiffy
> - don't drop the laptop mode code
> - update and comment the trace event
> - benchmarks on concurrent dd and fs_mark covering both large and tiny files
> - bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
>   otherwise it will drift fast and fluctuate
> - don't call balance_dirty_pages_ratelimit() when writing to already dirtied
>   pages, otherwise the task will be throttled too much
> 
> The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
> access I put them in
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2

Great - just pulled it down and I'll start running some tests.

The tree that I'm testing has the vfs inode lock breakup in it, the
inode cache SLAB_DESTROY_BY_RCU series, a large bunch of XFS lock
breakup patches and now the above branch in it. It's here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working

> On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> improves IO throughput from 38MB/s to 42MB/s.

Excellent - I suspect that the reduction in contention on the inode
writeback locks is responsible for dropping the CPU usage right down.

I'm seeing throughput for a _single_ large dd (100GB) increase from ~650MB/s
to 700MB/s with your series. For other numbers of dd's:
							ctx switches
# dd processes		total throughput	 total        per proc
   1			  700MB/s		    400/s	100/s
   2			  700MB/s		    500/s	100/s
   4			  700MB/s		    700/s	100/s
   8			  690MB/s		  1,100/s	100/s
  16			  675MB/s		  2,000/s	110/s
  32			  675MB/s		  5,000/s	150/s
 100			  650MB/s		 22,000/s	210/s
1000			  600MB/s		160,000/s	160/s

A couple of things I noticed - firstly, the number of context
switches scales roughly with the number of writing processes - is
there any reason for waking every writer 100-200 times a second? At
the thousand writer mark, we reach a context switch rate of more
than one per page we complete IO on. Any idea on whether this can be
improved at all?
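
As a rough check of the "more than one context switch per page" figure,
here is a minimal sketch using the 1000-writer row of the table above.
The 4KiB page size is an assumption (the mail does not state it), and
both readings of "MB" are shown because the table does not say which
one is meant.

```python
# Figures taken from the 1000-writer row of the table above.
THROUGHPUT_MB_PER_S = 600
CTX_SWITCHES_PER_S = 160_000

# Assumption: 4KiB pages (not stated in the mail).
PAGE_SIZE = 4096


def pages_per_second(mb_per_s, bytes_per_mb):
    """Pages completed per second at the given throughput."""
    return mb_per_s * bytes_per_mb / PAGE_SIZE


pages_binary = pages_per_second(THROUGHPUT_MB_PER_S, 1024 * 1024)
pages_decimal = pages_per_second(THROUGHPUT_MB_PER_S, 1000 * 1000)

for label, pages in (("MiB", pages_binary), ("MB", pages_decimal)):
    ratio = CTX_SWITCHES_PER_S / pages
    print(f"{label}: {pages:,.0f} pages/s, {ratio:.2f} ctx switches per page")
```

Under either reading the ratio comes out slightly above one context
switch per page written.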

Also, the system CPU usage while throttling stayed quite low but not
constant. The more writing processes, the lower the system CPU usage
(despite the increase in context switches). Further, if the dd's
didn't all start at the same time, then system CPU usage would
roughly double when the first dd's complete and cpu usage stayed
high until all the writers completed. So there's some trigger when
writers finish/exit there that is changing throttle behaviour.
Increasing the number of writers does not seem to have any adverse
effects.

BTW, killing a thousand dd's all stuck on the throttle is near
instantaneous. ;)

> The fs_mark benchmark is interesting. The CPU overheads are almost reduced by
> half. Before patch the benchmark is actually bounded by CPU. After patch it's
> IO bound, but strangely the throughput becomes slightly slower.

The "App Overhead" that is measured by fs_mark is the time it spends
doing stuff in userspace rather than in syscalls. Changes in the app
overhead typically imply a change in syscall CPU cache footprint. A
substantial reduction in app overhead for the same amount of work
is good. :)

[cut-n-paste from your comment about being io bound below]

> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.17    0.00   97.87    1.08    0.00    0.88

That looks CPU bound, not IO bound.

> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdc               0.00    63.00    0.00  125.00     0.00  1909.33    30.55     3.88   31.65   6.57  82.13
> sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sde               0.00    19.00    0.00  112.00     0.00  1517.17    27.09     3.95   35.33   8.00  89.60
> sdg               0.00    92.67    0.33  126.00     2.67  1773.33    28.12    14.83  120.78   7.73  97.60
> sdf               0.00    32.33    0.00   91.67     0.00  1408.17    30.72     4.84   52.97   7.72  70.80
> sdh               0.00    17.67    0.00    5.00     0.00   124.00    49.60     0.07   13.33   9.60   4.80
> sdi               0.00    44.67    0.00    5.00     0.00   253.33   101.33     0.15   29.33  10.93   5.47
> sdl               0.00   168.00    0.00  135.67     0.00  2216.33    32.67     6.41   45.42   5.75  78.00
> sdk               0.00   225.00    0.00  123.00     0.00  2355.83    38.31     9.50   73.03   6.94  85.33
> sdj               0.00     1.00    0.00    2.33     0.00    26.67    22.86     0.01    2.29   1.71   0.40
> sdb               0.00    14.33    0.00  101.67     0.00  1278.00    25.14     2.02   19.95   7.16  72.80
> sdm               0.00   150.33    0.00  144.33     0.00  2344.50    32.49     5.43   33.94   5.39  77.73

And that's totalling ~1000 iops during the workload - you're right
in that it doesn't look at all well balanced. The device my test
filesystem is on is running at ~15,000 iops and 120MB/s for the same
workload, but there is another layer of reordering on the host as
well as 512MB of BBWC between the host and the spindles, so maybe
you won't be able to get near that number with your setup....

[.....]

> avg                                    1182.761      533488581.833
> 
> 2.6.36+
> FSUse%        Count         Size    Files/sec     App Overhead
....
> avg                                    1146.768      294684785.143

The difference between the files/s numbers is pretty much within
typical variation of the benchmark. I tend to time the running of
the entire benchmark because the files/s output does not include the
"App Overhead" time and hence you can improve files/s but increase
the app overhead and the overall wall time can be significantly
slower...

FWIW, I'd consider the throughput (1200 files/s) to be quite low for 12
disks and a number of CPUs being active. I'm not sure how you
configured the storage/filesystem, but you should configure the
filesystem with at least 2x as many AGs as there are CPUs, and run
one create thread per CPU rather than one per disk.  Also, making
sure you have a largish log (512MB in this case) is helpful, too.
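
The sizing advice above can be checked against the mkfs.xfs and fs_mark
command lines that appear further down in this mail. This is only a
sketch: the 4KiB filesystem block size is an assumption (it is the usual
mkfs.xfs default but is not stated here), and the variable names are
illustrative.

```python
# Values taken from the commands later in this mail.
CPUS = 8                 # "8p" VM
AG_COUNT = 16            # mkfs.xfs -d agcount=16
LOG_SIZE_BLOCKS = 131072 # mkfs.xfs -l size=131072b (b = filesystem blocks)
CREATE_THREADS = 8       # one -d directory per fs_mark thread

# Assumption: 4KiB filesystem blocks (not stated in the mail).
FS_BLOCK_SIZE = 4096

log_size_mib = LOG_SIZE_BLOCKS * FS_BLOCK_SIZE // (1024 * 1024)

print(f"log size: {log_size_mib} MiB")
print(f"AGs per CPU: {AG_COUNT / CPUS:.0f}")
print(f"create threads per CPU: {CREATE_THREADS / CPUS:.0f}")
```

With those assumptions the log works out to the 512MB mentioned above,
with two AGs and one create thread per CPU.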

For example, I've got a simple RAID0 of 12 disks that is 1.1TB in
size when I stripe the outer 10% of the drives together (or 18TB if
I stripe the larger inner partitions on the disks). The way I
normally run it (on an 8p/4GB RAM VM) is:

In the host:

$ cat dmtab.fast.12drive 
0 2264924160 striped  12 1024 /dev/sdb1 0 /dev/sdc1 0 /dev/sdd1 0 /dev/sde1 0 /dev/sdf1 0 /dev/sdg1 0 /dev/sdh1 0 /dev/sdi1 0 /dev/sdj1 0 /dev/sdk1 0 /dev/sdl1 0 /dev/sdm1 0
$ sudo dmsetup create fast dmtab.fast.12drive
$ sudo mount -o nobarrier,logbsize=262144,delaylog,inode64 /dev/mapper/fast /mnt/fast

[VM creation script uses fallocate to preallocate 1.1TB file as raw
disk image inside /mnt/fast, appears to guest as /dev/vdb]
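
For readers unfamiliar with the dm table above, a small sketch of the
arithmetic behind it. It assumes the usual device-mapper convention that
lengths and chunk sizes are in 512-byte sectors; the variable names are
illustrative.

```python
# Fields from the "striped" line in dmtab.fast.12drive above.
TABLE_SECTORS = 2264924160  # total length of the mapped device
STRIPES = 12                # number of underlying partitions
CHUNK_SECTORS = 1024        # stripe chunk size

# Assumption: device-mapper sectors are 512 bytes.
SECTOR_SIZE = 512

total_bytes = TABLE_SECTORS * SECTOR_SIZE
chunk_kib = CHUNK_SECTORS * SECTOR_SIZE // 1024
per_disk_gib = TABLE_SECTORS // STRIPES * SECTOR_SIZE / 2**30

# A striped target needs its length to be a whole number of full stripes.
whole_stripes = TABLE_SECTORS % (STRIPES * CHUNK_SECTORS) == 0

print(f"total size: {total_bytes / 1e12:.2f} TB ({total_bytes / 2**40:.2f} TiB)")
print(f"chunk size: {chunk_kib} KiB")
print(f"per-disk share: {per_disk_gib:.0f} GiB")
print(f"length is a whole number of stripes: {whole_stripes}")
```

This is consistent with the "1.1TB" figure quoted for the stripe of the
outer partitions.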

In the VM:

# mkfs.xfs -f -l size=131072b -d agcount=16 /dev/vdb
....
# mount -o nobarrier,inode64,delaylog,logbsize=262144 /dev/vdb /mnt/scratch
# /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
>       -d /mnt/scratch/0 -d /mnt/scratch/1 \
>       -d /mnt/scratch/2 -d /mnt/scratch/3 \
>       -d /mnt/scratch/4 -d /mnt/scratch/5 \
>       -d /mnt/scratch/6 -d /mnt/scratch/7

#  ./fs_mark  -D  10000  -S0  -n  100000  -s  1  -L  63  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7 
#       Version 3.3, 8 thread(s) starting at Wed Nov 17 15:27:33 2010
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            1      27825.7         11686554
     0      1600000            1      22650.2         13199876
     1      2400000            1      23606.3         12297973
     1      3200000            1      23060.5         12474339
     1      4000000            1      22677.4         12731120
     2      4800000            1      23095.7         12142813
     2      5600000            1      22639.2         12813812
     2      6400000            1      23447.1         12330158
     3      7200000            1      22775.8         12548811
     3      8000000            1      22766.5         12169732
     3      8800000            1      21685.5         12546771
     4      9600000            1      22899.5         12544273
     4     10400000            1      22950.7         12894856
.....

The above numbers are without your patch series. The following
numbers are with your patch series:

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            1      26163.6         10492957
     0      1600000            1      21960.4         10431605
     1      2400000            1      22099.2         10971110
     1      3200000            1      22052.1         10470168
     1      4000000            1      21264.4         10398188
     2      4800000            1      21815.3         10445699
     2      5600000            1      21557.6         10504866
     2      6400000            1      21856.0         10421309
     3      7200000            1      21853.5         10613164
     3      8000000            1      21309.4         10642358
     3      8800000            1      22130.8         10457972
.....

Ok, so throughput is also down by ~5% from ~23k files/s to ~22k
files/s. On the plus side:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00   43.45   46.56    0.00    8.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
vdb               0.00 12022.20    1.60 11431.60     0.01   114.09    20.44    32.34    2.82   0.08  94.64
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The number of write IOs has dropped significantly and CPU usage is
more than halved - this was running at ~98% system time!  So for a
~5% throughput reduction, CPU usage has dropped by ~55% and the
number of write IOs has dropped by ~25%. That's a pretty good
result - it's the single biggest drop in CPU usage as a result of
preventing lock contention I've seen on an 8p machine in the past 6
months. Very promising - I guess it's time to look at the code again. :)
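
The ~5% and ~55% figures can be reproduced from the numbers in this
mail. The sketch below averages only the fs_mark rows shown above (the
runs are truncated, so the means are approximate) and takes ~98% as the
earlier system time, as stated in the text.

```python
from statistics import mean

# Files/sec columns from the two fs_mark tables above (truncated runs).
files_per_sec_before = [
    27825.7, 22650.2, 23606.3, 23060.5, 22677.4, 23095.7, 22639.2,
    23447.1, 22775.8, 22766.5, 21685.5, 22899.5, 22950.7,
]
files_per_sec_after = [
    26163.6, 21960.4, 22099.2, 22052.1, 21264.4, 21815.3, 21557.6,
    21856.0, 21853.5, 21309.4, 22130.8,
]

# %system: ~98% before (from the text), 43.45% after (iostat above).
SYSTEM_CPU_BEFORE = 98.0
SYSTEM_CPU_AFTER = 43.45


def pct_drop(old, new):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


throughput_drop = pct_drop(mean(files_per_sec_before), mean(files_per_sec_after))
cpu_drop = pct_drop(SYSTEM_CPU_BEFORE, SYSTEM_CPU_AFTER)

print(f"throughput drop: {throughput_drop:.1f}%")
print(f"system CPU drop: {cpu_drop:.1f}%")
```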

Hmmm - looks like the probable bottleneck is that the flusher thread
is close to CPU bound:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2215 root      20   0     0    0    0 R   86  0.0   2:16.43 flush-253:16

             samples  pcnt function                        DSO
             _______ _____ _______________________________ _________________

            32436.00  5.8% _xfs_buf_find                   [kernel.kallsyms]
            26119.00  4.7% kmem_cache_alloc                [kernel.kallsyms]
            17700.00  3.2% __ticket_spin_lock              [kernel.kallsyms]
            14592.00  2.6% xfs_log_commit_cil              [kernel.kallsyms]
            14341.00  2.6% _raw_spin_unlock_irqrestore     [kernel.kallsyms]
            12537.00  2.2% __kmalloc                       [kernel.kallsyms]
            12098.00  2.2% writeback_single_inode          [kernel.kallsyms]
            12078.00  2.2% xfs_iunlock                     [kernel.kallsyms]
            10712.00  1.9% redirty_tail                    [kernel.kallsyms]
            10706.00  1.9% __make_request                  [kernel.kallsyms]
            10469.00  1.9% bit_waitqueue                   [kernel.kallsyms]
            10107.00  1.8% kfree                           [kernel.kallsyms]
            10028.00  1.8% _cond_resched                   [kernel.kallsyms]
             9244.00  1.7% xfs_fs_write_inode              [kernel.kallsyms]
             8759.00  1.6% xfs_iflush_cluster              [kernel.kallsyms]
             7944.00  1.4% queue_io                        [kernel.kallsyms]
             7924.00  1.4% radix_tree_gang_lookup_tag_slot [kernel.kallsyms]
             7468.00  1.3% kmem_cache_free                 [kernel.kallsyms]
             7454.00  1.3% xfs_bmapi                       [kernel.kallsyms]
             7149.00  1.3% writeback_sb_inodes             [kernel.kallsyms]
             5882.00  1.1% xfs_btree_lookup                [kernel.kallsyms]
             5811.00  1.0% __memcpy                        [kernel.kallsyms]
             5446.00  1.0% xfs_alloc_ag_vextent_near       [kernel.kallsyms]
             5346.00  1.0% xfs_trans_buf_item_match        [kernel.kallsyms]
             4704.00  0.8% xfs_perag_get                   [kernel.kallsyms]

That's looking like it's XFS overhead flushing inodes, so that's not
an issue caused by this patch. Indeed, I'm used to seeing 30-40% of
the CPU time here in __ticket_spin_lock, so it certainly appears
that most of the CPU time saving comes from the removal of
contention on the inode_wb_list_lock. I guess it's time for me to
start looking at multiple bdi-flusher threads again....

> I noticed that
> 
> 1) BdiWriteback can grow very large. For example, bdi 8:16 has 72960KB
>    writeback pages, however the disk IO queue can hold at most
>    nr_request*max_sectors_kb=128*512kb=64MB writeback pages. Maybe xfs manages
>    to create perfect sequential layouts and writes, and the other 8MB writeback
>    pages are flying inside the disk?

There's a pretty good chance that this is exactly what is happening.
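
For reference, the arithmetic in the quoted observation, as a minimal
sketch. All values come from the quoted text; treating
nr_requests * max_sectors_kb as the queue capacity is the quoted mail's
own simplification, not something verified here.

```python
# Values from the quoted text above.
NR_REQUESTS = 128         # block queue depth
MAX_SECTORS_KB = 512      # largest request size, in KB
BDI_WRITEBACK_KB = 72960  # BdiWriteback reported for bdi 8:16

queue_capacity_kb = NR_REQUESTS * MAX_SECTORS_KB
excess_kb = BDI_WRITEBACK_KB - queue_capacity_kb

print(f"queue capacity: {queue_capacity_kb / 1024:.0f} MB")
print(f"writeback beyond the queue: {excess_kb / 1024:.2f} MB")
```

The excess is a little over 7MB, which is the "other 8MB" the quoted
text rounds to.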

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx