[PATCH v3 0/4] xfs: rework near mode extent allocation

Hi all,

Here's v3 of the near mode extent allocation rework. The most obvious
difference from v2 is that I've dropped the by-size and exact mode
algorithm reworks and subsequent refactoring from the series for now. It
became a bit unwieldy to keep everything together with all of the
refactoring, so I plan to follow up with those bits once the near mode
algorithm and associated refactoring have settled. Refer to the previous
version(s) for an idea of how this approach facilitates reuse by the
other allocation modes.

Another notable change from v2 is the reintroduction of the "last block
scan" in the near mode algorithm. I found that this logic tends to
dominate allocations from a clean filesystem for quite some time, so
retaining it keeps behavior/performance consistent between the current
code and this patch and makes the fundamental algorithm change a bit
more conservative. From an algorithmic perspective, this change now
essentially enhances the fallback algorithm from a pure bnobt span to a
combined bnobt/cntbt span, improving worst case performance at a slight
cost in locality.
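
For illustration only, here's a minimal userspace sketch of the combined
span idea. It assumes the free extents are mirrored into two sorted
arrays: by_bno[] in block number order (standing in for the bnobt) and
by_cnt[] in ascending size order (standing in for the cntbt). All names
here are mine; the real code drives btree cursors rather than array
indexes.

#include <stddef.h>
#include <stdint.h>

struct ext {
	uint64_t	bno;	/* starting block */
	uint64_t	len;	/* length in blocks */
};

/*
 * Interleave a locality-ordered (bnobt-style) scan outward from the
 * hint with a size-ordered (cntbt-style) scan and return the first
 * extent that satisfies minlen. The cntbt leg bounds the worst case:
 * if everything near the hint is too small, the size-ordered scan
 * still terminates quickly.
 */
static const struct ext *
combined_span(const struct ext *by_bno, int nbno, int bno_start,
	      const struct ext *by_cnt, int ncnt, uint64_t minlen)
{
	int	left = bno_start, right = bno_start + 1, c = 0;

	while (left >= 0 || right < nbno || c < ncnt) {
		if (left >= 0 && by_bno[left].len >= minlen)
			return &by_bno[left];
		left--;
		if (right < nbno && by_bno[right].len >= minlen)
			return &by_bno[right];
		right++;
		if (c < ncnt && by_cnt[c].len >= minlen)
			return &by_cnt[c];
		c++;
	}
	return NULL;
}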

This version has been more thoroughly tested than the previous ones.
I've run fstests in various configurations without regression, along
with some performance tests that compare against the existing allocator
in terms of performance and resulting allocation locality. While the
latter is a bit ad hoc, it provides some data on how much this patch
changes allocation locality over time. IOW, the locality data is not
meant for deep analysis or detailed comparison between the algorithms,
but rather to provide a data point that locality remains effective with
the algorithm change.

I'll append raw data from some of the more recent filebench tests to
the end of this mail. For context, each test runs on the same 16-CPU,
128GB RAM box against the same striped LVM volume. The test is a simple
64-way concurrent file write+fsync of a random distribution of file
sizes from 4k to 500MB, filling ~70% of a 500GB (16 AG) fs. The locality
output is from a hacked up version of the xfs_io 'frag' command that
scans for and measures distances of known near mode allocations from the
presumed hint (i.e., the previous extent in a file, or the inode
location for initial extents). The largest histogram bucket is the AG
size and represents AG switches (i.e., an allocation occurred in a
different AG from the locality hint).
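
The bucketing works roughly like the following hypothetical helper
(count_distance() and everything around it is my own illustration, not
actual xfs_io code):

#include <stdint.h>

#define NBUCKETS	25	/* power-of-two ranges + AG switch bucket */

/*
 * Bucket the distance of an allocation from its presumed locality hint
 * into power-of-two ranges. Any distance at or beyond the AG size lands
 * in the final bucket and counts as an AG switch.
 */
static void
count_distance(uint64_t alloc_bno, uint64_t hint_bno, uint64_t agsize,
	       uint64_t hist[NBUCKETS])
{
	uint64_t	dist = alloc_bno > hint_bno ? alloc_bno - hint_bno :
						      hint_bno - alloc_bno;
	int		b = 0;

	if (dist >= agsize) {
		hist[NBUCKETS - 1]++;	/* allocation switched AGs */
		return;
	}
	while (b < NBUCKETS - 2 && dist >= ((uint64_t)1 << b))
		b++;
	hist[b]++;
}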

The first set of tests runs against a clean filesystem and shows
comparable performance and locality between the original and updated
algorithms. The second set prepopulates the filesystem and creates a
sequence of randomly sized free space extents before the test runs. This
test is designed to stress the updated algorithm by presenting a large
set of unique and randomly sized free extents. Not surprisingly, this
test shows slightly degraded performance for the updated algorithm when
compared to the baseline. While this is not quite an apples to apples
comparison, the degradation is significantly milder than what occurs
when the baseline algorithm encounters its worst case free space layout
(refer to the v1 cover letter for details). Note that both variants of
the test were repeated a couple of times, but I'm only including one set
of output for each since the runs were similar.
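
As an aside, one simple way to construct that kind of free space layout
(an assumption on my part, not necessarily the exact mechanism used
here) is to preallocate a large file and punch randomly sized holes into
it, each freed range becoming a distinct free extent:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <linux/falloc.h>

/*
 * Punch nholes randomly placed, randomly sized (4k-1MB) holes into an
 * already-allocated file so the freed ranges form a large set of
 * randomly sized free space extents.
 */
static int
punch_random_holes(int fd, off_t filesz, int nholes)
{
	for (int i = 0; i < nholes; i++) {
		off_t	off = (random() % (filesz / 4096)) * 4096;
		off_t	len = ((random() % 256) + 1) * 4096;

		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      off, len) < 0)
			return -1;
	}
	return 0;
}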

An interesting design note related to the performance results: some
followup experimentation shows that performance is fairly sensitive to
the search key increment aggressiveness implemented in
xfs_alloc_lookup_iter(). This controls how aggressively we search for
ideal locality in the lookup algorithm once we've already found one
suitable maxlen extent. The currently implemented approach is to visit
each set of equal size extents until we have an allocation candidate
and, from that point forward, double the search cursor key to optimize
the scan of the rest of the tree for better locality. A hack to this
logic to exit on the first available extent can turn the slight
performance degradation above into an overall improvement, at the cost
of essentially random locality (i.e., this basically becomes a by-size
allocation). I've played around with this a bit to try to improve the
heuristic and close the gap with the current algorithm, including
considering some form of sysfs knob to tune the behavior (even if just
for debug/testing), but so far I haven't come up with anything
preferable to the simplicity of the current approach that doesn't
negatively affect the locality effectiveness of the allocator.
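
In simplified form, the heuristic amounts to something like this
(next_search_len() is my own illustration, not a function in the
patches):

#include <stdbool.h>
#include <stdint.h>

/*
 * Before any suitable extent has been found, step to the next set of
 * equal-size extents. Once a candidate exists, double the size key so
 * the rest of the cntbt is covered in a logarithmic number of lookups,
 * trading a bit of locality for a bounded search.
 */
static uint64_t
next_search_len(uint64_t cur_len, bool have_candidate)
{
	if (!have_candidate)
		return cur_len + 1;
	return cur_len * 2;
}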

Thoughts, reviews, flames appreciated.

Brian

v3:
- Drop by-size and exact allocation rework bits.
- Add near mode last block scan.
- Add debug mode patch to randomly toggle near mode algos.
- Refactor cursor setup/lookup logic.
- Refactor minlen reverse scan to be common between near mode algos.
- Fix up logic to consistently prioritize extent size over locality.
- Add more useful tracepoints.
- Miscellaneous bug fixes and code/comment cleanups.
v2: https://marc.info/?l=linux-xfs&m=155854834815400&w=2
- Lift small mode refactoring into a separate patch (retained review
  tag(s)).
- Various logic cleanups and refactors.
- Push active flag down into btree cursor private area; eliminate cursor
  container struct.
- Refactor final allocation code. Fold xfs_alloc_ag_vextent_type() into
  caller and factor out accounting.
- Fix up tracepoints.
v1: https://marc.info/?l=linux-xfs&m=155742169729590&w=2
- Continued development (various fixes, refinements) on generic bits and
  near mode implementation.
- Added patches 4-6 to refactor exact, by-size and small allocation
  modes.
rfcv2: https://marc.info/?l=linux-xfs&m=155197946630582&w=2
- Dropped spurious initial refactoring.
- Added minlen functionality.
- Properly tied into near alloc path.
- General refactoring and cleanups.
rfcv1: https://marc.info/?l=linux-xfs&m=154479089914351&w=2

Brian Foster (4):
  xfs: track active state of allocation btree cursors
  xfs: use locality optimized cntbt lookups for near mode allocations
  xfs: randomly fall back to near mode lookup algorithm in debug mode
  xfs: refactor successful AG allocation accounting code

 fs/xfs/libxfs/xfs_alloc.c       | 1156 +++++++++++++++++--------------
 fs/xfs/libxfs/xfs_alloc_btree.c |    1 +
 fs/xfs/libxfs/xfs_btree.h       |    3 +
 fs/xfs/xfs_trace.h              |   43 +-
 4 files changed, 660 insertions(+), 543 deletions(-)

-- 
2.20.1

--- test results

- clean fs, baseline:

 2016: 3537.429: Run took 3536 seconds...
 2016: 3537.431: Per-Operation Breakdown
closefile1           39939ops       11ops/s   0.0mb/s      1.2ms/op      430us/op-cpu [0ms - 1812ms]
fsync1               39944ops       11ops/s   0.0mb/s   5557.3ms/op   688845us/op-cpu [3ms - 429730ms]
writefile1           39999ops       11ops/s 100.4mb/s      6.8ms/op    10280us/op-cpu [0ms - 532ms]
createfile1          40000ops       11ops/s   0.0mb/s      0.5ms/op      814us/op-cpu [0ms - 801ms]
 2016: 3537.431: IO Summary: 159882 ops, 45.211 ops/s, (0/11 r/w), 100.4mb/s,  12807us cpu/op, 5558.1ms latency
 2016: 3537.431: Shutting down processes

actual 40732, ideal 40720, fragmentation factor 0.03%
Note, this number is largely meaningless.
Files on this filesystem average 1.00 extents per file
   from      to extents  blocks    pct
      0       0       6       0   0.01
      1       1      81      81   0.20
      2       3     343     884   0.84
      4       7    1173    6786   2.88
      8      15    3527   35273   8.66
     16      31     488   11400   1.20
     32      63    3153  140510   7.74
     64     127    2567  233234   6.30
    128     255    1175  201550   2.88
    256     511     568  216547   1.39
    512    1023     627  460669   1.54
   1024    2047     831 1357531   2.04
   2048    4095     857 2498448   2.10
   4096    8191     597 3637740   1.47
   8192   16383     516 6196647   1.27
  16384   32767     690 16778626   1.69
  32768   65535    1189 58881806   2.92
  65536  131071    1582 152275877   3.88
 131072  262143    2078 397775535   5.10
 262144  524287    3163 1250504038   7.77
 524288 1048575    2789 2010791315   6.85
1048576 2097151    2956 4678090228   7.26
2097152 4194303    4084 12741930013  10.03
4194304 8191983    2239 11372991216   5.50
8191984 8191984    3453 28286920752   8.48

- clean fs, test:

19362: 3540.426: Run took 3539 seconds...
19362: 3540.428: Per-Operation Breakdown
closefile1           39942ops       11ops/s   0.0mb/s      1.1ms/op      459us/op-cpu [0ms - 1991ms]
fsync1               39945ops       11ops/s   0.0mb/s   5566.4ms/op   692442us/op-cpu [3ms - 464333ms]
writefile1           39999ops       11ops/s 100.3mb/s      6.9ms/op    10198us/op-cpu [0ms - 544ms]
createfile1          40000ops       11ops/s   0.0mb/s      0.4ms/op      675us/op-cpu [0ms - 752ms]
19362: 3540.428: IO Summary: 159886 ops, 45.174 ops/s, (0/11 r/w), 100.3mb/s,  12928us cpu/op, 5567.2ms latency
19362: 3540.428: Shutting down processes

actual 40730, ideal 40720, fragmentation factor 0.02%
Note, this number is largely meaningless.
Files on this filesystem average 1.00 extents per file
   from      to extents  blocks    pct
      0       0       6       0   0.01
      1       1     109     109   0.27
      2       3     338     856   0.83
      4       7    1136    6582   2.79
      8      15    3408   33781   8.37
     16      31     648   14649   1.59
     32      63    3145  142563   7.72
     64     127    2809  258454   6.90
    128     255    1482  255051   3.64
    256     511     502  191520   1.23
    512    1023     654  487802   1.61
   1024    2047    1067 1752771   2.62
   2048    4095    1109 3314767   2.72
   4096    8191     906 5534201   2.22
   8192   16383     649 8100247   1.59
  16384   32767     812 18740413   1.99
  32768   65535     996 46874568   2.45
  65536  131071    1861 181198557   4.57
 131072  262143    2570 480355851   6.31
 262144  524287    1801 663068591   4.42
 524288 1048575    2010 1564391753   4.93
1048576 2097151    3008 4760143526   7.39
2097152 4194303    4485 13773793202  11.01
4194304 8191983    1686 8263583893   4.14
8191984 8191984    3533 28942279472   8.67

- randomized free space, baseline:

 1933: 4053.531: Run took 4052 seconds...
 1933: 4053.533: Per-Operation Breakdown
closefile1           39939ops       10ops/s   0.0mb/s      1.1ms/op      470us/op-cpu [0ms - 1339ms]
fsync1               39939ops       10ops/s   0.0mb/s   6379.4ms/op   677902us/op-cpu [3ms - 376494ms]
writefile1           39997ops       10ops/s  87.6mb/s      6.8ms/op    10159us/op-cpu [0ms - 522ms]
createfile1          40000ops       10ops/s   0.0mb/s      0.7ms/op      887us/op-cpu [0ms - 371ms]
 1933: 4053.533: IO Summary: 159875 ops, 39.452 ops/s, (0/10 r/w),  87.6mb/s,  13219us cpu/op, 6378.8ms latency
 1933: 4053.533: Shutting down processes

actual 82107, ideal 72320, fragmentation factor 11.92%
Note, this number is largely meaningless.
Files on this filesystem average 1.14 extents per file
   from      to extents  blocks    pct
      0       0       2       0   0.00
      1       1      74      74   0.09
      2       3     371     962   0.45
      4       7    1174    6710   1.43
      8      15    3633   36965   4.42
     16      31     859   19650   1.05
     32      63    3244  145941   3.95
     64     127    2945  267753   3.59
    128     255    1940  347233   2.36
    256     511    1537  579226   1.87
    512    1023    2597 1977748   3.16
   1024    2047    4773 7310593   5.81
   2048    4095    8990 27491096  10.95
   4096    8191   16422 100901820  20.00
   8192   16383     816 9837468   0.99
  16384   32767     914 21952573   1.11
  32768   65535    2037 98635533   2.48
  65536  131071    3434 336852501   4.18
 131072  262143    3340 593712009   4.07
 262144  524287    2459 951760115   2.99
 524288 1048575    2808 2116317254   3.42
1048576 2097151    2479 3510197318   3.02
2097152 4194303    1005 2975191867   1.22
4194304 8191983     571 3166999981   0.70
8191984 8191984   13683 112090917072  16.66

- randomized free space, test:

 1837: 4117.481: Run took 4116 seconds...
 1837: 4117.483: Per-Operation Breakdown
closefile1           39937ops       10ops/s   0.0mb/s      1.5ms/op      535us/op-cpu [0ms - 1507ms]
fsync1               39937ops       10ops/s   0.0mb/s   6488.7ms/op   682138us/op-cpu [2ms - 399490ms]
writefile1           40000ops       10ops/s  86.2mb/s      6.8ms/op    10332us/op-cpu [0ms - 524ms]
createfile1          40000ops       10ops/s   0.0mb/s      0.7ms/op      852us/op-cpu [0ms - 1027ms]
 1837: 4117.483: IO Summary: 159874 ops, 38.839 ops/s, (0/10 r/w),  86.2mb/s,  13295us cpu/op, 6487.5ms latency
 1837: 4117.483: Shutting down processes

actual 82311, ideal 72323, fragmentation factor 12.13%
Note, this number is largely meaningless.
Files on this filesystem average 1.14 extents per file
   from      to extents  blocks    pct
      0       0       7       0   0.01
      1       1      95      95   0.12
      2       3     395    1032   0.48
      4       7    1169    6696   1.42
      8      15    3512   35631   4.27
     16      31    1167   25350   1.42
     32      63    3563  161086   4.33
     64     127    3296  299665   4.00
    128     255    2370  426781   2.88
    256     511    2568  977733   3.12
    512    1023    4357 3271492   5.29
   1024    2047    5871 8937475   7.13
   2048    4095    9396 28541302  11.42
   4096    8191   16546 101380257  20.10
   8192   16383     400 4415125   0.49
  16384   32767     206 4922473   0.25
  32768   65535     256 12692379   0.31
  65536  131071     318 30349888   0.39
 131072  262143     467 90335617   0.57
 262144  524287     803 312116185   0.98
 524288 1048575    1238 972993740   1.50
1048576 2097151    2173 3398364897   2.64
2097152 4194303    3594 11074400645   4.37
4194304 8191983    4362 25547937595   5.30
8191984 8191984   14182 116178717088  17.23


