Hi Ted, Everyone,
During our last discussions you mentioned the following (2017/08/16 5:06
SAST/GMT+2):
"One other thought. There is an ext4 block allocator optimization
"feature" which is biting us here. At the moment we have an
optimization where if there is small "hole" in the logical block
number space, we leave a "hole" in the physical blocks allocated to
the file."
You proceeded to provide the example regarding writing of object files
as per binutils (ld specifically).
As per the data I provided previously, rsync (with --sparse) is
generating a lot of "holes" for us, and this optimization then turns
them into physical gaps. As a result I end up with a rather insane
amount of fragmentation:
Blocksize: 4096 bytes
Total blocks: 13153337344
Free blocks: 1272662587 (9.7%)
Min. free extent: 4 KB
Max. free extent: 17304 KB
Avg. free extent: 44 KB
Num. free extent: 68868260
HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :      28472490      28472490    2.24%
    8K...   16K-  :      27005860      55030426    4.32%
   16K...   32K-  :       2595993      14333888    1.13%
   32K...   64K-  :       2888720      32441623    2.55%
   64K...  128K-  :       2745121      62071861    4.88%
  128K...  256K-  :       2303439     103166554    8.11%
  256K...  512K-  :       1518463     134776388   10.59%
  512K... 1024K-  :        902691     163108612   12.82%
    1M...    2M-  :        314858     105445496    8.29%
    2M...    4M-  :         97174      64620009    5.08%
    4M...    8M-  :         22501      28760501    2.26%
    8M...   16M-  :           945       2069807    0.16%
   16M...   32M-  :             5         21155    0.00%
Based on the behavior I notice when watching how rsync works[1], I
strongly suspect that writes are sequential from the start of the file
to the end of the file.
Regarding the above "feature" you further proceeded to mention:
"However, it obviously doesn't do the right thing for rsync --sparse,
and these days, thanks to delayed allocation, so long as binutils can
finish writing the blocks within 30 seconds, it doesn't matter if GNU
ld writes the blocks in a completely random order, since we will only
attempt to do the writeback to the disk after all of the holes in the
.o file have been filled in. So perhaps we should turn off this ext4
block allocator optimization if delayed allocation is enabled (which
is the default these days)."
You mentioned a few pros and cons of this approach, and also that it
won't help my existing filesystem. However, I suspect it might help in
combination with an e4defrag sweep (if that takes a few weeks in the
background, that's fine by me). I also suspect that disabling this
optimization would help avoid future holes, and since my files persist
for anything from a week to a year, I suspect performance would slowly
improve over time as old files get replaced.
I'm also relatively comfortable with making the 30s write limit even
longer (as you pointed out, the files causing the problems are typically
300GB+, even though on average my files are very small), provided that
doing so doesn't introduce additional filesystem corruption risk. Please
also keep in mind that I run anything from 10 to 20 concurrent rsync
instances at any point in time.
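
For reference, a minimal sketch of how the current limit could be
checked. It assumes (my assumption, not something confirmed in this
thread) that the 30s figure corresponds to the vm.dirty_expire_centisecs
sysctl, whose default of 3000 centiseconds is 30 seconds. The function
name and its path parameter are only for illustration.

```python
def dirty_expire_seconds(path="/proc/sys/vm/dirty_expire_centisecs"):
    """Return the age (in seconds) at which dirty pages become eligible
    for writeback, as read from procfs.

    Assumption: this sysctl is the "30 second" limit discussed above.
    """
    with open(path) as f:
        # The file holds a single integer in hundredths of a second.
        return int(f.read().strip()) / 100.0


if __name__ == "__main__":
    print("dirty pages expire after %.1f s" % dirty_expire_seconds())
```

Even with a longer expiry I would expect the dirty-memory thresholds to
force writeback of a 300GB+ file well before it has been fully written,
so this alone is unlikely to be sufficient.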
I would like to attempt such a patch, so if you (or someone else) could
point me to where in the code I should start working on this, I would
really appreciate the help.
Another approach for me may be to simply switch off --sparse, especially
since I'm now unsure of its benefit. I'm guessing that I could do a
sweep of all inodes to determine how much space is really being saved
by this.
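
Such a sweep could look roughly like the sketch below. It compares the
apparent size (st_size) against the allocated size (st_blocks, which is
counted in 512-byte units) for every regular file under a directory. The
function name is made up for illustration, and the result is only an
estimate: allocated space is rounded up to whole blocks, so for small
files it can exceed the apparent size.

```python
import os
import stat


def sparse_savings(root):
    """Return (apparent_bytes, allocated_bytes) for regular files under root."""
    apparent = 0
    allocated = 0
    seen = set()  # (st_dev, st_ino) of hard-linked files already counted
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_nlink > 1:
                # Count each hard-linked inode only once.
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue
                seen.add(key)
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated


if __name__ == "__main__":
    import sys
    a, b = sparse_savings(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("apparent: %d bytes, allocated: %d bytes, difference: %d bytes"
          % (a, b, a - b))
```

If the difference turns out to be small relative to the total, that
would be a reasonable argument for dropping --sparse altogether.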
Kind Regards,
Jaco
[1] My observed behaviour when syncing a file (without --inplace, which
is in my opinion a bad idea in general unless you're severely space
constrained, and then I honestly don't know how this situation would be
affected) is that rsync will create a new file, and then the size of
this file will grow slowly (note: not disk usage, but size as reported
by ls) until it reaches the full size of the file being transferred. At
this point rsync will use rename(2) to replace the old file with the new
one (which is the right approach).
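
To illustrate the write pattern I believe I'm seeing, here is a minimal
sketch. This is not rsync's actual code; the function name, the ".tmp"
suffix and the chunk size are made up for illustration. It writes
strictly from start to end, seeks over all-zero chunks so that they
become holes, and only then renames the temporary file over the
destination.

```python
import os


def sparse_copy_then_rename(src, dst, chunk=64 * 1024):
    """Copy src to dst sequentially, leaving holes for all-zero chunks."""
    tmp = dst + ".tmp"  # hypothetical temporary name; rsync picks its own
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            if buf.count(0) == len(buf):
                # All zeros: advance the offset without writing,
                # which leaves a hole in the logical block space.
                fout.seek(len(buf), os.SEEK_CUR)
            else:
                fout.write(buf)
        # If the file ends in a hole, extend it to the final size.
        fout.truncate(fout.tell())
    # Replace the old file in one step, as with rename(2).
    os.replace(tmp, dst)
```

Every seek in this loop creates a logical hole of the kind that the
allocator optimization then mirrors as a gap in the physical blocks.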