On 11/14/2016 02:27 AM, Dave Chinner wrote:
On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
[ Long stalls from xfs_reclaim_inodes_ag ]
XFS has *1* tunable that can change the behaviour of metadata
writeback. Please try it.
[ weeks go by, so this email is insanely long ]
Testing all of this was slow going because two of the three test
boxes I had with the hadoop configuration started having hardware
problems. The good news is that while I was adjusting the
benchmark, we lined up access to a bunch of duplicate boxes, so I
can now try ~20 different configurations in parallel.
My rough benchmark is here:
git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
The command line I ended up using was:
simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /hammer/*
There's a lightly tested patch below that should do the trick.
After 5 minutes on a modified simoop cli on a single filesystem,
4.9-rc4+for-next:
$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
....
Run time: 300 seconds
Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)
With the patch below:
Run time: 300 seconds
Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)
Stall rate went to zero and stayed there at the 120s mark of the
warmup. Note the p99 difference for read and allocation latency,
too.
I'll post some graphs tomorrow from my live PCP telemetry that
demonstrate the difference in behaviour better than any words...
Thanks Dave, this is definitely better. But at least for the multi-disk
setup, it's not quite there yet.
Your patch:
___
Run time: 15535 seconds
Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432)
Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504)
Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)
work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87)
alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74) (p99: 22.04)
Original patch:
___
Run time: 15474 seconds
Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328)
Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120)
Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)
work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58)
alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68) (p99: 25.00)
v4.8
___
Run time: 15492 seconds
Read latency (p50: 22,642,688) (p95: 35,848,192) (p99: 43,712,512)
Write latency (p50: 21,200,896) (p95: 35,454,976) (p99: 43,450,368)
Allocation latency (p50: 12,599,296) (p95: 34,144,256) (p99: 41,615,360)
work rate = 9.77/sec (avg 8.15/sec) (p50: 8.37) (p95: 9.29) (p99: 9.55)
alloc stall rate = 8.33/sec (avg: 33.65) (p50: 34.52) (p95: 37.40) (p99: 37.96)
One thing that might have been too far buried in my email yesterday:
the read/write latencies include the time to start threads, so that's
not just IO in there.
I've had this running all day, but the differences stabilized after 5-10
minutes. Everyone's p99s trend higher as the day goes on, but the
percentage difference stays about the same.
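
For anyone who wants to compare runs the same way, here is a rough sketch
of how the percentage differences can be pulled out of the simoop summary
lines. The three allocation latency lines are copied from the results
above; the helper names (parse_percentiles, percent_change) and the
variable names are made up for illustration and are not part of simoop.

```python
import re

# Matches fields such as "(p50: 12,599,296)" or "(p99: 22.04)".
FIELD = re.compile(r"\((p\d+):\s*([\d,.]+)\)")


def parse_percentiles(line):
    """Return {'p50': value, 'p95': value, 'p99': value} for one summary line."""
    return {name: float(value.replace(",", "")) for name, value in FIELD.findall(line)}


def percent_change(baseline, candidate):
    """Signed change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Allocation latency lines copied from the three runs in this email.
V48 = "Allocation latency (p50: 12,599,296) (p95: 34,144,256) (p99: 41,615,360)"
DAVE_PATCH = "Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)"
ORIGINAL_PATCH = "Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)"

baseline = parse_percentiles(V48)
for label, line in (("Dave's patch", DAVE_PATCH), ("original patch", ORIGINAL_PATCH)):
    run = parse_percentiles(line)
    for pct in ("p50", "p95", "p99"):
        change = percent_change(baseline[pct], run[pct])
        print(f"{label:15s} {pct}: {change:+.1f}% vs v4.8")
```

A negative number means lower latency than v4.8. The same helpers should
work on the read and write latency lines, since they use the same
"(pNN: value)" layout.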
I think the difference between mine and yours is we didn't quite get the
allocation stalls down to zero, so making tasks wait for the shrinker
shows up in the end numbers.
For your patch, the stalls from xfs_reclaim_inodes_ag() were about the
same as the unpatched kernel yesterday. We still had long tails in the
30 second+ category.
I did a trace on vmscan:mm_vmscan_direct_reclaim_begin, and 91% of the
allocation stalls were:
order=0 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_ZERO
classzone_idx=3
These are all page faults, either during read() syscalls or when simoop
was touching all the pages in its mmap()'d area (-M from the cmdline).
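
A breakdown like the one above can be reproduced from saved trace text
with something along these lines. This is only a sketch: the sample lines
are invented for illustration (they are not from my trace), the task/CPU/
timestamp prefix is an assumption about how the trace text is formatted,
and the function names are hypothetical. Only the field names come from
the tracepoint output quoted above.

```python
import re
from collections import Counter

EVENT = "mm_vmscan_direct_reclaim_begin:"
KEYS = ("order", "may_writepage", "gfp_flags", "classzone_idx")


def signature(line):
    """Return (order, may_writepage, gfp_flags, classzone_idx), or None for other lines."""
    if EVENT not in line:
        return None
    # Everything after the event name is a run of key=value pairs.
    fields = dict(re.findall(r"(\w+)=(\S+)", line.split(EVENT, 1)[1]))
    return tuple(fields.get(key) for key in KEYS)


def signature_shares(lines):
    """Map each distinct signature to its fraction of all matching events."""
    counts = Counter(sig for sig in map(signature, lines) if sig is not None)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()} if total else {}


# Invented sample input: three page-fault style stalls and one other allocation.
SAMPLE = [
    "simoop-101 [000] .... 10.1: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_ZERO classzone_idx=3",
    "simoop-102 [001] .... 10.2: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_ZERO classzone_idx=3",
    "simoop-103 [002] .... 10.3: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_KERNEL classzone_idx=2",
    "simoop-104 [003] .... 10.4: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_ZERO classzone_idx=3",
    "simoop-104 [003] .... 10.5: some_other_event: foo=1",
]

for sig, share in sorted(signature_shares(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(f"{share:6.1%}  " + " ".join(f"{k}={v}" for k, v in zip(KEYS, sig)))
```

Grouping on the full set of fields rather than just gfp_flags keeps
stalls with the same flags but a different order or zone from being
lumped together.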
One detail I didn't give yesterday was these are all using deadline as
the IO scheduler.
-chris