On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote: > On 11/14/2016 02:27 AM, Dave Chinner wrote: > >On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote: > >>On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote: > >> > >>[ Long stalls from xfs_reclaim_inodes_ag ] > >> > >>>XFS has *1* tunable that can change the behaviour of metadata > >>>writeback. Please try it. > >> > >>[ weeks go by, so this email is insanely long ] > >> > >>Testing all of this was slow going because two of the three test > >>boxes I had with the hadoop configuration starting having hardware > >>problems. The good news is that while I was adjusting the > >>benchmark, we lined up access to a bunch of duplicate boxes, so I > >>can now try ~20 different configurations in parallel. > >> > >>My rough benchmark is here: > >> > >>git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git > >> > >>The command line I ended up using was: > >> > >>simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i > >>60 -w 300 -D 2 /hammer/* > > > >There's a lightly tested patch below that should do the trick. > > > >After 5 minutes on a modified simoop cli on a single filesystem, > >4.9-rc4+for-next: > > > >$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch > >.... > >Run time: 300 seconds > >Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168) > >Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056) > >Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800) > >work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99) > >alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12) > > > >With the patch below: > > > >Run time: 300 seconds > >Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400) > >Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752) > >Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232) > >work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93) > >alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01) > > > >Stall rate went to zero and stayed there at the 120s mark of the > >warmup. Note the p99 difference for read and allocation latency, > >too. > > > >I'll post some graphs tomorrow from my live PCP telemetry that > >demonstrate the difference in behaviour better than any words... > > Thanks Dave, this is definitely better. But at least for the > multi-disk setup, it's not quite there yet. I can see why now I've run the test for half an hour to get a decent idea of steady state behaviour. I'll come back to that.... > Your patch: > ___ > Run time: 15535 seconds > Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432) > Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504) > Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208) > work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87) > alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74) > (p99: 22.04) > > Original patch: > ___ > Run time: 15474 seconds > Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328) > Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120) > Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968) > work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58) > alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68) > (p99: 25.00) Ok, so stall rate is down, but the latencies are still garbage because stalls are happening. > I've had this running all day, but the differences stabilized after > 5-10 minutes. Everyone's p99s trend higher as the day goes on, but > the percentage difference stays about the same. Right - they stabilise after the first few stalls because they dominate the p99 numbers. For example, with my patch, I saw the first stall event between 1080 and 1140 seconds - you can see this clearly in the 3rd chart: Run time: 1080 seconds Read latency (p50: 3,035,136) (p95: 4,284,416) (p99: 5,349,376) Write latency (p50: 27,623,424) (p95: 31,031,296) (p99: 45,154,304) Allocation latency (p50: 240,384) (p95: 709,632) (p99: 1,308,672) work rate = 3.75/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04) alloc stall rate = 0.65/sec (avg: 0.04) (p50: 0.00) (p95: 0.03) (p99: 0.03) du thread is done /mnt/scratch ___ Run time: 1140 seconds Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424) Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376) Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960) work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04) alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12) du thread is running /mnt/scratch So a single "stall event" blows out the p99 latencies really badly. This is probably the single most important revelation about this so far... > I think the difference between mine and yours is we didn't quite get > the allocation stalls down to zero, so making tasks wait for the > shrinker shows up in the end numbers. Right, but so far we haven't answered the obvious question: what triggers the stall events? > For your patch, the stalls from xfs_reclaim_inodes_ag() were about > the same as the unpatched kernel yesterday. We still had long tails > in the 30 second+ category. Because when a stall happens, it's doing more than just trashing the inode cache. The 4 charts I've attached in a tarball are: Chart 1: unpatched kernel, inode/dentry/xfs_inode cache usage Chart 2: unpatched kernel, memory usage Chart 3: unpatched kernel, inode/dentry/xfs_inode cache usage Chart 4: unpatched kernel, memory usage These are taken with a 5s sample period, so we're looking at overall trends, not fine grained behaviour. Things to note: - discontinuities in the graphs indicate an unresponsive machine and samples were lost. Good sign that the machine is stalling on memory allocation - dentry/vfs inode counts come from /proc/sys/fs/{dentry,inode}_state - memory usage chart: "other" is userspace and kernel memory other than the page cache ("cached"). - cached + other + free = 16GB (all memory) - "slab" is a subset of "other", but is split out above 16GB so i can easily see salb cache memory usage - "dirty" is a subset of "cached", but is split out so I can see how much of the page cache is dirty. Now, with that in mind, let's compare the graphs. Let's start with the inode/dentry/xfs_inode cache usage. We can see that the unpatched kernel has rapid rises and falls in cache footprint. The vertical falling edges are evidence of allocation stalls. You can see that it's somewhat regular - the lines stay roughly level, then we see a deep, sharp valley where the caches are completely freed and then immediately repopulate. This generally co-incides with periodic "du scan" starting - it adds new memory pressure, tipping the kernel into memory reclaim, which immediately stalls and trashes the cache. Given that there is now lots of memory free, the du runs at several thousand IOPS (SSDs, remember) and immediately refills the cache and finishes, producing no more memory pressure. The cache then sits at the "full" level until du starts again and we fall off the cliff again. If we switch to the memory usage graph for the unpatched kernel (chart 2), we can see that allocation stalls trash more than just the slab caches - they clean all memory and free all the page cache, too. i.e. it completely trashes any working set that we might have. This is not caused by the shrinker - it has no effect on page cache reclaim. It appears that page reclaim code is, for some reason, agressively freeing the page cache for multiple seconds, even after there is already noticable free memory. This is probably an issue with there being too much direct reclaim stalling due to pressure (e.g. on dirty pages), and then when the cork is pulled we've got lots of threads that then run a full direct reclaim scan and that empties the caches... /me has always said unbound concurrency in memory reclaim and IO writeback is a bad thing because you cannot control and limit behaviour in overload situations like this. Anyway, lets look at the patched kernel inode cache behaviour - chart 3. The first thing to note is that there is a very different pattern - there's a very regular triangular waveform. The trough occurs when the du thread starts, and the peak occurs when the du thread stops. IOWs, the inode cache is no longer retaining single use inodes in the cache for extended periods. The waveform is neat and regular, and it's very clear where the allocation stall at ~1100s occurred - there's a reclaim cliff. If we look at chart 4 right now - the patched kernel memory usage - we can see there is a big dip in the page cache, too. So when an allocation stall event occurs, the bad behaviour is still there. However, the way the system responds to a stall event is very different - it doesn't immediately refill the inode cache. It goes back to the triangle waveform until the next stall occurs. What is particularly interesting in chart 3 is the vfs.inodes.count and xfs_inode slab metrics (red and pink lines respectively). The active xfs_inode slab objects starts off below the vfs inode count, then over time 6 cycles xfs_inode count slowly increases above th vfs inode count. Then we get an allocation stall event, and it returns the xfs_inode count to less than the vfs inode count. The second stall even occurs at a similar point. The xfs_inode > vfs inode count values indicate that we've got XFs inodes in the reclaim state and that reclaim of XFs inodes is not keeping up with the VFS inode cache reclaim rate. This is indicative of either dirty inodes being reclaimed, or we are not putting enough pressure on the XFS inode cache and so it's putting additional pressure on other parts of memory reclaim. This change in balance seems to correlate to allocation stalls being triggered, so I'm going to try to remove that imbalance and see if that avoids the allocation stalls. IOWs, I'm not using latency stats to guide me - they tell me nothing useful about what causes the latency. The charts tell me when there is a reclaim issue - I can see it clear as anything. If the graphs are smooth, maintain a consistent relationship to each other and behave nicely when there are step changes in behaviour then everything else will take care of itself. It's easy to get lost in creating fancy histograms with shiny new tools like ebpf/bcc, but this is mostly wasted effort when it comes to finding the cause of the problem. Latency histograms don't help us understand why the latencies occur, nor do they help track down the cause. The charts I have show that there is an obvious, one-off, large scale memory reclaim event that is causing the stalls and bad latencies. Finding the trigger for that large scale event is what matters, not pretty histograms of it's effect... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx
Attachment:
charts.tar.gz
Description: application/gzip