On Fri, Feb 02, 2018 at 02:16:39PM +0000, Robert Harris wrote:
> I was planning to annotate the opaque calculation in
> __fragmentation_index() but on closer inspection I think there may be a
> bug. I could use some feedback.
>
> Firstly, for the case of fragmentation and ignoring the scaling,
> __fragmentation_index() purports to return a value in the range 0 to 1.
> Generally, however, the lower bound is actually 0.5. Here's an
> illustration using a zone that I fragmented with selective calls to
> __alloc_pages() and __free_pages --- the fragmentation for order-1 could
> not be minimised further yet is reported as 0.5:
>
> # head -1 /proc/buddyinfo
> Node 0, zone DMA 1983 0 0 0 0 0 0 0 0 0 0
> # head -1 /sys/kernel/debug/extfrag/extfrag_index
> Node 0, zone DMA -1.000 0.500 0.750 0.875 0.937 0.969 0.984 0.992 0.996 0.998 0.999
> #
>
> This is significant because 0.5 is the default value of
> sysctl_extfrag_threshold, meaning that compaction will not be suppressed
> for larger blocks when memory is scarce rather than fragmented. Of
> course, sysctl_extfrag_threshold is a tuneable so the first question is:
> does this even matter?
>

It's now 8 years since it was written so my memory is rusty. While the
bounds could be adjusted, doing so is not without risk. The bounds and
the sysctl were left as-is to avoid the possibility of excessive
reclaim, something early implementations suffered from badly. At the
time of implementation, the index was used as a rough estimate for
monitoring purposes, but on an allocation failure it was always page
reclaim that was used to try the allocation again. At a later time,
compaction was introduced to avoid excessive reclaim, but the cutoff was
set so that it only triggers under extreme memory shortage (and the
bounds should have been corrected at the time but were not). It was a
long time before all the excessive reclaim bugs in kswapd were ironed
out, and reports of a runaway kswapd at 100% CPU usage were common for a
while.
There were also several problems with compaction overhead that were
addressed by other means. It may have reached the point where revisiting
the sysctl is potentially safe given that reclaim is considerably better
than it used to be.

> meaning that a very severe shortage of free memory *could* tip the
> balance in favour of "low fragmentation". Although this seems highly
> unlikely to occur outside testing, it does reflect the directive in the
> comment above the function, i.e. favour page reclaim when fragmentation
> is low. My second question: is the current implementation of F is
> intentional and, if not, what is the actual intent?
>

It's intentional but could be fixed to give a real bound of 0 to 1
instead of half the range as it currently gives. The
sysctl_extfrag_threshold should also be adjusted at that time. After
that, the real work is determining whether it's safe to strike a balance
between reclaim and compaction that avoids unnecessary compaction while
not being too aggressive about reclaim or having kswapd enter a runaway
loop, reintroducing the "kswapd stuck at 100% CPU time" problems.

Alternatively, delete references to it entirely as the cutoff is not
really being used and the monitoring information is too specialised to
be of general use.

> The comments in compaction_suitable() suggest that the compaction/page
> reclaim decision is one of cost but, as compaction is linear, this isn't
> what __fragmentation_index() is calculating.

The index was not intended as an estimate of the cost of compaction. It
was originally intended to act as an estimator of whether it's better to
spend time reclaiming or compacting. Compacting was favoured on the
grounds that high-order allocations were meant to be able to fail,
whereas reclaiming potentially useful data could have other
consequences.

-- 
Mel Gorman
SUSE Labs
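The 0.500 reported for order-1 in the quoted extfrag_index output can be
reproduced in user space. The sketch below restates the index arithmetic
as I recall it from __fragmentation_index() in mm/vmstat.c; treat the
exact expression as an assumption and check it against your tree. The
function name frag_index() and its flat argument list are hypothetical
stand-ins for the kernel's struct contig_page_info, and the kernel's
order sanity check is omitted.

```c
#include <stdint.h>

/*
 * User-space sketch of the fragmentation index arithmetic (assumed to
 * mirror __fragmentation_index(); verify against mm/vmstat.c).
 *
 * Returns the index scaled by 1000, or -1000 when a suitable free
 * block exists and the request would not fail at all.
 */
static int frag_index(unsigned int order, uint64_t free_pages,
		      uint64_t free_blocks_total,
		      uint64_t free_blocks_suitable)
{
	uint64_t requested = 1ULL << order;

	/* No free blocks at all: nothing to measure. */
	if (!free_blocks_total)
		return 0;

	/* The index only makes sense when the request would fail. */
	if (free_blocks_suitable)
		return -1000;

	/* Integer division truncates at both steps, as div_u64() would. */
	return 1000 - (int)((1000 + free_pages * 1000ULL / requested) /
			    free_blocks_total);
}
```

With 1983 free order-0 pages and nothing larger, free_pages equals
free_blocks_total, so the expression is roughly 1 - 1/2^order:
frag_index(1, 1983, 1983, 0) gives 500 and frag_index(2, 1983, 1983, 0)
gives 750, matching the quoted output. Order-1 cannot drop much below
0.5 however the free pages are arranged, which is the lower bound
Robert describes.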