On Wed, Aug 23, 2017 at 10:12:14AM +0200, Vlastimil Babka wrote:
> On 08/23/2017 07:36 AM, Joonsoo Kim wrote:
> > On Mon, Aug 21, 2017 at 10:10:14AM -0400, Johannes Weiner wrote:
> >> On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
> >>> On Thu, 27 Jul 2017, Vlastimil Babka wrote:
> >>>
> >>>> As we discussed at last LSF/MM [1], the goal here is to shift more
> >>>> compaction work to kcompactd, which currently just makes a single
> >>>> high-order page available and then goes to sleep. The last patch,
> >>>> evolved from the initial RFC [2], does this by recording for each
> >>>> order > 0 how many allocations would potentially have been able to
> >>>> skip direct compaction, had the memory not been fragmented.
> >>>> Kcompactd then tries to compact for as long as it takes to make
> >>>> that many allocations satisfiable. This approach avoids any hooks
> >>>> in allocator fast paths. There are more details to this, see the
> >>>> last patch.
> >>>
> >>> I think I would have liked to have seen "less proactive" :)
> >>>
> >>> Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT,
> >>> so it continues until it can defragment memory. On a host with
> >>> 128GB of memory and 100GB of it sitting in a hugetlb pool, we
> >>> constantly get kcompactd wakeups for order-2 memory allocation.
> >>> The stats are pretty bad:
> >>>
> >>> compact_migrate_scanned   2931254031294
> >>> compact_free_scanned    102707804816705
> >>> compact_isolated             1309145254
> >>>
> >>> 0.0012% of memory scanned is ever actually isolated. We constantly
> >>> see very high cpu for compaction_alloc() because kcompactd is
> >>> almost always running in the background and iterating most memory
> >>> completely needlessly (define needless as 0.0012% of memory
> >>> scanned being isolated).
> >>
> >> The free page scanner will inevitably wade through mostly used
> >> memory, but 0.0012% is lower than what systems usually have free.
> >> I'm guessing this is because of concurrent allocation & free cycles
> >> racing with the scanner? There could also be an issue with how we
> >> do partial scans.
> >>
> >> Anyway, we've also noticed scalability issues with the current
> >> scanner on 128G and 256G machines. Even with better efficiency -
> >> finding the 1% of free memory - that's still a ton of linear search
> >> space.
> >>
> >> I've been toying around with the below patch. It adds a free page
> >> bitmap, allowing the free scanner to quickly skip over the vast
> >> areas of used memory. I don't have good data on skip-efficiency at
> >> higher uptimes and the resulting fragmentation yet. The overhead
> >> added to the page allocator is concerning, but I cannot think of a
> >> better way to make the search more efficient. What do you guys
> >> think?
> >
> > Hello, Johannes.
> >
> > I think that the best solution is for compaction not to do a linear
> > scan at all. Vlastimil has already suggested that idea:
>
> I was going to bring this up here, thanks :)
>
> > mm, compaction: direct freepage allocation for async direct
> > compaction
> >
> > lkml.kernel.org/r/<1459414236-9219-5-git-send-email-vbabka@xxxxxxx>
> >
> > It uses the buddy allocator to get a freepage, so there is no linear
> > scan. It would completely remove the scalability issue.
>
> Another big advantage is that the migration scanner would get to see
> the whole zone, and not be biased towards the first 1/3 until it
> meets the free scanner. And another advantage is that we wouldn't be
> splitting free pages needlessly.
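To make that concrete for readers who haven't seen the patch, the shape
of the idea is something like the below. This is a rough illustrative
sketch only, not Vlastimil's actual patch -- the function name and the
pageblock check are made up here; alloc_page(), __free_page() and
cc->migrate_pfn are the existing kernel interfaces:

/*
 * Rough sketch, in mm/compaction.c context: get a migration target
 * straight from the buddy allocator instead of linearly scanning the
 * zone for free pages.
 */
static struct page *compaction_alloc_direct(struct compact_control *cc)
{
	struct page *page;

	/* No linear scan: just ask the buddy allocator for a page. */
	page = alloc_page(GFP_KERNEL | __GFP_NOWARN);
	if (!page)
		return NULL;

	/*
	 * The allocator may hand back a page from the very pageblock
	 * we are migrating away from, which would undo the compaction
	 * work (the concern raised below). Give it back and fail.
	 */
	if (page_to_pfn(page) >> pageblock_order ==
	    cc->migrate_pfn >> pageblock_order) {
		__free_page(page);
		return NULL;
	}

	return page;
}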
> > Unfortunately, he applied this idea only to async compaction, since
> > changing the other compaction modes would probably cause long-term
> > fragmentation. And I disagreed with that idea at the time, since
> > different compaction logic for different compaction modes would
> > make the system more unpredictable.
> >
> > I doubt long-term fragmentation is a real issue in practice. We
> > lose too much trying to prevent long-term fragmentation. I think
> > it's time to fix the real issues (yours and David's) by giving up
> > on the solution for long-term fragmentation.
>
> I'm now also more convinced that this direction should be pursued,
> and wanted to get to it after the proactive kcompactd part. My
> biggest concern is that freelists can give us pages from the same
> block that we (or somebody else) are trying to compact (migrate
> away). Isolating (i.e. MIGRATE_ISOLATE) the block first would work,
> but the overhead of the isolation could be significant. But I have
> some alternative ideas that could be tried.
>
> > If someone doesn't agree with the above solution, your approach
> > looks the second best to me. Though, there is something to
> > optimize.
> >
> > I think that we don't need to track the pageblock's freepage state
> > precisely. Compaction is a far rarer event than page allocation, so
> > compaction can tolerate false positives.
> >
> > So, my suggestion is:
> >
> > 1) Use 1 bit per pageblock. Reusing PB_migrate_skip looks the best
> > to me.
>
> Wouldn't the reuse cripple the original use for the migration
> scanner?

I think there is no serious problem. A problem happens only if we set
PB_migrate_skip wrongly. Consider the following two cases that set
PB_migrate_skip:

1) The migration scanner finds that all pages in the pageblock are
pinned -> set skip -> it is cleared after one of the pages is freed.
No problem. There is a possibility that a temporarily pinned page is
unpinned and we then miss this pageblock, but that would be a minor
case.

2) The migration scanner finds that all pages in the pageblock are
free -> set skip -> we can miss the pageblock for a long time.

We need to fix case 2) in order to reuse PB_migrate_skip. I guess that
just counting the number of freepages in isolate_migratepages_block()
and taking that into account before setting PB_migrate_skip would
work.

> > 2) Clear PB_migrate_skip only in the free path, and only when
> > needed. Set it in compaction when the freepage scan fails in that
> > pageblock. In compaction, skip the pageblock if PB_migrate_skip is
> > set: it means that there is no freepage in the pageblock.
> >
> > Following is some code about my suggestion.
>
> Otherwise it sounds like it could work until the direct allocation
> approach is fully developed (or turns out to be infeasible).

Agreed. Thanks.
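P.S. Since the code got snipped in the quoting above, here is the gist
of suggestion 2) in rough form. This is a re-sketch for illustration,
not the exact code from the earlier mail -- the helper names and hook
placement are made up; get/set/clear_pageblock_skip() are the existing
PB_migrate_skip accessors:

/*
 * Rough sketch: PB_migrate_skip reused to mean "no freepage in this
 * pageblock" for the free scanner.
 */

/* Free path: a page was just freed into this pageblock, so the free
 * scanner should look at it again. Write only on transition to keep
 * the allocator fast-path overhead low. */
static inline void freepage_clear_skip(struct page *page)
{
	if (get_pageblock_skip(page))
		clear_pageblock_skip(page);
}

/* Free scanner: nothing could be isolated in this pageblock, so mark
 * it to be skipped until a page is freed into it again. */
static inline void freescan_mark_no_freepage(struct page *page)
{
	set_pageblock_skip(page);
}

The free scanner would check get_pageblock_skip() before scanning a
pageblock. The tracking doesn't have to be precise: in the worst case
we do one useless pass over a pageblock, or miss its freepages until
the next free in that block clears the bit.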