On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > Hi Mel,
>
> Hi,
>
> > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > that increasing the pagesize like what Andrea suggested would lead to
> > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> >
> > The config_page_shift guarantees that kernel stacks or whatever other
> > not-defragmentable allocations go into the same 64k "not defragmentable"
> > page. Not like with the SGI design, where an 8k kernel stack could be
> > allocated in the first 64k page, and then another 8k stack could be
> > allocated in the next 64k page, effectively pinning all 64k pages until
> > Nick's worst case scenario triggers.
>
> In practice, it's pretty difficult to trigger. Buddy allocators always
> try and use the smallest possible sized buddy to split. Once a 64K is
> split for a 4K or 8K allocation, the remainder of that block will be
> used for other 4K, 8K, 16K, 32K allocations. The situation where
> multiple 64K blocks get split does not occur.
>
> Now, the worst case scenario for your patch is that a hostile process
> allocates a large amount of memory and mlocks() one 4K page per 64K
> chunk (this is unlikely in practice, I know). The end result is you have
> many 64KB regions that are now unusable because 4K is pinned in each of
> them. Your approach is not immune from problems either. To me, only
> Nick's approach is bullet-proof in the long run.

One important thing, I think, in Andrea's case: the memory will be accounted
for (eg. we can limit mlock, or work within various memory accounting
things -- see the sketch at the end of this mail).

With fragmentation, I suspect it will be much more difficult to do this. It
would be another layer of heuristics that will also inevitably go wrong at
times if you try to limit how much "fragmentation" a process can do. Quite
likely it is hard to make something even work reasonably well in most cases.

> > We can still try to save some memory by
> > defragging the slab a bit, but it's by far *not* required with
> > config_page_shift. No defrag at all is required, in fact.
>
> You will need some sort of defragmentation to deal with internal
> fragmentation. It's a very similar problem to blasting away at slab
> pages and still not being able to free them because objects are in use.
> Replace "slab" with "large page" and "object" with "4k page" and the
> issues are similar.

Well, yes, and slab has issues today too with internal fragmentation,
targeted reclaim and some (small) higher order allocations. But at least
with config_page_shift, you don't introduce _new_ sources of problems
(eg. coming from pagecache or other allocs).

Sure, there are some other things -- like pagecache can actually use up
more memory instead -- but there are a number of other positives that
Andrea's approach has as well. It is using order-0 pages, which are first
class throughout the VM; they have per-cpu queues, and do not require any
special reclaim code. They also *actually do* reduce the page management
overhead in the general case, unlike higher order pcache.

So combined with the accounting issues, I think it is unfair to say that
Andrea's approach is just moving the fragmentation to internal. It has a
number of upsides. I have no idea how it will actually behave and perform,
mind you ;)

> > Plus there's a cost in defragging and freeing cache... the more you
> > need defrag, the slower the kernel will be.
> >
> > > approach in depth.
> >
> > Well it wasn't my fault if we didn't discuss it in depth though.
> > If it's my fault, sorry about that. It wasn't my intention.

I think it did get brushed aside a little quickly too (not blaming anyone).
Maybe because Linus was hostile. But *if* the idea is that page management
overhead has or will become a problem that needs fixing, then neither
higher order pagecache, nor (obviously) fsblock, fixes this properly.
Andrea's approach most definitely has the potential to.
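
As an aside, to make Mel's buddy-splitting point above a bit more concrete,
here is a toy model of it. This is just my own illustrative sketch (nothing
to do with the real mm/page_alloc.c), assuming a 4K base page and a 64K
maximum block: an allocator that always splits the smallest sufficient free
block carves all sixteen 4K allocations out of one 64K block before it ever
touches the second one.

/*
 * Toy model of buddy splitting (illustrative only, not kernel code).
 * Free lists are indexed by order; 4K = order 0, 64K = order 4.
 * An allocation takes the smallest free block that fits and splits it
 * down, returning one buddy half to the free list at each step, so a
 * second 64K block is not broken up until the first is used up.
 */
#include <stdio.h>

#define MAX_ORDER 4                     /* order 4 == 64K with 4K pages */

static int free_count[MAX_ORDER + 1];   /* free blocks at each order */

static int alloc_block(int order)
{
    int o;

    /* Find the smallest free block that can satisfy the request. */
    for (o = order; o <= MAX_ORDER; o++)
        if (free_count[o] > 0)
            break;
    if (o > MAX_ORDER)
        return -1;                      /* out of memory in this model */

    free_count[o]--;
    /* Split down, freeing one buddy half at each intermediate order. */
    while (o > order) {
        o--;
        free_count[o]++;
    }
    return 0;
}

int main(void)
{
    int i;

    free_count[MAX_ORDER] = 2;          /* start with two free 64K blocks */

    /* Sixteen 4K allocations are all carved out of the first 64K block. */
    for (i = 0; i < 16; i++)
        alloc_block(0);

    printf("64K blocks still intact: %d\n", free_count[MAX_ORDER]); /* 1 */
    return 0;
}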
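
And the worst case Mel describes -- pinning one 4K page in every 64K chunk --
together with why I think the mlock accounting bounds it, roughly looks like
this from userspace. Again purely an illustrative sketch, not something from
this thread; the 4K/64K sizes and the 256MB mapping are assumptions, and the
only real point is that every pinned page is charged against RLIMIT_MEMLOCK,
so an unprivileged process can only tie down a bounded number of 64K blocks.

/*
 * Hostile mlock sketch: pin one 4K page in every 64K-aligned chunk of a
 * large anonymous mapping.  Under a 64K anti-fragmentation scheme, each
 * pinned page would keep its surrounding 64K block from being reassembled.
 * Every locked page is charged against RLIMIT_MEMLOCK, which is what
 * bounds the damage an ordinary user can do.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define PAGE_SZ   4096UL
#define CHUNK_SZ  (64 * 1024UL)
#define MAP_SZ    (256 * 1024 * 1024UL)     /* 256MB anonymous mapping */

int main(void)
{
    struct rlimit rl;
    unsigned long off, pinned = 0;
    char *p;

    /* mlock() is charged against RLIMIT_MEMLOCK for ordinary users. */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("RLIMIT_MEMLOCK soft limit: %lu bytes\n",
               (unsigned long)rl.rlim_cur);

    p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Pin the first 4K page of every 64K chunk; mlock() faults it in. */
    for (off = 0; off < MAP_SZ; off += CHUNK_SZ) {
        if (mlock(p + off, PAGE_SZ) != 0)
            break;                  /* limit hit: damage bounded here */
        pinned++;
    }

    printf("pinned %lu pages, potentially blocking %lu KB worth of 64K blocks\n",
           pinned, pinned * CHUNK_SZ / 1024);
    return 0;
}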