On (11/09/07 12:26), Nick Piggin didst pronounce: > On Wednesday 12 September 2007 04:31, Mel Gorman wrote: > > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote: > > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote: > > > > that increasing the pagesize like what Andrea suggested would lead to > > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's > > > > > > The config_page_shift guarantees the kernel stacks or whatever not > > > defragmentable allocation other allocation goes into the same 64k "not > > > defragmentable" page. Not like with SGI design that a 8k kernel stack > > > could be allocated in the first 64k page, and then another 8k stack > > > could be allocated in the next 64k page, effectively pinning all 64k > > > pages until Nick worst case scenario triggers. > > > > In practice, it's pretty difficult to trigger. Buddy allocators always > > try and use the smallest possible sized buddy to split. Once a 64K is > > split for a 4K or 8K allocation, the remainder of that block will be > > used for other 4K, 8K, 16K, 32K allocations. The situation where > > multiple 64K blocks gets split does not occur. > > > > Now, the worst case scenario for your patch is that a hostile process > > allocates large amount of memory and mlocks() one 4K page per 64K chunk > > (this is unlikely in practice I know). The end result is you have many > > 64KB regions that are now unusable because 4K is pinned in each of them. > > Your approach is not immune from problems either. To me, only Nicks > > approach is bullet-proof in the long run. > > One important thing I think in Andrea's case, the memory will be accounted > for (eg. we can limit mlock, or work within various memory accounting things). > For mlock()ed sure. Not for pagetables though, kmalloc slabs etc. It might be a non-issue as well. Like the large block patches, there are aspects of Andrea's case that we simply do not know. > With fragmentation, I suspect it will be much more difficult to do this. It > would be another layer of heuristics that will also inevitably go wrong > at times if you try to limit how much "fragmentation" a process can do. > Quite likely it is hard to make something even work reasonably well in > most cases. Regreattably, this is also woefully difficult to prove. For fragmentation, I can look into having a more expensive version of /proc/pagetypeinfo to give a detailed account of the current fragmentation state but it's a side-issue. > > > We can still try to save some memory by > > > defragging the slab a bit, but it's by far *not* required with > > > config_page_shift. No defrag at all is required infact. > > > > You will need to take some sort of defragmentation to deal with internal > > fragmentation. It's a very similar problem to blasting away at slab > > pages and still not being able to free them because objects are in use. > > Replace "slab" with "large page" and "object" with "4k page" and the > > issues are similar. > > Well yes and slab has issues today too with internal fragmentation, > targetted reclaim and some (small) higher order allocations too today. > But at least with config_page_shift, you don't introduce _new_ sources > of problems (eg. coming from pagecache or other allocs). > Well, we do extend the internal fragmentation problem. Previous, it was inode, dcache and friends. Now we have to deal with internal fragmentation related to page tables, per-cpu pages etc. Maybe they can be solved too, but they are of similar difficulty to what Christoph faces. > Sure, there are some other things -- like pagecache can actually use > up more memory instead -- but there are a number of other positives > that Andrea's has as well. It is using order-0 pages, which are first class > throughout the VM; they have per-cpu queues, and do not require any > special reclaim code. Being able to use the per-cpu queues is a big plus. > They also *actually do* reduce the page > management overhead in the general case, unlike higher order pcache. > > So combined with the accounting issues, I think it is unfair to say that > Andrea's is just moving the fragmentation to internal. It has a number > of upsides. I have no idea how it will actually behave and perform, mind > you ;) > Neither do I. Andrea's suggestion definitly has upsides. I'm just saying it's not going to cure cancer any better than the large block patchset ;) > > > > Plus there's a cost in defragging and freeing cache... the more you > > > need defrag, the slower the kernel will be. > > > > > > > approach in depth. > > > > > > Well it wasn't my fault if we didn't discuss it in depth though. > > > > If it's my fault, sorry about that. It wasn't my intention. > > I think it did get brushed aside a little quickly too (not blaming anyone). > Maybe because Linus was hostile. But *if* the idea is that page > management overhead has or will become a problem that needs fixing, > then neither higher order pagecache, nor (obviously) fsblock, fixes this > properly. Andrea's most definitely has the potential to. Fair point. What way to jump now is the question. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html