Re: [00/41] Large Blocksize Support V7 (adds memmap support)

mel@xxxxxxxxx (Mel Gorman) · Tue, 11 Sep 2007 22:03:48 +0100

On (11/09/07 12:26), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees the kernel stacks or whatever not
> > > defragmentable allocation other allocation goes into the same 64k "not
> > > defragmentable" page. Not like with SGI design that a 8k kernel stack
> > > could be allocated in the first 64k page, and then another 8k stack
> > > could be allocated in the next 64k page, effectively pinning all 64k
> > > pages until Nick worst case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks gets split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nicks
> > approach is bullet-proof in the long run.
> 
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
> 

For mlock()ed sure. Not for pagetables though, kmalloc slabs etc. It
might be a non-issue as well. Like the large block patches, there are
aspects of Andrea's case that we simply do not know.

> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.

Regreattably, this is also woefully difficult to prove. For
fragmentation, I can look into having a more expensive version of
/proc/pagetypeinfo to give a detailed account of the current
fragmentation state but it's a side-issue.

> > >  We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required infact.
> >
> > You will need to take some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
> 
> Well yes and slab has issues today too with internal fragmentation,
> targetted reclaim and some (small) higher order allocations too today.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
> 

Well, we do extend the internal fragmentation problem. Previous, it was
inode, dcache and friends. Now we have to deal with internal
fragmentation related to page tables, per-cpu pages etc. Maybe they can
be solved too, but they are of similar difficulty to what Christoph
faces.

> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code.

Being able to use the per-cpu queues is a big plus.

> They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
> 
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
> 

Neither do I. Andrea's suggestion definitly has upsides. I'm just saying
it's not going to cure cancer any better than the large block patchset ;)

> 
> > > Plus there's a cost in defragging and freeing cache... the more you
> > > need defrag, the slower the kernel will be.
> > >
> > > > approach in depth.
> > >
> > > Well it wasn't my fault if we didn't discuss it in depth though.
> >
> > If it's my fault, sorry about that. It wasn't my intention.
> 
> I think it did get brushed aside a little quickly too (not blaming anyone).
> Maybe because Linus was hostile. But *if* the idea is that page
> management overhead has or will become a problem that needs fixing,
> then neither higher order pagecache, nor (obviously) fsblock, fixes this
> properly. Andrea's most definitely has the potential to.

Fair point.

What way to jump now is the question.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html