On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > >    Large block sizes mean less state keeping for the information being
> > >    transferred. For a 1TB file one needs to handle 256 million page
> > >    structs in the VM if one uses a 4k page size. A 64k page size reduces
> > >    that amount to 16 million. If the limitations in existing filesystems
> > >    are removed then even higher reductions become possible. For very
> > >    large files like that, a page size of 2MB may be beneficial, which
> > >    will reduce the number of page structs to handle to 512k. The
> > >    variable nature of the block size means that the size can be tuned
> > >    at filesystem creation time for the anticipated needs on a volume.
> >
> > The idea that there even _is_ a bug to fail when higher-order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit. I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> While I agree with your concern, those numbers are quite silly.

They are the theoretical worst case. Obviously, with a non-trivially
sized system and a non-DoS workload, they will not be reached.

> The chances of 99.8% of pages being free and the remaining 0.2% being
> perfectly spread across all 2MB large_pages are lower than those of SHA1
> creating a collision. I don't see anyone abandoning git or rsync, so
> your extreme example is clearly the wrong one.
>
> Again, I agree with your concern, even though your example makes it look
> silly.

It is not simply a question of a once-off chance for an all-at-once
layout to fail in this way. Fragmentation slowly builds over time, and
especially if you do actually use higher-order pages for a significant
number of things (which we do not do today), then the problem will
become worse.

If any part of your workload is affected by fragmentation, then it will
cause unfragmented regions to eventually be used for
fragmentation-inducing allocations (by definition -- if it did not,
then there would be no fragmentation problem and no need for Mel's
patches).

I don't know what happens as time tends towards infinity, but I don't
think it will be good.

At millions of allocations per second, how long does it take to produce
an unacceptable number of free pages before the ENOMEM condition?
Furthermore, what *is* an unacceptable number? I don't know. I am not
trying to push this feature in, so the burden is not mine to make sure
it is OK.

Yes, we already have some of these problems today. But introducing more
and worse problems, and justifying them because of the existing ones,
is much sillier than my quoting of the numbers, IMO.
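
For anyone who wants to check the worst-case arithmetic behind the 93%
and 99.8% figures: the failure mode is a single pinned 4K page sitting
in every higher-order block. The sketch below works it through in
userspace C; the 4K base page size and the 32TB machine size are taken
from the discussion above, and the program is purely illustrative, not
code from any kernel tree.

#include <stdio.h>

/*
 * Worst-case fragmentation sketch: one pinned 4K base page in every
 * higher-order block makes every block unusable for a large
 * allocation, so the allocator can return ENOMEM with almost all
 * memory free. Assumes 4K base pages and 32TB total, per the thread.
 */
int main(void)
{
        const unsigned long long base = 4096;           /* 4K base page  */
        const unsigned long long mem = 32ULL << 40;     /* 32TB machine  */
        const unsigned long long sizes[] = { 64 << 10, 2 << 20 };

        for (int i = 0; i < 2; i++) {
                unsigned long long pages = sizes[i] / base;
                /* one pinned base page per block is enough to fail */
                double free_pct = 100.0 * (pages - 1) / pages;
                unsigned long long used_gb = (mem / pages) >> 30;

                printf("%4lluK blocks: ENOMEM with %.2f%% free, "
                       "%lluGB pinned\n",
                       sizes[i] >> 10, free_pct, used_gb);
        }
        return 0;
}

Run on those numbers, it reproduces the figures quoted above: 64K
blocks fail with 15/16 = 93.75% free, and 2MB blocks fail with
511/512 = 99.80% free, which on a 32TB machine is 32TB/512 = 64GB of
memory actually in use.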