Re: [00/41] Large Blocksize Support V7 (adds memmap support)

Nick Piggin <nickpiggin@xxxxxxxxxxxx> · Fri, 14 Sep 2007 07:20:05 +1000

On Thursday 13 September 2007 09:17, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> > I will still argue that my approach is the better technical solution for
> > large block support than yours, I don't think we made progress on that.
> > And I'm quite sure we agreed at the VM summit not to rely on your patches
> > for VM or IO scalability.
>
> The approach has already been tried (see the XFS layer) and found lacking.

It is lacking because our vmap algorithms are simplistic to the point
of being utterly inadequate for the new requirements. There has not
been any fundamental problem revealed (like the fragmentation 
approach has).

However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and falling
back to vmap if they cannot be satisfied.

So if you argue that vmap is a downside, then please tell me how you
consider the -ENOMEM of your approach to be better?

> Having a fake linear block through vmalloc means that a special software
> layer must be introduced and we may face special casing in the block / fs
> layer to check if we have one of these strange vmalloc blocks.

I guess you're a little confused. There is nothing fake about the linear
address. Filesystems of course need changes (the block layer needs
none -- why would it?). But changing APIs to accommodate a better
solution is what Linux is about.

If, by special software layer, you mean the vmap/vunmap support in
fsblock, let's see... that's probably all of a hundred or two lines.
Contrast that with anti-fragmentation, lumpy reclaim, higher order
pagecache and its new special mmap layer... Hmm, seems like a no
brainer to me. You really still want to persue the "extra layer"
argument as a point against fsblock here?

> > But you just showed in two emails that you don't understand what the
> > problem is. To reiterate: lumpy reclaim does *not* invalidate my
> > formulae; and antifrag does *not* isolate the issue.
>
> I do understand what the problem is. I just do not get what your problem
> with this is and why you have this drive to demand perfection. We are

Oh. I don't think I could explain that if you still don't understand by now.
But that's not the main issue: all that I ask is you consider fsblock on
technical grounds. 

> working a variety of approaches on the (potential) issue but you
> categorically state that it cannot be solved.

Of course I wouldn't state that. On the contrary, I categorically state that
I have already solved it :)

> > But what do you say about viable alternatives that do not have to
> > worry about these "unlikely scenarios", full stop? So, why should we
> > not use fs block for higher order page support?
>
> Because it has already been rejected in another form and adds more

You have rejected it. But they are bogus reasons, as I showed above.

You also describe some other real (if lesser) issues like number of page
structs to manage in the pagecache. But this is hardly enough to reject
my patch now... for every downside you can point out in my approach, I
can point out one in yours.

- fsblock doesn't require changes to virtual memory layer
- fsblock can retain cache of just 4K in a > 4K block size file

How about those? I know very well how Linus feels about both of them.

> Maybe we coud get to something like a hybrid that avoids some of these
> issues? Add support so something like a virtual compound page can be
> handled transparently in the filesystem layer with special casing if
> such a beast reaches the block layer?

That's conceptually much worse, IMO.

And practically worse as well: vmap space is limited on 32-bit; fsblock
approach can avoid vmap completely in many cases; for two reasons.

The fsblock data accessor APIs aren't _that_ bad changes. They change
zero conceptually in the filesystem, are arguably cleaner, and can be
essentially nooped if we wanted to stay with a b_data type approach
(but they give you that flexibility to replace it with any implementation).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html