Re: [RFC PATCH] xfs: support for non-mmu architectures

Brian Foster <bfoster@xxxxxxxxxx> · Mon, 23 Nov 2015 07:50:00 -0500

On Mon, Nov 23, 2015 at 09:04:00AM +1100, Dave Chinner wrote:
> On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote:
> > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > > > Those latter calls are all from following down through the
> > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > > > down into the same code. Indeed, we already protect ourselves here via
> > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> > > 
> > > Yes, we do, but that is separately handled to the allocation of the
> > > pages, which we have to do for all types of buffers, mapped or
> > > unmapped, because xfs_buf_ioapply_map() requires direct access to
> > > the underlying pages to build the bio for IO.  If we delegate the
> > > allocation of pages to vmalloc, we don't have direct reference to
> > > the underlying pages and so we have to do something completely
> > > different to build the bios for the buffer....
> > > 
> > 
> > Octavian points out virt_to_page() in a previous mail. I'm not sure
> > that's the right interface solely based on looking at some current
> > callers, but there is vmalloc_to_page() so I'd expect we can gain access
> > to the pages one way or another.
> 
> Sure, but these are not zero cost operations....
> 
> > Given that, the buffer allocation code
> > would fully populate the xfs_buf as it is today. The buffer I/O
> > submission code wouldn't really know the difference and shouldn't have
> > to change at all.
> 
> The abstraction results in more expensive/complex setup and teardown
> of buffers and/or IO submission. i.e. the use of vmalloc() based
> abstractions has an additional cost over what we do now.
> 
> [...]
> 

Yeah, most likely. The vmalloc based lookup certainly looks more
involved than if the pages are already readily accessible. How much more
expensive (and whether it's noticeable enough to care) is probably a
matter of testing. I think the code itself could end up much simpler,
however. There's still a lot of duplication in our current
implementation that afaiu right now only continues to exist due to the
performance advantages of vm_map_ram().
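
To make the comparison concrete, here is a rough userspace toy, not
kernel code. Every toy_* name is made up for illustration, and the
lookup helper is only a loose stand-in for something like
vmalloc_to_page(). It contrasts building an I/O vector from a page array
the buffer already holds with building it from nothing but a virtual
address, where each page costs one translation:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 16u

/* Hypothetical stand-in for struct page. */
struct toy_page { int id; };

/* Hypothetical stand-in for the page tables behind one vmalloc-style area. */
struct toy_vm_area {
	uintptr_t base;                        /* start of the virtual range */
	size_t nr_pages;                       /* pages backing the range */
	struct toy_page *pages[TOY_MAX_PAGES]; /* what a table walk would find */
	unsigned long lookups;                 /* number of simulated walks */
};

/* Loose analogue of an address-to-page translation; counts each walk. */
static struct toy_page *toy_addr_to_page(struct toy_vm_area *area,
					 uintptr_t addr)
{
	size_t idx;

	if (addr < area->base)
		return NULL;
	idx = (addr - area->base) / TOY_PAGE_SIZE;
	if (idx >= area->nr_pages)
		return NULL;
	area->lookups++;
	return area->pages[idx];
}

/* Case 1: the buffer owns its page array, so setup is a plain copy. */
static size_t toy_build_io_from_pages(struct toy_page **buf_pages, size_t nr,
				      struct toy_page **io_vec)
{
	size_t i;

	for (i = 0; i < nr; i++)
		io_vec[i] = buf_pages[i];
	return nr;
}

/* Case 2: only a virtual address is known, so every page needs a lookup. */
static size_t toy_build_io_from_addr(struct toy_vm_area *area, uintptr_t addr,
				     size_t nr, struct toy_page **io_vec)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		io_vec[i] = toy_addr_to_page(area, addr + i * TOY_PAGE_SIZE);
		if (!io_vec[i])
			return i; /* ran off the end of the mapped range */
	}
	return nr;
}
```

Both paths end up with the same vector of pages; the second just pays
one translation per page. In this toy the translation is an array index,
so it says nothing about how expensive the real lookup is. That part
still needs measuring.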

> > Either way, it would require significantly more investigation/testing to
> > enable generic usage. The core point was really just to abstract the
> > nommu changes into something that potentially has generic use.
> 
> I'm not saying that it is impossible to do this, just trying to work
> out if making any changes to support nommu architectures is worth
> the potential future trouble making such changes could bring us.
> i.e. before working out how to do something, we have to decide
> whether it is worth doing in the first place.
> 
> Just because you can do something doesn't automatically make it a
> good idea....
> 

Sure, good point. I've intentionally been sticking to the technical
points that have been raised (and I think we've covered it thoroughly to
this point ;).

FWIW, I agree with most of the concerns that have been called out in the
thread so far, but the support question ultimately sounds moot to me. We
can reject the "configuration" enabled by this particular patch, but
afaik this LKL thing could just go and implement an mmu mode and
fundamentally do what it's trying to do today. If it's useful, users
will use it, ultimately hit bugs, and we'll have to deal with it one way
or another.

Note that I'm not saying we have to support LKL. Rather, I view it as
equivalent to somebody off running some new non-standard/uncommon
architecture (or maybe a hypervisor is a better example in this case).
If the user runs into some low-level filesystem issue, the user probably
needs to get through whatever levels of support exist for that special
architecture/environment first and/or otherwise work with us to find a
reproducer on something more standard that we all have easy access to.
Just my .02, though. :)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
