Re: [RFC PATCH v2 00/12] Introduce the famfs shared-memory file system

John Groves <John@xxxxxxxxxx> · Tue, 30 Apr 2024 21:09:18 -0500

On 24/04/29 11:11PM, Kent Overstreet wrote:
> On Mon, Apr 29, 2024 at 09:24:19PM -0500, John Groves wrote:
> > On 24/04/29 07:08PM, Kent Overstreet wrote:
> > > On Mon, Apr 29, 2024 at 07:32:55PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Apr 29, 2024 at 12:04:16PM -0500, John Groves wrote:
> > > > > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > > > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > > > > CXL-specific in anyway way.
> > > > > 
> > > > > * Famfs creates a simple access method for storing and sharing data in
> > > > >   sharable memory. The memory is exposed and accessed as memory-mappable
> > > > >   dax files.
> > > > > * Famfs supports multiple hosts mounting the same file system from the
> > > > >   same memory (something existing fs-dax file systems don't do).
> > > > 
> > > > Yes, but we do already have two filesystems that support shared storage,
> > > > and are rather more advanced than famfs -- GFS2 and OCFS2.  What are
> > > > the pros and cons of improving either of those to support DAX rather
> > > > than starting again with a new filesystem?
> > > 
> > > I could see a shared memory filesystem as being a completely different
> > > beast than a shared block storage filesystem - and I've never heard
> > > anyone talking about gfs2 or ocfs2 as codebases we particularly liked.
> > 
> > Thanks for your attention on famfs, Kent.
> > 
> > I think of it as a completely different beast. See my reply to Willy re:
> > famfs being more of a memory allocator with the benefit of allocations 
> > being accessible (and memory-mappable) as files.
> 
> That's pretty much what I expected.
> 
> I would suggest talking to RDMA people; RDMA does similar things with
> exposing address spaces across machine, and an "external" memory
> allocator is a basic building block there as well - it'd be great if we
> could get that turned into some clean library code.
> 
> GPU people as well, possibly.

Thanks for your attention Kent.

I'm on it. Part of the core idea behind famfs is that page-oriented data
movement can be avoided with actual shared memory. Yes, the memory is likely to 
be slower (either BW or latency or both) but it's cacheline access rather than 
full-page (or larger) retrieval, which is a win for some access patterns (and
not so for others).

Part of the issue is communicating the fact that shared access to cachelines
is possible.

There are some interesting possibilities with GPUs retrieving famfs files
(or portions thereof), but I have no insight as to the motivations of GPU 
vendors.

> 
> > The famfs user space repo has some good documentation as to the on-
> > media structure of famfs. Scroll down on [1] (the documentation from
> > the famfs user space repo). There is quite a bit of info in the docs
> > from that repo.
> 
> Ok, looking through that now.
> 
> So youv've got a metadata log; that looks more like a conventional
> filesystem than a conventional purely in-memory thing.
> 
> But you say it's a shared filesystem, and it doesn't say anything about
> that. Inter node locking?
> 
> Perhaps the ocfs2/gfs2 comparison is appropriate, after all.

Famfs is intended to be mounted from more than one host from the same in-memory
image. A metadata log is kinda the simpliest approach to make that work (let me
know your thoughts if you disagree on that). When a client mounts, playing the 
log from the shared memory brings that client mount into sync with the source 
(the Master).

No inter-node locking is currently needed because only the node that created
the file system (the Master) can write the log. Famfs is not intended to be 
a general-purpose FS...

The famfs log is currently append-only, and I think of it as a "code-first"
implementation of a shared memory FS that that gets the job done in something
approaching the simplest possible approach.

If the approach evolves to full allocate-on-write, then moving to a file system
platform that handles that would make sense. If it remains (as I suspect will
make sense) a way to share collections of data sets, or indexes, or other 
data that is published and then consumed [all or mostly] read-only, this
simple approach may be long-term sufficient.

Regards,
John