Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files

(Yay, back from travel!)

On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM Ric Wheeler <ricwheeler@xxxxxxxxx> wrote:
> >
> > Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
> > a good background on this -
> > https://www.fosdem.org/2025/schedule/speaker/zach_brown/).  We are
> > looking at taking advantage of modern low latency NVME devices and
> > today's networks to implement a distributed file system that provides
> > better concurrency that high object counts need and still have the
> > bandwidth needed to support the backend archival systems we feed.
> >
> 
> I heard this talk and it was very interesting.
> Here's a direct link to the slides for people who may be too lazy to
> follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
> 
> I was both very impressed by the cache coherent rename example
> and very puzzled - I do not know any filesystem where rename can be
> synchronized on a single block io, and looking up ancestors is usually
> done on in-memory dentries, so I may not have understood the example.

The meat of that talk was about how ngnfs uses its distributed block
cache as a serializing/coherence/consistency mechanism.  That specific
example was about how we can get concurrent rename between different
mounts without needing some global equivalent of a rename mutex.

The core of the mechanism is that code paths that implement operations
have a transactional object that holds on to cached block references
which have a given access mode granted over the network.  In this rename
case, the ancestor walk holds on to all the blocks for the duration of
the walk.  (That can be a lot of blocks!)  If another mount somewhere
else tried to modify one of those ancestor blocks, it would need to
revoke the first mount's cached read access before it could be granted
write access.  That revocation waits for the first rename to finish and
release its read refs.  This gives us serialization of access to the
specific blocks in question rather than relying on a global serializing
object over all renames.

That's the idea, anyway.  I'm implementing the first bits of this now.

It's sort of a silly example, because who puts cross-directory rename in
the fast path?  (Historically, some s3<->posix servers implemented
CompleteMultipartUpload by renaming from tmp dirs to visible bucket
dirs, hrmph.)  But it illustrates the pattern of shrinking contention
down to the block level.

- z
