Lustre has production filesystems with hundreds of billions of files today, with coherent renames running across dozens of servers. We've relaxed the rename locking at the server to allow concurrent renames of regular files within the same server, and of directories that stay within the same parent (so they cannot break the namespace hierarchy). These are still subject to VFS serialization on a single client node, but hopefully Neil's parallel dirops patch will eventually land.

Cheers, Andreas

> On Feb 6, 2025, at 13:59, Zach Brown <zab@xxxxxxxxx> wrote:
>
>
> (Yay, back from travel!)
>
>> On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM Ric Wheeler <ricwheeler@xxxxxxxxx> wrote:
>>>
>>> Zach Brown is leading a new project on ngnfs (his FOSDEM talk this year
>>> gave good background on this -
>>> https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
>>> looking at taking advantage of modern low-latency NVMe devices and
>>> today's networks to implement a distributed file system that provides
>>> the concurrency that high object counts need and still has the
>>> bandwidth needed to support the backend archival systems we feed.
>>>
>>
>> I heard this talk and it was very interesting.
>> Here's a direct link to the slides for people who may be too lazy to
>> follow 3 clicks:
>> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>>
>> I was both very impressed by the cache-coherent rename example
>> and very puzzled - I do not know of any filesystem where rename can be
>> synchronized on a single block I/O, and looking up ancestors is usually
>> done on in-memory dentries, so I may not have understood the example.
>
> The meat of that talk was about how ngnfs uses its distributed block
> cache as a serializing/coherence/consistency mechanism. That specific
> example was about how we can get concurrent rename between different
> mounts without needing some global equivalent of the rename mutex.
>
> The core of the mechanism is that code paths that implement operations
> have a transactional object that holds on to cached block references
> which have a given access mode granted over the network. In this rename
> case, the ancestor walk holds on to all the blocks for the duration of
> the walk. (That can be a lot of blocks!) If another mount somewhere else
> tried to modify those ancestor blocks, it would need the cached read
> access to be revoked before it could be granted write access. That'd wait
> for the first rename to finish and release the read refs. This gives us
> specific serialization of access to the blocks in question rather than
> relying on a global serializing object over all renames.
>
> That's the idea, anyway. I'm implementing the first bits of this now.
>
> It's sort of a silly example, because who puts cross-directory rename in
> the fast path? (Historically some s3<->posix servers implemented
> CompleteMultipartUpload by renaming from tmp dirs to visible bucket
> dirs, hrmph.) But it illustrates the pattern of shrinking contention
> down to the block level.
>
> - z
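
P.S. For readers following along, here is a minimal sketch of the relaxed
Lustre rename rule described at the top: regular-file renames may run
concurrently when source and target are on the same server, and directory
renames only when they stay within one parent. This is not Lustre's actual
MDS code; the types, field names, and rename_allows_concurrency() helper
are all invented for illustration.

#include <stdbool.h>

struct inode_desc {
        unsigned long ino;
        int           server_id;  /* which server holds this inode */
        bool          is_dir;
};

/*
 * Decide whether a rename may proceed concurrently with others,
 * or must fall back to the serialized (global-lock) path.
 */
static bool rename_allows_concurrency(const struct inode_desc *src,
                                      const struct inode_desc *src_dir,
                                      const struct inode_desc *tgt_dir)
{
        if (src->server_id != tgt_dir->server_id)
                return false;           /* cross-server: serialize */

        if (!src->is_dir)
                return true;            /* regular file, same server */

        /*
         * A directory rename can only run concurrently if it stays
         * within the same parent, so it cannot reshape the hierarchy
         * (no risk of creating a loop or orphaning a subtree).
         */
        return src_dir->ino == tgt_dir->ino;
}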
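P.P.S. And a toy pthreads sketch of the block-granular mechanism Zach
describes above: a transaction accumulates cached block refs in a granted
access mode, and an incompatible acquirer blocks only on those blocks
until the holder releases them, rather than on a global rename lock.
This is not ngnfs code; all names here are made up, and a condition
variable stands in for the network grant/revoke protocol.

#include <pthread.h>

enum gmode { GRANT_READ, GRANT_WRITE };

struct block {
        pthread_mutex_t lock;
        pthread_cond_t  drained;
        int             readers;  /* read grants outstanding */
        int             writer;   /* nonzero while write grant held */
};

#define TXN_MAX_BLOCKS 64

struct txn {
        struct { struct block *blk; enum gmode mode; } held[TXN_MAX_BLOCKS];
        int nr;
};

/*
 * Acquire 'mode' on 'blk' for transaction 't', waiting until any
 * incompatible grant drains (the stand-in for revoking another
 * mount's cached access over the network).
 */
static void txn_acquire(struct txn *t, struct block *blk, enum gmode mode)
{
        pthread_mutex_lock(&blk->lock);
        if (mode == GRANT_READ) {
                while (blk->writer)
                        pthread_cond_wait(&blk->drained, &blk->lock);
                blk->readers++;
        } else {
                while (blk->writer || blk->readers)
                        pthread_cond_wait(&blk->drained, &blk->lock);
                blk->writer = 1;
        }
        pthread_mutex_unlock(&blk->lock);

        t->held[t->nr].blk = blk;
        t->held[t->nr].mode = mode;
        t->nr++;
}

/*
 * Drop every ref the transaction accumulated and wake waiters; the
 * rename's ancestor-walk read refs would all be released here.
 */
static void txn_release_all(struct txn *t)
{
        for (int i = 0; i < t->nr; i++) {
                struct block *blk = t->held[i].blk;

                pthread_mutex_lock(&blk->lock);
                if (t->held[i].mode == GRANT_READ)
                        blk->readers--;
                else
                        blk->writer = 0;
                pthread_cond_broadcast(&blk->drained);
                pthread_mutex_unlock(&blk->lock);
        }
        t->nr = 0;
}

A rename would txn_acquire(GRANT_READ) each ancestor block during its
walk; a writer touching one of those blocks waits in
txn_acquire(GRANT_WRITE) only on that specific block, so unrelated
renames never contend with each other.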