Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files

On 2/4/25 2:47 AM, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
>> On 2/3/25 4:22 PM, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM Ric Wheeler <ricwheeler@xxxxxxxxx> wrote:
>>>> I have always been super interested in how much we can push the
>>>> scalability limits of file systems and for the workloads we need to
>>>> support, we need to scale up to supporting absolutely ridiculously large
>>>> numbers of files (a few billion files doesn't meet the need of the
>>>> largest customers we support).
>>>
>>> Hi Ric,
>>>
>>> Since LSFMM is not about presentations, it would be better if the topic to
>>> discuss was trying to address specific technical questions that developers
>>> could discuss.
>>
>> Totally agree - from the ancient history of LSF (before MM or BPF!) we also
>> pushed for discussions over talks.
>>
>>> If a topic cannot generate a discussion on the list, it is not very
>>> likely that it will generate a discussion on-prem.
>>>
>>> Where does the scaling with the number of files in a filesystem affect
>>> existing filesystems? What are the limitations that you need to overcome?
>>
>> Local file systems like xfs running on "scale up" giant systems (think of
>> the old super sized HP Superdomes and the like) would be likely to handle
>> this well.
>
> We don't need "Big Iron" hardware to scale up to tens of billions of
> files in a single filesystem these days. A cheap server with 32p and
> a couple of hundred GB of RAM and a few NVMe SSDs is all that is
> really needed. We recently had an XFS user report over 16 billion
> files in a relatively small filesystem (a few tens of TB), most of
> which were reflink copied files (backup/archival storage farm).
>
> So, yeah, large file counts (i.e. tens of billions) in production
> systems aren't a big deal these days. There shouldn't be any
> specific issues at the OS/VFS layers supporting filesystems with
> inode counts in the billions - most of the problems with this are
> internal filesystem implementation issues. If there are any specific
> VFS level scalability issues you've come across, I'm all ears...
>
> -Dave.
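
To put those numbers in perspective, a rough back-of-envelope - assuming
XFS's default 512-byte inodes, since the actual geometry of that user's
filesystem wasn't stated:

    16e9 inodes * 512 bytes/inode ~= 8 TB of inode space alone,
    before directory blocks, refcount/reflink btrees and free space metadata.

So once most of the files are reflink copies, a "few tens of TB" filesystem
is dominated by metadata rather than data - the file count, not the data,
is what you end up sizing for.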

I remember fondly torturing xfs (and ext4 and btrfs) many years back with a billion small (empty) files on a SATA drive :)

For our workload though, we have a couple of requirements that prevent most customers from using a single server.

The first requirement is the need to keep a scary number of large tape drives/robots running at line rate - keeping all of those busy normally requires on the order of 5 servers with our existing stack, but larger systems can need more.
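
To give a sense of scale (illustrative numbers only, not a specific
deployment): a modern LTO-9 drive streams at roughly 400 MB/s native, so

    40 drives * 400 MB/s ~= 16 GB/s sustained, i.e.
    16 GB/s / 5 servers  ~= 3-4 GB/s that each server has to feed

and if a server falls below a drive's minimum streaming rate, the drive has
to stop and reposition ("shoe-shining"), which is exactly what we have to
avoid.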

The second requirement is the need for high availability - that led us to using a shared-disk backed file system (scoutfs), but others in this space have used CXFS and similar non-open-source file systems. The shared-disk/cluster file systems are where coarse-grained locking comes into conflict with concurrency.
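
The contention pattern is easy to sketch. The toy below is not scoutfs or
CXFS code - just a hand-wavy illustration of why one coarse lock covering a
whole directory serializes creates that are logically independent, while
finer-grained (per-entry) locks let them proceed in parallel:

    # Toy model only: "dir_lock" stands in for a cluster-wide lock on a
    # whole directory; "entry_locks" stands in for finer-grained locks.
    import threading, time

    dir_lock = threading.Lock()
    entry_locks = [threading.Lock() for _ in range(16)]

    def create_coarse(name):
        with dir_lock:                      # every create contends on one lock
            time.sleep(0.001)               # stand-in for the metadata update

    def create_fine(name):
        with entry_locks[hash(name) % 16]:  # only same-bucket names contend
            time.sleep(0.001)

    def run(create):
        threads = [threading.Thread(target=create, args=("file%d" % i,))
                   for i in range(64)]
        start = time.time()
        for t in threads: t.start()
        for t in threads: t.join()
        return time.time() - start

    print("coarse: %.3fs  fine: %.3fs" % (run(create_coarse), run(create_fine)))

In a real cluster the lock's ownership also has to bounce between nodes with
cache invalidation on every transfer, so the serialization cost is far worse
than a local mutex - but the shape of the problem is the same.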

What ngnfs is driving towards is being able to sustain that bandwidth requirement for the backend archival workflow and to support many billions of file objects in a high-availability system built from today's cutting-edge components.  Zach will jump in once he gets back, but my hand-wavy way of thinking about this is that ngnfs as a distributed file system is closer in design to how xfs would run on a huge system with coherence between NUMA zones.

regards,

Ric
