Re: MDS performance with 10 billion small sized files

Hey Anthony,

On Thu, Dec 1, 2022 at 7:50 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
I think what Gaurav is proposing is only 3000 volumes *as Ceph sees them*, with a filesystem structure within each. This is very similar to a thread from around a year ago from the Software Heritage folks.

Though in one spot he describes 10 billion files total, and in another it seems like 600 billion.

Sorry for the confusion; to be more precise, the workload looks more like the following (a quick sizing sketch follows the list):
* An average of 100 million files in each volume, with volumes scaling to around 1k, making it around 10 billion files at minimum across the filesystem, a major portion of them being 1-10 KB files
* Around 1000 nested directories, with around 100,000 files per directory
* Around 100,000 IOPS for reads, with the workload being mostly append-only and read-intensive
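For rough context, here is a quick sizing sketch of what the averages above imply; the 3-way replication factor below is just an assumed default, not a recommendation:

# Rough sizing sketch from the averages above: ~10 billion files of 1-10 KiB.
TOTAL_FILES = 10_000_000_000
MIN_KIB, MAX_KIB = 1, 10

payload_min_tib = TOTAL_FILES * MIN_KIB / 1024**3   # KiB -> TiB
payload_max_tib = TOTAL_FILES * MAX_KIB / 1024**3
print(f"Logical payload: ~{payload_min_tib:,.0f} - {payload_max_tib:,.0f} TiB")

# Assuming plain 3-way replication (an assumption, not a sizing recommendation):
print(f"Raw capacity at size=3: ~{3 * payload_min_tib:,.0f} - "
      f"{3 * payload_max_tib:,.0f} TiB, before any allocation overhead")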

Are there any limitations to this kind of workload in CephFS or is this something we can work out?
Any input would be appreciated! Thanks!

Since CSI is mentioned, I think RBD would fit.

One issue with RGW here is the object size. The 10 Billion Object project used 64 KB S3 objects and IIRC ran afoul of the min_alloc_size factor. Gaurav's file sizes are described as 1-10 KB, which would be subject to massive space amplification; I suspect that 1 KB might be an impractical value for min_alloc_size to compensate.
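To make the space-amplification point concrete, here is a rough back-of-envelope sketch (not a benchmark); the 4 KiB and 64 KiB values are simply assumed min_alloc_size settings used for comparison:

import math

# Back-of-envelope amplification if each small file/object is padded up to a
# multiple of min_alloc_size (ignores replication/EC and metadata overhead).
def amplification(file_kib: float, min_alloc_kib: float) -> float:
    allocated = math.ceil(file_kib / min_alloc_kib) * min_alloc_kib
    return allocated / file_kib

for min_alloc_kib in (4, 64):          # assumed min_alloc_size values (KiB)
    for file_kib in (1, 4, 10):        # file sizes from this thread (KiB)
        amp = amplification(file_kib, min_alloc_kib)
        print(f"min_alloc={min_alloc_kib:>2} KiB, file={file_kib:>2} KiB "
              f"-> ~{amp:.1f}x on-disk amplification")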
 
Gaurav, might you share more details about the nature of the workload?



On Dec 1, 2022, at 8:13 AM, Josh Salomon <jsalomon@xxxxxxxxxx> wrote:

As I understand it, the workload you describe is far more suitable for object (S3) storage than for file storage. These numbers have also been tested successfully with RGW.
Regards,

Josh


On Thu, Dec 1, 2022 at 3:07 PM Gaurav Sitlani <sitlanigaurav7@xxxxxxxxx> wrote:
Hey Cephers,

We have a CephFS use case with around 10 billion small files, each around 1 to 10 KB, and the following workload requirements:
  • Workloads might be very write intensive, for example when we need to perform backup recovery that writes hundreds of millions of files, or during large-scale imports that also happen from time to time.
  • The hierarchy is the following: CephFS CSI Kubernetes volumes that can reach a billion or more files each.
  • We expect up to 100,000 files per directory.

Overall it would consist of 3000 CephFS volumes, each about 20 to 30 TB in size (based on the PVC) and each holding about 200 million files.
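For clarity, multiplying out the per-volume figures stated above gives the aggregate scale being discussed (figures as stated, nothing assumed beyond them):

# Aggregate implied by the per-volume numbers above.
VOLUMES = 3000
FILES_PER_VOLUME = 200_000_000
TB_PER_VOLUME_MIN, TB_PER_VOLUME_MAX = 20, 30

print(f"Total files: {VOLUMES * FILES_PER_VOLUME:,}")         # 600,000,000,000
print(f"Total capacity: {VOLUMES * TB_PER_VOLUME_MIN:,} - "
      f"{VOLUMES * TB_PER_VOLUME_MAX:,} TB")                   # 60,000 - 90,000 TB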

While discussing this with Dan, he mentioned that one fear about a single FS with 10B files is that it would be impractical to scrub -- it would take too long -- and he suggested that we split this into several smaller clusters for this reason alone.
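To give a feel for why scrub duration is the worry, here is a purely hypothetical estimate; the per-second inode rates below are assumed placeholders, not measured MDS throughput:

# Hypothetical scrub-duration estimate for one filesystem with 10B inodes.
TOTAL_INODES = 10_000_000_000

# Assumed forward-scrub rates (inodes/second) -- placeholders only; real rates
# depend on MDS count, cache size, metadata pool media, and client load.
for inodes_per_sec in (5_000, 20_000, 100_000):
    days = TOTAL_INODES / inodes_per_sec / 86_400
    print(f"at {inodes_per_sec:>7,} inodes/s -> ~{days:,.0f} days per full scrub")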

We are looking for documentation, case studies, MDS tuning and configuration references, as well as examples, if anyone has any knowledge or suggestions about such a workload.

We also want to understand whether there are any limitations, and whether anyone has tested this kind of workload in a Ceph cluster.

Kind regards,
Gaurav
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



--
Regards.
Deepika
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
