Hi Zheng,

Thanks for this really nice set of PRs -- we will try them at our site
in the coming weeks and come back with practical feedback.

A few questions:

1. How many clients did you scale to with the improvements in the 2nd PR?
2. Do these PRs improve the process of scaling the number of active MDS
up or down?

Thanks!

Dan

On Wed, Sep 15, 2021 at 9:21 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> The following PRs are optimizations we (Kuaishou) made for machine
> learning workloads (randomly reading billions of small files).
>
> [1] https://github.com/ceph/ceph/pull/39315
> [2] https://github.com/ceph/ceph/pull/43126
> [3] https://github.com/ceph/ceph/pull/43125
>
> The first PR adds an option that disables dirfrag prefetch. When files
> are accessed randomly, dirfrag prefetch adds lots of useless files to
> the cache and causes cache thrashing; MDS performance can drop below
> 100 RPS. With dirfrag prefetch disabled, the MDS instead sends a
> getomapval request to RADOS for each cache-missed lookup. A single MDS
> can handle about 6k cache-missed lookup requests per second (all-SSD
> metadata pool).
>
> The second PR optimizes MDS performance for a large number of clients
> and a large number of read-only opened files. It can also greatly
> reduce MDS recovery time for read-mostly workloads.
>
> The third PR makes the MDS cluster distribute all dirfrags randomly.
> The MDS uses a consistent hash to calculate the target rank for each
> dirfrag. Compared to the dynamic balancer and subtree pinning, this
> distributes metadata among MDSs more evenly. Besides, the MDS only
> migrates a single dirfrag (instead of a big subtree) for load
> balancing, so the MDS pauses for a shorter time during metadata
> migration. The drawbacks of this change are that stat(2) on a
> directory can be slow and rename(2) of a file to a different directory
> can be slow, because with random dirfrag distribution these operations
> are likely to involve multiple MDSs.
>
> The above three PRs are all merged into an integration branch:
> https://github.com/ukernel/ceph/tree/wip-mds-integration.
>
> We (Kuaishou) have run this code for months; a 16-active-MDS cluster
> serves billions of small files. In a random file read test, a single
> MDS can handle about 6k ops, and performance increases linearly with
> the number of active MDSs. In a file creation test (mpirun -np 160
> -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...), 16
> active MDSs can serve over 100k file creations per second.
>
> Yan, Zheng
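
P.S. For anyone trying to picture the dirfrag placement described for the
third PR: below is only a rough sketch in Python of one common form of
consistent hashing (a hash ring with virtual nodes), not the actual code in
https://github.com/ceph/ceph/pull/43125; all names and parameters here
(build_ring, target_rank, VNODES, etc.) are made up for illustration.

# Sketch only: map a dirfrag (inode number + frag id) to an MDS rank
# via a consistent hash ring. Not the implementation from PR #43125.
import bisect
import hashlib

VNODES = 128  # virtual nodes per MDS rank (assumed value, for illustration)

def _h(s: str) -> int:
    # 64-bit hash derived from MD5, just to get a stable, well-mixed value
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def build_ring(num_active_mds: int):
    """Return a sorted list of (hash, rank) points forming the ring."""
    return sorted((_h(f"mds.{rank}.{v}"), rank)
                  for rank in range(num_active_mds)
                  for v in range(VNODES))

def target_rank(ring, ino: int, frag: int) -> int:
    """Pick the first ring point at or after the dirfrag's hash (wrapping)."""
    key = _h(f"{ino:x}.{frag:08x}")
    hashes = [h for h, _ in ring]
    i = bisect.bisect_right(hashes, key) % len(ring)
    return ring[i][1]

ring = build_ring(16)
print(target_rank(ring, 0x10000000000, 0))

The reason to use a ring rather than a plain hash(dirfrag) % num_ranks is
that changing the number of active MDSs then remaps only a small fraction
of dirfrags instead of almost all of them.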