Re: CephFS optimized for machine learning workload

On 9/15/21 11:05 PM, Yan, Zheng wrote:
On Wed, Sep 15, 2021 at 8:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Zheng,


This looks great!  Have you noticed any slowdowns during directory
splitting?  One of the things I was playing around with last year was
pre-fragmenting directories based on a user-supplied hint that the
directory would be big (falling back to normal behavior if it grows
beyond the hint size).  That way you can create the dirfrags up front
and do the migration before they ever have any associated files.  Do
you think that might be worth trying again given your PRs below?
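
(A rough sketch of what such a hint could translate into: assuming the
hint is an expected entry count and each fragment should stay under the
default mds_bal_split_size of 10000 dentries, the initial split depth
could be derived as below. The function and the hint mechanism are
hypothetical, not an existing CephFS interface.)

    import math

    def initial_split_bits(expected_entries: int, split_size: int = 10000) -> int:
        # Number of fragments needed so each stays under the split
        # threshold (split_size mirrors the default mds_bal_split_size of
        # 10000 dentries), expressed as a power-of-two split depth for the
        # dirfrag tree.
        frags_needed = max(1, math.ceil(expected_entries / split_size))
        return math.ceil(math.log2(frags_needed))

    # Example: a directory hinted to hold ~5 million entries would be
    # pre-split into 2**9 = 512 fragments.
    print(initial_split_bits(5_000_000))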


These PRs do not change the directory splitting logic, so they are
unlikely to improve the mdtest hard numbers. But they do remove the
overhead of journaling the subtree map and distribute metadata more
evenly, so they should improve the mdtest easy numbers. I think it's
worth a retest.

Yan, Zheng


I was mostly thinking about:

[3] https://github.com/ceph/ceph/pull/43125

Shouldn't this allow workloads like mdtest hard, where many clients perform file writes/reads/deletes inside a single directory (split into dirfrags randomly distributed across MDSs), to parallelize some of the work (minus whatever needs to be synchronized on the authoritative MDS)?

We discussed some of this in the performance standup today. From what I've seen, the real meat of the problem still rests in the distributed cache, locking, and cap revocation, but it seems like anything we can do to reduce the overhead of dirfrag migration is a win.

Mark






Mark


On 9/15/21 2:21 AM, Yan, Zheng wrote:
The following PRs are optimizations we (Kuaishou) made for machine
learning workloads (randomly reading billions of small files).

[1] https://github.com/ceph/ceph/pull/39315
[2] https://github.com/ceph/ceph/pull/43126
[3] https://github.com/ceph/ceph/pull/43125

The first PR adds an option that disables dirfrag prefetch. When files
are accessed randomly, dirfrag prefetch loads lots of unneeded files
into the cache and causes cache thrashing; MDS performance can drop
below 100 requests per second. When dirfrag prefetch is disabled, the
MDS instead sends a getomapval request to RADOS for each cache-missed
lookup. A single MDS can handle about 6k cache-missed lookup requests
per second (all-SSD metadata pool).
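
(As a rough illustration of what such a cache-missed lookup amounts to
at the RADOS level, here is a minimal sketch using the python-rados
bindings. It assumes the usual CephFS metadata layout, i.e. dirfrag
objects named <inode-hex>.<frag-hex> whose omap keys are
<dentry-name>_head; the pool, object, and dentry names below are made
up, not taken from the PR.)

    # Sketch: fetch one dentry from a dirfrag object's omap, roughly what
    # the MDS does per cache-missed lookup once prefetch is disabled.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('cephfs_metadata')  # metadata pool name varies
        with rados.ReadOpCtx() as read_op:
            # Dirfrag object "10000000001.00000000": directory inode
            # 0x10000000001, fragment 0.  The omap key for file
            # "img_000123.jpg" carries a "_head" suffix (non-snapshot version).
            it, rc = ioctx.get_omap_vals_by_keys(read_op, ("img_000123.jpg_head",))
            ioctx.operate_read_op(read_op, "10000000001.00000000")
            for name, value in it:
                print(name, len(value), "bytes of encoded dentry")
        ioctx.close()
    finally:
        cluster.shutdown()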

The second PR optimizes MDS performance for a large number of clients
and a large number of read-only opened files. It can also greatly
reduce MDS recovery time for read-mostly workloads.

The third PR makes the MDS cluster randomly distribute all dirfrags.
The MDS uses a consistent hash to calculate the target rank for each
dirfrag. Compared to the dynamic balancer and subtree pinning, this
distributes metadata among MDSs more evenly. Besides, the MDS only
migrates a single dirfrag (instead of a big subtree) for load
balancing, so it pauses for a shorter time during metadata migration.
The drawbacks of this change are that stat(2) on a directory and
rename(2) of a file into a different directory can be slow, because
with random dirfrag distribution these operations likely involve
multiple MDSs.
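
(For intuition, here is a small sketch of mapping a dirfrag to a rank
with a consistent, rendezvous-style hash. It is not the actual code or
hash function from PR 43125; the function name and parameters are made
up.)

    import hashlib

    def dirfrag_target_rank(dir_ino: int, frag: int, num_active_mds: int) -> int:
        # Rendezvous (highest-random-weight) hashing: score every rank
        # against the dirfrag and pick the highest.  Adding or removing a
        # rank only remaps the dirfrags that scored highest on that rank,
        # which is what keeps migrations small.
        best_rank, best_score = 0, -1
        for rank in range(num_active_mds):
            h = hashlib.md5(f"{dir_ino:x}.{frag:08x}/{rank}".encode()).digest()
            score = int.from_bytes(h[:8], "big")
            if score > best_score:
                best_rank, best_score = rank, score
        return best_rank

    # Example: fragment 0 of directory inode 0x10000000001 on 16 active MDSs.
    print(dirfrag_target_rank(0x10000000001, 0, 16))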

The above three PRs are all merged into an integration branch:
https://github.com/ukernel/ceph/tree/wip-mds-integration.

We (Kuaishou) have been running this code for months; a 16-active-MDS
cluster serves billions of small files. In a random file read test, a
single MDS can handle about 6k ops, and performance increases linearly
with the number of active MDSs. In a file creation test (mpirun -np 160
-host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...), 16
active MDSs can serve over 100k file creations per second.

Yan, Zheng




