On Mon, Oct 18, 2021 at 9:23 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Fri, Oct 15, 2021 at 6:06 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Hi Zheng,
>>
>> Thanks for this really nice set of PRs -- we will try them at our site
>> in the next weeks and try to come back with practical feedback.
>> A few questions:
>>
>> 1. How many clients did you scale to, with improvements in the 2nd PR?
>
> We have FS clusters with over 10k clients. If you find
> CInode::get_caps_{issued,wanted} and/or EOpen::encode use lots of CPU,
> that PR should help.

That's an impressive number, relevant for our possible future plans.

>
>> 2. Do these PRs improve the process of scaling up/down the number of
>> active MDS?
>
> What problem did you encounter? Decreasing active MDS works well
> (although a little slowly) in my local test. Migrating a big subtree
> (after increasing active MDS) can cause slow OPS; the 3rd PR solves
> that.

Stopping has in the past taken ~30 mins, with slow requests while the
pinned subtrees are re-imported.

-- dan

>
>>
>> Thanks!
>>
>> Dan
>>
>> On Wed, Sep 15, 2021 at 9:21 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> >
>> > The following PRs are optimizations we (Kuaishou) made for machine
>> > learning workloads (randomly reading billions of small files).
>> >
>> > [1] https://github.com/ceph/ceph/pull/39315
>> > [2] https://github.com/ceph/ceph/pull/43126
>> > [3] https://github.com/ceph/ceph/pull/43125
>> >
>> > The first PR adds an option that disables dirfrag prefetch. When
>> > files are accessed randomly, dirfrag prefetch adds lots of useless
>> > files to the cache and causes cache thrashing. MDS performance can
>> > drop below 100 RPS. When dirfrag prefetch is disabled, the MDS sends
>> > a getomapval request to RADOS for each cache-missed lookup. A single
>> > MDS can handle about 6k cache-missed lookup requests per second
>> > (all-SSD metadata pool).
>> >
>> > The second PR optimizes MDS performance for a large number of
>> > clients and a large number of read-only opened files. It can also
>> > greatly reduce MDS recovery time for read-mostly workloads.
>> >
>> > The third PR makes the MDS cluster randomly distribute all dirfrags.
>> > The MDS uses consistent hashing to calculate the target rank for
>> > each dirfrag. Compared to the dynamic balancer and subtree pinning,
>> > metadata is distributed among MDSs more evenly. Besides, the MDS
>> > only migrates a single dirfrag (instead of a big subtree) for load
>> > balancing, so it has shorter pauses when doing metadata migration.
>> > The drawbacks of this change are that stat(2) on a directory can be
>> > slow and rename(2) of a file to a different directory can be slow,
>> > because with random dirfrag distribution these operations likely
>> > involve multiple MDSs.
>> >
>> > The above three PRs are all merged into an integration branch:
>> > https://github.com/ukernel/ceph/tree/wip-mds-integration
>> >
>> > We (Kuaishou) have run this code for months; a 16-active-MDS cluster
>> > serves billions of small files. In a random file read test, a single
>> > MDS can handle about 6k ops, and performance increases linearly with
>> > the number of active MDSs. In a file creation test (mpirun -np 160
>> > -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...),
>> > 16 active MDSs can serve over 100k file creations per second.
>> >
>> > Yan, Zheng
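
For illustration only, here is a minimal python-rados sketch of what a
cache-missed lookup amounts to once dirfrag prefetch is disabled: a single
omap read from the dirfrag object instead of fetching the whole directory
fragment. The pool name, the dirfrag object name ("<ino hex>.<frag hex>")
and the dentry key ("<name>_head") follow the usual CephFS metadata layout
but should be treated as assumptions, not as the code path PR 39315
actually adds.

# Sketch (assumptions noted above): read one dentry's omap value from a
# dirfrag object in the metadata pool, mirroring a per-lookup getomapval.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("cephfs_metadata")   # assumed pool name
    read_op = ioctx.create_read_op()
    # One omap key per dentry lookup, rather than prefetching the dirfrag.
    it, _ = ioctx.get_omap_vals_by_keys(read_op, ("myfile_head",))
    ioctx.operate_read_op(read_op, "10000000000.00000000")  # assumed dirfrag object
    for name, raw_dentry in it:
        print(name, len(raw_dentry), "bytes of encoded dentry")
    ioctx.release_read_op(read_op)
    ioctx.close()
finally:
    cluster.shutdown()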
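
Likewise, a minimal sketch of the consistent-hashing idea behind the third
PR: each dirfrag hashes onto a ring of virtual points, so it maps to exactly
one MDS rank, and adding or removing a rank only moves roughly 1/N of the
dirfrags. The class name, key format and vnode count below are illustrative
assumptions, not the actual implementation in PR 43125.

# Sketch: map a dirfrag (parent inode + frag id) to an MDS rank with a
# consistent hash ring. Names and layout are assumptions for illustration.
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, ranks, vnodes=128):
        # Place `vnodes` virtual points per MDS rank on the ring so load
        # spreads evenly even with a small number of ranks.
        self._points = sorted(
            (_hash(f"mds.{rank}:{v}"), rank)
            for rank in ranks
            for v in range(vnodes)
        )
        self._keys = [p[0] for p in self._points]

    def target_rank(self, dirfrag_key: str) -> int:
        # Walk clockwise from the dirfrag's hash to the next virtual point;
        # that point's rank owns the dirfrag.
        idx = bisect.bisect(self._keys, _hash(dirfrag_key)) % len(self._points)
        return self._points[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(ranks=range(16))
    # A dirfrag is identified here by its parent inode number and frag id.
    print(ring.target_rank("0x10000000000.0x0"))
    # Removing or adding a rank only remaps ~1/N of the dirfrags, which is
    # why rebalancing can migrate single dirfrags rather than whole subtrees.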