On Mon, Oct 18, 2021 at 3:55 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> On Mon, Oct 18, 2021 at 9:23 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Fri, Oct 15, 2021 at 6:06 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Zheng,
> >>
> >> Thanks for this really nice set of PRs -- we will try them at our site
> >> in the next weeks and try to come back with practical feedback.
> >> A few questions:
> >>
> >> 1. How many clients did you scale to, with improvements in the 2nd PR?
> >
> > We have FS clusters with over 10k clients. If you find that
> > CInode::get_caps_{issued,wanted} and/or EOpen::encode use lots of CPU,
> > that PR should help.
>
> That's an impressive number, relevant for our possible future plans.
>
> >> 2. Do these PRs improve the process of scaling up/down the number of active MDS?
> >
> > What problem did you encounter? Decreasing the number of active MDS works
> > well (although a little slowly) in my local test. Migrating a big subtree
> > (after increasing the number of active MDS) can cause slow ops; the 3rd PR
> > solves that.
>
> Stopping has in the past taken ~30mins, with slow requests while the
> pinned subtrees are re-imported.

This case should be improved by the 3rd PR. Subtree migrations are much smoother.

>
> -- dan
>
> >>
> >> Thanks!
> >>
> >> Dan
> >>
> >>
> >> On Wed, Sep 15, 2021 at 9:21 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >> >
> >> > The following PRs are optimizations we (Kuaishou) made for machine learning
> >> > workloads (randomly reading billions of small files).
> >> >
> >> > [1] https://github.com/ceph/ceph/pull/39315
> >> > [2] https://github.com/ceph/ceph/pull/43126
> >> > [3] https://github.com/ceph/ceph/pull/43125
> >> >
> >> > The first PR adds an option that disables dirfrag prefetch. When files
> >> > are accessed randomly, dirfrag prefetch adds lots of useless files to
> >> > the cache and causes cache thrashing. MDS performance can drop below
> >> > 100 RPS. When dirfrag prefetch is disabled, the MDS sends a getomapval
> >> > request to RADOS for each cache-missed lookup. A single MDS can handle
> >> > about 6k cache-missed lookup requests per second (all-SSD metadata pool).
> >> >
> >> > The second PR optimizes MDS performance for a large number of clients
> >> > and a large number of read-only opened files. It can also greatly
> >> > reduce MDS recovery time for read-mostly workloads.
> >> >
> >> > The third PR makes the MDS cluster randomly distribute all dirfrags. The
> >> > MDS uses a consistent hash to calculate the target rank for each dirfrag.
> >> > Compared to the dynamic balancer and subtree pinning, metadata can be
> >> > distributed among MDSs more evenly. Besides, the MDS only migrates a single
> >> > dirfrag (instead of a big subtree) for load balancing, so the MDS has a
> >> > shorter pause when doing metadata migration. The drawbacks of this
> >> > change are: stat(2) on a directory can be slow; rename(2) of a file to a
> >> > different directory can be slow. The reason is that, with random dirfrag
> >> > distribution, these operations likely involve multiple MDSs.
> >> >
> >> > The above three PRs are all merged into an integration branch
> >> > https://github.com/ukernel/ceph/tree/wip-mds-integration.
> >> >
> >> > We (Kuaishou) have run this code for months; a cluster of 16 active MDS
> >> > serves billions of small files. In the file random read test, a single MDS
> >> > can handle about 6k ops, and performance increases linearly with the
> >> > number of active MDS. In the file creation test (mpirun -np 160 -host
> >> > xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...), 16 active
> >> > MDS can serve over 100k file creations per second.
> >> >
> >> > Yan, Zheng
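
For readers unfamiliar with the idea behind the third PR, here is a minimal Python sketch of consistent hashing applied to dirfrag placement. It is not the code from the PR and makes no claim about how the MDS actually hashes: the `RankRing` class, the `target_rank` method, the MD5-based hash, the virtual-node count, and the `"<ino>.<frag>"` key format are all assumptions made up for illustration. It only shows the general property that makes the approach attractive: when a rank is added, only a small share of dirfrags change owner, and all of them go to the new rank.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Stable 64-bit hash (illustrative choice; not the hash Ceph uses)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class RankRing:
    """Toy consistent-hash ring mapping dirfrag keys to MDS ranks (hypothetical)."""

    def __init__(self, num_ranks: int, vnodes: int = 64):
        # Each rank owns several points on the ring to smooth the distribution.
        self._points = sorted(
            (_h(f"rank{r}#{v}"), r)
            for r in range(num_ranks)
            for v in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def target_rank(self, ino: int, frag: str) -> int:
        # Pick the first ring point clockwise from the key's hash; wrap around.
        i = bisect.bisect_right(self._keys, _h(f"{ino:x}.{frag}"))
        return self._points[i % len(self._points)][1]


# Made-up dirfrag identifiers: (inode number, frag label).
dirfrags = [(0x10000000000 + i, "0*") for i in range(10000)]

ring16 = RankRing(16)
ring17 = RankRing(17)
moved = sum(ring16.target_rank(*d) != ring17.target_rank(*d) for d in dirfrags)
print(f"dirfrags that change rank going from 16 to 17 MDS: {moved / len(dirfrags):.1%}")
```

With a ring like this, going from 16 to 17 ranks should move roughly one dirfrag in seventeen, whereas a plain `hash % num_ranks` scheme would remap most of them. The trade-off mentioned in the thread still applies: a directory's dirfrags and its children's inodes may land on different ranks, so operations such as stat(2) on a directory or a cross-directory rename(2) can involve more than one MDS.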