On Fri, Sep 17, 2021 at 12:14 AM Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>
> On 9/15/21 11:05 PM, Yan, Zheng wrote:
> > On Wed, Sep 15, 2021 at 8:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >>
> >> Hi Zheng,
> >>
> >> This looks great! Have you noticed any slow performance during
> >> directory splitting? One of the things I was playing around with last
> >> year was pre-fragmenting directories based on a user supplied hint that
> >> the directory would be big (falling back to normal behavior if it grows
> >> beyond the hint size). That way you can create the dirfrags upfront and
> >> do the migration before they ever have any associated files. Do you
> >> think that might be worth trying again given your PRs below?
> >>
> >
> > These PRs do not change the directory splitting logic. It's unlikely they
> > will improve the performance numbers of the mdtest hard test. But these
> > PRs remove the overhead of journaling the subtreemap and distribute
> > metadata more evenly. They should improve the performance numbers of the
> > mdtest easy test. So I think it's worth a retest.
> >
> > Yan, Zheng
>
> I was mostly thinking about:
>
> [3] https://github.com/ceph/ceph/pull/43125
>
> Shouldn't this allow workloads like mdtest hard where you have many
> clients performing file writes/reads/deletes inside a single directory
> (that is split into dirfrags randomly distributed across MDSes) to
> parallelize some of the work? (minus whatever needs to be synchronized
> on the authoritative mds)
>

The triggers for dirfrag migration in this PR are mkdir and dirfrag
fetch. A dirfrag first needs to be split, then it gets migrated. I don't
know how long these events take in mdtest hard, or how the pauses for
split/migration affect the test result.

> We discussed some of this in the performance standup today. From what
> I've seen the real meat of the problem still rests in the distributed
> cache, locking, and cap revocation,

For the performance of a single thread or a single MDS, yes.
The purpose of PR 43125 is to distribute metadata more evenly and improve
aggregate performance.

Yan, Zheng

> but it seems like anything we can do
> to reduce the overhead of dirfrag migration is a win.
>
> Mark
>
> >
> >>
> >> Mark
> >>
> >>
> >> On 9/15/21 2:21 AM, Yan, Zheng wrote:
> >>> The following PRs are optimizations we (Kuaishou) made for machine
> >>> learning workloads (randomly reading billions of small files).
> >>>
> >>> [1] https://github.com/ceph/ceph/pull/39315
> >>> [2] https://github.com/ceph/ceph/pull/43126
> >>> [3] https://github.com/ceph/ceph/pull/43125
> >>>
> >>> The first PR adds an option that disables dirfrag prefetch. When files
> >>> are accessed randomly, dirfrag prefetch adds lots of useless files to
> >>> the cache and causes cache thrashing. MDS performance can drop below
> >>> 100 RPS. When dirfrag prefetch is disabled, the MDS sends a getomapval
> >>> request to RADOS for each cache-missed lookup. A single MDS can handle
> >>> about 6k cache-missed lookup requests per second (all-SSD metadata
> >>> pool).
> >>>
> >>> The second PR optimizes MDS performance for a large number of clients
> >>> and a large number of files opened read-only. It can also greatly
> >>> reduce MDS recovery time for read-mostly workloads.
> >>>
> >>> The third PR makes the MDS cluster distribute all dirfrags randomly.
> >>> The MDS uses a consistent hash to calculate the target rank for each
> >>> dirfrag. Compared to the dynamic balancer and subtree pinning, metadata
> >>> can be distributed among MDSs more evenly. Besides, the MDS migrates
> >>> only a single dirfrag (instead of a big subtree) for load balancing, so
> >>> the MDS has a shorter pause when doing metadata migration. The
> >>> drawbacks of this change are: stat(2) on a directory can be slow;
> >>> rename(2) of a file to a different directory can be slow. The reason
> >>> is that, with random dirfrag distribution, these operations likely
> >>> involve multiple MDSs.
> >>>
> >>> The above three PRs are all merged into an integration branch:
> >>> https://github.com/ukernel/ceph/tree/wip-mds-integration
> >>>
> >>> We (Kuaishou) have run this code for months; a cluster with 16 active
> >>> MDSs serves billions of small files. In the file random read test, a
> >>> single MDS can handle about 6k ops, and performance increases linearly
> >>> with the number of active MDSs. In the file creation test (mpirun -np
> >>> 160 -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...),
> >>> 16 active MDSs can serve over 100k file creations per second.
> >>>
> >>> Yan, Zheng
> >>>
> >>
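
The dirfrag-to-rank mapping described for the third PR ("MDS uses
consistent hash to calculate target rank for each dirfrag") can be
illustrated with a small consistent-hash ring. This is only a sketch and
not the code from PR 43125: the class name RankRing, the vnodes
parameter, the SHA-256 based hash, and the "inode.frag" key format are
all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash64(key: str) -> int:
    """Stable 64-bit hash (assumption: any well-mixed hash would do)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class RankRing:
    """Hypothetical consistent-hash ring mapping dirfrags to MDS ranks."""

    def __init__(self, ranks, vnodes=64):
        # Each rank owns several points ("virtual nodes") on the ring,
        # which smooths out the share of dirfrags each rank receives.
        self._points = sorted(
            (_hash64(f"rank{r}#{v}"), r) for r in ranks for v in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def target_rank(self, ino: int, frag: str) -> int:
        # Hash the dirfrag identity, then walk clockwise to the next
        # ring point; the rank that owns that point is the target.
        h = _hash64(f"{ino:x}.{frag}")
        i = bisect.bisect_right(self._keys, h) % len(self._keys)
        return self._points[i][1]


if __name__ == "__main__":
    ring = RankRing(range(16))  # 16 active ranks, as in the cluster above
    print(ring.target_rank(0x10000000001, "0*"))
```

One property worth noting in this sketch: when a 17th rank is added, a
dirfrag either keeps its previous target or moves to the new rank, so
growing the cluster does not reshuffle dirfrags among the existing ranks.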