On Wed, 2021-07-21 at 21:42 +0800, Xiubo Li wrote:
> On 7/21/21 8:57 PM, Jeff Layton wrote:
> > On Wed, 2021-07-21 at 19:54 +0800, Xiubo Li wrote:
> > > On 7/21/21 7:23 PM, Jeff Layton wrote:
> > > > On Wed, 2021-07-21 at 16:27 +0800, xiubli@xxxxxxxxxx wrote:
> > > > > From: Xiubo Li <xiubli@xxxxxxxxxx>
> > > > >
> > > > > The delayed_work is executed every 5 seconds; during this time
> > > > > the cap_delay_list may accumulate thousands of caps that need to
> > > > > be flushed. This can fill the MDS's dispatch queue, which then
> > > > > needs a very long time to work through them. And if there are
> > > > > other operations, such as an rmdir request, they will be added to
> > > > > the tail of the dispatch queue and may need to wait for several
> > > > > seconds, or even tens of seconds.
> > > > >
> > > > > On the client side we shouldn't queue too many cap requests, and
> > > > > should flush them once there are more than 100 items.
> > > > >
> > > > Why 100? My inclination is to say NAK on this.
> > > This is just from my testing: around 100 queued client_caps requests
> > > work fine in most cases and don't take too long to handle. Sometimes
> > > the client will send thousands of requests in a short time, and that
> > > becomes a problem.
> > What may be a better approach is to figure out why we're holding on to
> > so many caps and trying to flush them all at once. Maybe if we were to
> > more aggressively flush sooner, we'd not end up with such a backlog?
> Yeah, I am still working on this.
>
> From my analysis, some fixing or improvement like this will help reduce it:
>
> commit e64f44a884657358812e6c057957be546db03cbe
> Author: Xiubo Li <xiubli@xxxxxxxxxx>
> Date:   Wed May 27 09:09:27 2020 -0400
>
>     ceph: skip checking caps when session reconnecting and releasing reqs
>
>     It make no sense to check the caps when reconnecting to mds. And
>     for the async dirop caps, they will be put by its _cb() function,
>     so when releasing the requests, it will make no sense too.
>
> From the test result, it seems that some caps-related requests were
> still pending in the cap_delay_list even after the rmdir requests were
> already sent out and finished for the previous directories. I am still
> confirming this.
>

Ok, that at least sounds superficially similar to this bug that Patrick
and I have been chasing:

https://tracker.ceph.com/issues/51279

It's possible that some of these have the same root cause.

> >
> > > > I'm not a fan of this sort of arbitrary throttling in order to
> > > > alleviate a scalability problem in the MDS. The cap delay logic is
> > > > already horrifically complex as well, and this just makes it worse.
> > > > If the MDS happens to be processing requests from a lot of clients,
> > > > this may not help at all.
> > > Yeah, right, in that case the mds dispatch queue will possibly get
> > > even more in a short time.
> > >
> > > > > URL: https://tracker.ceph.com/issues/51734
> > > > The core problem sounds like the MDS got a call and just took way too
> > > > long to get to it. Shouldn't we be addressing this in the MDS instead?
> > > > It sounds like it needs to do a better job of parallelizing cap
> > > > flushes.
> > > The problem is that almost all the requests need to acquire the global
> > > 'mds_lock', so parallelizing won't improve much.
> > >
> > Yuck. That sounds like the root cause right there. For the record, I'm
> > against papering over that with client patches.
> >
> > > Currently all the client-side requests have the priority
> > > CEPH_MSG_PRIO_DEFAULT:
> > >
> > >     m->hdr.priority = cpu_to_le16(CEPH_MSG_PRIO_DEFAULT);
> > >
> > > Not sure whether we can just give the cap flushing requests a lower
> > > priority? If this won't introduce other potential issues, it should
> > > work, because the MDS dispatch queue handles high-priority requests
> > > first.
> > >
> > Tough call. You might mark it for lower priority, but then someone calls
> > fsync() and suddenly you need it to finish sooner. I think that sort of
> > change might backfire in some cases.
> >
> Yeah, I am also worrying about this.
>
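Just so we're talking about the same thing, that change would presumably
amount to a one-liner in the cap send path. Here's an untested sketch;
CEPH_MSG_PRIO_LOW is the existing messenger constant, but the helper
name is made up and I'd only expect it to be called from somewhere like
send_cap_msg():

/*
 * Untested sketch: override the CEPH_MSG_PRIO_DEFAULT priority that the
 * messenger assigns to new messages, so that background cap flushes
 * sort behind interactive requests in the MDS dispatch queue.
 */
static void ceph_mark_cap_msg_low_prio(struct ceph_msg *msg)
{
	msg->hdr.priority = cpu_to_le16(CEPH_MSG_PRIO_LOW);
}

...but that runs straight into the fsync() problem above, so I'm not
convinced it's safe.
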
> > Usually when faced with this sort of issue, the right answer is to try
> > to work ahead of the "spikes" in activity so that you don't have quite
> > such a backlog when you do need to flush everything out. I'm not sure
> > how possible that is here.
> >
> > One underlying issue may be that some of the client's background
> > activity is currently driven by arbitrary timers (e.g. the delayed_work
> > timer). When the timer pops, it suddenly has to do a bunch of work,
> > when some of it could have been done earlier so that the load was
> > spread out.
> >
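To make that concrete (again, purely an untested sketch with an
arbitrary threshold, and the helper name is made up): instead of calling
flush_delayed_work() from the requeue paths, which blocks the caller,
the same num_cap_delay counter could just pull the next delayed_work run
forward, so the flushing stays in the workqueue and the load gets spread
out:

/*
 * Untested sketch: when the delay list is getting long, shorten the
 * interval to the next delayed_work run (normally rescheduled every
 * 5 seconds via schedule_delayed()) instead of flushing synchronously.
 * The threshold mirrors the arbitrary one in the patch below.
 */
static void cap_delay_maybe_expedite(struct ceph_mds_client *mdsc, int num)
{
	if (num > 100)
		mod_delayed_work(system_wq, &mdsc->delayed_work,
				 round_jiffies_relative(HZ / 10));
}

That still doesn't answer why the list gets so long in the first place,
which is the part I'd really like to understand.
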
> > > > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/ceph/caps.c       | 21 ++++++++++++++++++++-
> > > > >  fs/ceph/mds_client.c |  3 ++-
> > > > >  fs/ceph/mds_client.h |  3 +++
> > > > >  3 files changed, 25 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > index 4b966c29d9b5..064865761d2b 100644
> > > > > --- a/fs/ceph/caps.c
> > > > > +++ b/fs/ceph/caps.c
> > > > > @@ -507,6 +507,8 @@ static void __cap_set_timeouts(struct ceph_mds_client *mdsc,
> > > > >  static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
> > > > >  				struct ceph_inode_info *ci)
> > > > >  {
> > > > > +	int num = 0;
> > > > > +
> > > > >  	dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
> > > > >  	     ci->i_ceph_flags, ci->i_hold_caps_max);
> > > > >  	if (!mdsc->stopping) {
> > > > > @@ -515,12 +517,19 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
> > > > >  			if (ci->i_ceph_flags & CEPH_I_FLUSH)
> > > > >  				goto no_change;
> > > > >  			list_del_init(&ci->i_cap_delay_list);
> > > > > +			mdsc->num_cap_delay--;
> > > > >  		}
> > > > >  		__cap_set_timeouts(mdsc, ci);
> > > > >  		list_add_tail(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
> > > > > +		num = ++mdsc->num_cap_delay;
> > > > >  no_change:
> > > > >  		spin_unlock(&mdsc->cap_delay_lock);
> > > > >  	}
> > > > > +
> > > > > +	if (num > 100) {
> > > > > +		flush_delayed_work(&mdsc->delayed_work);
> > > > > +		schedule_delayed(mdsc);
> > > > > +	}
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -531,13 +540,23 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
> > > > >  static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
> > > > >  				      struct ceph_inode_info *ci)
> > > > >  {
> > > > > +	int num;
> > > > > +
> > > > >  	dout("__cap_delay_requeue_front %p\n", &ci->vfs_inode);
> > > > >  	spin_lock(&mdsc->cap_delay_lock);
> > > > >  	ci->i_ceph_flags |= CEPH_I_FLUSH;
> > > > > -	if (!list_empty(&ci->i_cap_delay_list))
> > > > > +	if (!list_empty(&ci->i_cap_delay_list)) {
> > > > >  		list_del_init(&ci->i_cap_delay_list);
> > > > > +		mdsc->num_cap_delay--;
> > > > > +	}
> > > > >  	list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
> > > > > +	num = ++mdsc->num_cap_delay;
> > > > >  	spin_unlock(&mdsc->cap_delay_lock);
> > > > > +
> > > > > +	if (num > 100) {
> > > > > +		flush_delayed_work(&mdsc->delayed_work);
> > > > > +		schedule_delayed(mdsc);
> > > > > +	}
> > > > >  }
> > > > >
> > > > >  /*
> > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > > index 79aa4ce3a388..14e44de05812 100644
> > > > > --- a/fs/ceph/mds_client.c
> > > > > +++ b/fs/ceph/mds_client.c
> > > > > @@ -4514,7 +4514,7 @@ void inc_session_sequence(struct ceph_mds_session *s)
> > > > >  /*
> > > > >   * delayed work -- periodically trim expired leases, renew caps with mds
> > > > >   */
> > > > > -static void schedule_delayed(struct ceph_mds_client *mdsc)
> > > > > +void schedule_delayed(struct ceph_mds_client *mdsc)
> > > > >  {
> > > > >  	int delay = 5;
> > > > >  	unsigned hz = round_jiffies_relative(HZ * delay);
> > > > > @@ -4616,6 +4616,7 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> > > > >  	mdsc->request_tree = RB_ROOT;
> > > > >  	INIT_DELAYED_WORK(&mdsc->delayed_work, delayed_work);
> > > > >  	mdsc->last_renew_caps = jiffies;
> > > > > +	mdsc->num_cap_delay = 0;
> > > > >  	INIT_LIST_HEAD(&mdsc->cap_delay_list);
> > > > >  	INIT_LIST_HEAD(&mdsc->cap_wait_list);
> > > > >  	spin_lock_init(&mdsc->cap_delay_lock);
> > > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > > index a7af09257382..b4289b8d23ec 100644
> > > > > --- a/fs/ceph/mds_client.h
> > > > > +++ b/fs/ceph/mds_client.h
> > > > > @@ -423,6 +423,7 @@ struct ceph_mds_client {
> > > > >  	struct rb_root         request_tree;  /* pending mds requests */
> > > > >  	struct delayed_work    delayed_work;  /* delayed work */
> > > > >  	unsigned long    last_renew_caps;  /* last time we renewed our caps */
> > > > > +	unsigned long    num_cap_delay;    /* caps in the cap_delay_list */
> > > > >  	struct list_head cap_delay_list;   /* caps with delayed release */
> > > > >  	spinlock_t       cap_delay_lock;   /* protects cap_delay_list */
> > > > >  	struct list_head snap_flush_list;  /* cap_snaps ready to flush */
> > > > > @@ -568,6 +569,8 @@ extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
> > > > >  			  struct ceph_mds_session *session,
> > > > >  			  int max_caps);
> > > > >
> > > > > +extern void schedule_delayed(struct ceph_mds_client *mdsc);
> > > > > +
> > > > >  static inline int ceph_wait_on_async_create(struct inode *inode)
> > > > >  {
> > > > >  	struct ceph_inode_info *ci = ceph_inode(inode);
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>