On Wed, 2021-06-30 at 09:26 +0800, Xiubo Li wrote:
> On 6/29/21 9:25 PM, Jeff Layton wrote:
> > On Tue, 2021-06-29 at 12:42 +0800, xiubli@xxxxxxxxxx wrote:
> > > From: Xiubo Li <xiubli@xxxxxxxxxx>
> > >
> > > For client requests that will get both unsafe and safe replies from
> > > the MDS daemons, the MDS side won't flush the mdlog (journal log)
> > > immediately, because it considers that unnecessary. That's true for
> > > most cases but not all, e.g. an fsync request, which must wait until
> > > all the unsafely-replied requests have been safely replied to.
> > >
> > > Normally, if multiple threads or clients are running, the whole
> > > mdlog in the MDS daemons gets flushed in time, because some other
> > > request will trigger the mdlog submit thread. So normal operations
> > > usually don't get stuck for long. But when only one client with a
> > > single thread is running, the stall can be quite noticeable, and in
> > > the worst case it must wait up to 5 seconds for the mdlog to be
> > > flushed by the MDS's periodic tick thread.
> > >
> > > This patch triggers a manual mdlog flush on all the MDSes just
> > > before waiting for the unsafe requests to finish.
> > >
> > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > > ---
> > >  fs/ceph/caps.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index c6a3352a4d52..6e80e4649c7a 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> > > @@ -2286,6 +2286,7 @@ static int caps_are_flushed(struct inode *inode, u64 flush_tid)
> > >   */
> > >  static int unsafe_request_wait(struct inode *inode)
> > >  {
> > > +	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
> > >  	struct ceph_inode_info *ci = ceph_inode(inode);
> > >  	struct ceph_mds_request *req1 = NULL, *req2 = NULL;
> > >  	int ret, err = 0;
> > > @@ -2305,6 +2306,14 @@ static int unsafe_request_wait(struct inode *inode)
> > >  	}
> > >  	spin_unlock(&ci->i_unsafe_lock);
> > > 
> > > +	/*
> > > +	 * Trigger to flush the journal logs in all the MDSes manually,
> > > +	 * or in the worst case we must wait at most 5 seconds to wait
> > > +	 * the journal logs to be flushed by the MDSes periodically.
> > > +	 */
> > > +	if (req1 || req2)
> > > +		flush_mdlog(mdsc);
> > > +
> >
> > So this is called on fsync(). Do we really need to flush all of the mds
> > logs on every fsync? That sounds like it might have some performance
> > impact. Would it be possible to just flush the mdlog on the MDS that's
> > authoritative for this inode?
>
> I hit one case before: mds.0 was the auth MDS, but the client sent the
> request to mds.2, and when mds.2 tried to gather the rdlocks it got
> stuck waiting for mds.0 to flush its mdlog. I think the same thing can
> also happen to mds.0 itself, even though it's the auth MDS.
>

It sounds like mds.0 should flush its own mdlog in this situation, once
mds.2 started requesting locks that mds.0 was holding. Shouldn't it?

> Normally the mdlog submit thread is triggered by each MDS's tick, which
> is every 5 seconds. But in practice it usually runs sooner than that,
> because any other client request can trigger the mdlog submit thread at
> any time. Since fsync isn't running all the time, IMO the performance
> impact should be okay.
>

I'm not sure I'm convinced. Consider a situation where we have a
large(ish) ceph cluster with several MDSs.
One client is writing to a file that is on mds.0 and there is little
other activity there. Several other clients are doing heavy I/O on other
inodes (of which mds.1 is auth). The first client then calls fsync, and
now the other clients stall for a bit while mds.1 unnecessarily flushes
its mdlog. I think we need to take care to only flush the mdlog for the
MDSs that we care about here.

> >
> > >  	dout("unsafe_request_wait %p wait on tid %llu %llu\n",
> > >  	     inode, req1 ? req1->r_tid : 0ULL, req2 ? req2->r_tid : 0ULL);
> > >  	if (req1) {
>

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
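
[For reference, a rough sketch of the direction Jeff is suggesting:
flush the mdlog only on the sessions that actually hold unsafe requests
for this inode, rather than broadcasting to every MDS. This is not the
posted patch; the flush_mdlog_for_unsafe_reqs() wrapper and the
per-session send_flush_mdlog() helper are assumptions here (the series'
flush_mdlog(mdsc) flushes every session), and the code would sit next to
unsafe_request_wait() in fs/ceph/caps.c.]

/*
 * Sketch only: flush the mdlog on just the MDS sessions that own this
 * inode's unsafe requests, instead of on every MDS.  Assumes a
 * per-session send_flush_mdlog(session) helper; a real version would
 * also need to cope with mdsc->max_sessions growing under us.
 */
static void flush_mdlog_for_unsafe_reqs(struct ceph_mds_client *mdsc,
					struct inode *inode)
{
	struct ceph_inode_info *ci = ceph_inode(inode);
	struct ceph_mds_session **sessions;
	struct ceph_mds_request *req;
	int i, max = mdsc->max_sessions;

	/* one slot per MDS rank, so each session is only flushed once */
	sessions = kcalloc(max, sizeof(*sessions), GFP_KERNEL);
	if (!sessions)
		return; /* fall back to the MDS's periodic tick flush */

	spin_lock(&ci->i_unsafe_lock);
	list_for_each_entry(req, &ci->i_unsafe_dirops, r_unsafe_dir_item) {
		struct ceph_mds_session *s = req->r_session;

		if (s && s->s_mds < max && !sessions[s->s_mds])
			sessions[s->s_mds] = ceph_get_mds_session(s);
	}
	list_for_each_entry(req, &ci->i_unsafe_iops, r_unsafe_target_item) {
		struct ceph_mds_session *s = req->r_session;

		if (s && s->s_mds < max && !sessions[s->s_mds])
			sessions[s->s_mds] = ceph_get_mds_session(s);
	}
	spin_unlock(&ci->i_unsafe_lock);

	/* ask only these MDSes to flush their journal logs */
	for (i = 0; i < max; i++) {
		if (sessions[i]) {
			send_flush_mdlog(sessions[i]);
			ceph_put_mds_session(sessions[i]);
		}
	}
	kfree(sessions);
}

[unsafe_request_wait() would then call flush_mdlog_for_unsafe_reqs(mdsc,
inode) in place of flush_mdlog(mdsc) when req1 or req2 is set.]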