Hi Olga,

----- Original Message -----
> From: "Olga Kornievskaia" <olga.kornievskaia@xxxxxxxxx>
> To: "trondmy" <trondmy@xxxxxxxxxxxxxxx>
> Cc: "Anna Schumaker" <anna.schumaker@xxxxxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Sent: Tuesday, 31 May, 2022 18:03:34
> Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

> On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
>> > From: Olga Kornievskaia <kolga@xxxxxxxxxx>
>> >
>> > In recent pNFS testing we've encountered an IO thread starvation
>> > problem while the server returns the LAYOUTUNAVAILABLE error to the
>> > client. When that happens, each IO request tries to get a new
>> > layout, and the pnfs_update_layout() code ensures that only one
>> > LAYOUTGET RPC is outstanding; the rest wait. When the thread that
>> > performs the LAYOUTGET wakes up the waiters, only one gets to run,
>> > and it tends to be the one added to the waiting queue last. After
>> > receiving the LAYOUTUNAVAILABLE error, the client falls back to MDS
>> > writes, and as those writes complete and new writes are issued, the
>> > new requests are added as waiters and get to run first, so the
>> > waiter that was put on the queue earliest never gets to run until
>> > the LAYOUTUNAVAILABLE condition resolves itself on the server.
>> >
>> > With the current code, if N IOs arrive asking for a layout, then N
>> > serial LAYOUTGETs will follow, each getting its own
>> > LAYOUTUNAVAILABLE error. Instead, this patch proposes to apply the
>> > error condition to ALL the waiters for the outstanding LAYOUTGET.
>> > Once the error is received, the code lets all existing N IOs fall
>> > back to the MDS, while any newly arriving IOs are queued up and one
>> > of them triggers a new LAYOUTGET.
>> >
>> > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
>> > ---
>> >  fs/nfs/pnfs.c | 14 +++++++++++++-
>> >  fs/nfs/pnfs.h |  2 ++
>> >  2 files changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> > index 68a87be3e6f9..5b7a679e32c8 100644
>> > --- a/fs/nfs/pnfs.c
>> > +++ b/fs/nfs/pnfs.c
>> > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
>> >  	if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
>> >  	    atomic_read(&lo->plh_outstanding) != 0) {
>> >  		spin_unlock(&ino->i_lock);
>> > +		atomic_inc(&lo->plh_waiting);
>> >  		lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
>> >  					!atomic_read(&lo->plh_outstanding)));
>> > -		if (IS_ERR(lseg))
>> > +		if (IS_ERR(lseg)) {
>> > +			atomic_dec(&lo->plh_waiting);
>> >  			goto out_put_layout_hdr;
>> > +		}
>> > +		if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
>> > +			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > +			lseg = NULL;
>> > +			if (atomic_dec_and_test(&lo->plh_waiting))
>> > +				clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> > +			goto out_put_layout_hdr;
>> > +		}
>> >  		pnfs_put_layout_hdr(lo);
>> >  		goto lookup_again;
>> >  	}
>> > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
>> >  	case -ERECALLCONFLICT:
>> >  	case -EAGAIN:
>> >  		break;
>> > +	case -ENODATA:
>> > +		set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> >  	default:
>> >  		if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>> >  			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> > index 07f11489e4e9..5c07da32320b 100644
>> > --- a/fs/nfs/pnfs.h
>> > +++ b/fs/nfs/pnfs.h
>> > @@ -105,6 +105,7 @@ enum {
>> >  	NFS_LAYOUT_FIRST_LAYOUTGET,	/* Serialize first layoutget */
>> >  	NFS_LAYOUT_INODE_FREEING,	/* The inode is being freed */
>> >  	NFS_LAYOUT_HASHED,		/* The layout visible */
>> > +	NFS_LAYOUT_DRAIN,
>> >  };
>> >
>> >  enum layoutdriver_policy_flags {
>> > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
>> >  struct pnfs_layout_hdr {
>> >  	refcount_t		plh_refcount;
>> >  	atomic_t		plh_outstanding; /* number of RPCs out */
>> > +	atomic_t		plh_waiting;
>> >  	struct list_head	plh_layouts;	/* other client layouts */
>> >  	struct list_head	plh_bulk_destroy;
>> >  	struct list_head	plh_segs;	/* layout segments list */
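To make the change easier to follow, here is the waiter path above
rewritten as a plain (non-diff) sketch. This is a simplified paraphrase
of the patch, not the in-tree fs/nfs/pnfs.c code: the function name is
invented, the locking, refcounting and the lookup_again retry loop
around it are elided, and for clarity the sketch balances plh_waiting
on every exit path.

/*
 * Simplified sketch of the waiter logic the patch adds to
 * pnfs_update_layout().  NOT the in-tree code: pnfs_wait_layoutget()
 * is an invented name.  A NULL return means "no layout, send the I/O
 * through the MDS"; -EAGAIN means "retry the layout lookup".
 */
static struct pnfs_layout_segment *
pnfs_wait_layoutget(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
{
	long err;

	/* Another thread has a LAYOUTGET in flight: register as a waiter. */
	atomic_inc(&lo->plh_waiting);
	err = wait_var_event_killable(&lo->plh_outstanding,
				      !atomic_read(&lo->plh_outstanding));
	if (err) {
		atomic_dec(&lo->plh_waiting);
		return ERR_PTR(err);	/* fatal signal while waiting */
	}

	/*
	 * The outstanding LAYOUTGET failed with LAYOUTUNAVAILABLE
	 * (-ENODATA) and set NFS_LAYOUT_DRAIN.  Every thread that was
	 * already waiting falls back to the MDS instead of sending its
	 * own doomed LAYOUTGET; the last waiter out clears the flag so
	 * that newly arriving I/O may try a fresh LAYOUTGET.
	 */
	if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
		pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
		if (atomic_dec_and_test(&lo->plh_waiting))
			clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
		return NULL;		/* NULL lseg => MDS I/O */
	}

	atomic_dec(&lo->plh_waiting);
	return ERR_PTR(-EAGAIN);	/* take the lookup_again path */
}

Note that -ENODATA is how the client maps NFS4ERR_LAYOUTUNAVAILABLE,
which is why the second pnfs.c hunk keys NFS_LAYOUT_DRAIN off that
value and then falls through into the existing default case.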
>>
>> According to the spec, the correct behaviour for handling
>> NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
>> and to fall back to doing I/O through the MDS. The error describes a
>> more or less permanent state of the server being unable to hand out a
>> layout for this file.
>> If the server wanted the clients to retry after a delay, it should be
>> returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.
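The two errors therefore call for two different client-side policies.
A rough, hypothetical sketch of the distinction (NOT the current client
code; the helper name is invented, although pnfs_layout_set_fail_bit()
and the per-iomode fail bits do exist in fs/nfs/pnfs.c):

/*
 * Hypothetical sketch only: layoutget_apply_server_hint() is an
 * invented name, used here just to illustrate the policy split.
 */
static void layoutget_apply_server_hint(struct pnfs_layout_hdr *lo,
					enum pnfs_iomode iomode, int error)
{
	switch (error) {
	case -ENODATA:	/* NFS4ERR_LAYOUTUNAVAILABLE */
		/*
		 * More or less permanent: mark this iomode as failed so
		 * pnfs_update_layout() stops issuing LAYOUTGETs and
		 * routes I/O through the MDS.  (The in-tree fail bits
		 * expire after PNFS_LAYOUTGET_RETRY_TIMEOUT, so
		 * "permanent" in practice means "until re-probed".)
		 */
		pnfs_layout_set_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
		break;
	default:
		/*
		 * NFS4ERR_LAYOUTTRYLATER and other transient errors:
		 * fall back to the MDS for this request only, leaving
		 * the layout usable so a later request may retry.
		 */
		break;
	}
}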
>
> To clarify, can you confirm whether LAYOUTUNAVAILABLE turns off pNFS
> for the inode permanently, or only for a period of time?
>
> It looks to me that for LAYOUTTRYLATER, the client would face the
> same starvation problem in the same situation. I don't see anything
> marking the segment failed for such an error? I believe the client
> returns no layout for that error and falls back to the MDS, but keeps
> allowing layout requests for a period of time, having again the queue
> of waiters

Your assumption doesn't match our observation. For files that are
off-line (DS down, or the file is on tape) we return LAYOUTTRYLATER.
Usually, the client keeps retrying LAYOUTGET until the file is
available again. We use the flexfile layout, as nfs4_file has less
predictable behavior.

For files that should be served by the MDS we return LAYOUTUNAVAILABLE.
Typically, those files are quite small and are served with a single
READ request, so I haven't observed repetitive LAYOUTGET calls.

Best regards,
Tigran.

> that gets manipulated in such a way that it favors the last added.
>
>
>> Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
>> just plain wrong: we just return no layout, and then let the next
>> caller of pnfs_update_layout() immediately try again.
>>
>> My problem with this patch is that it just falls back to doing I/O
>> through the MDS for the writes that are already queued in
>> pnfs_update_layout(). It perpetuates the current bad behaviour of
>> unnecessarily pounding the server with LAYOUTGET requests that are
>> going to fail with the exact same error.
>>
>> I'd therefore prefer to see us fix the real bug (i.e. the handling of
>> NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating the
>> issues with the queuing. I already have 2 patches to deal with this.
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> trond.myklebust@xxxxxxxxxxxxxxx
>>
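For completeness, the server-side policy Tigran describes amounts to a
small decision table. A hypothetical sketch in C (dCache itself is
implemented in Java; every name below is invented for illustration):

#include <stdbool.h>

/* Invented, minimal stand-in for the server's per-file metadata. */
struct file_state {
	bool served_by_mds;	/* small file, cheaper to serve via the MDS */
	bool offline;		/* data server down, or file is on tape */
};

enum layoutget_result {
	LAYOUT_GRANTED,		/* hand out a (flexfile) layout */
	LAYOUT_TRY_LATER,	/* NFS4ERR_LAYOUTTRYLATER: client retries */
	LAYOUT_UNAVAILABLE,	/* NFS4ERR_LAYOUTUNAVAILABLE: use the MDS */
};

static enum layoutget_result layoutget_policy(const struct file_state *f)
{
	if (f->served_by_mds)
		return LAYOUT_UNAVAILABLE;	/* MDS-served for good */
	if (f->offline)
		return LAYOUT_TRY_LATER;	/* transient condition */
	return LAYOUT_GRANTED;
}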