Hi Olga,

----- Original Message -----
> From: "Olga Kornievskaia" <olga.kornievskaia@xxxxxxxxx>
> To: "trondmy" <trondmy@xxxxxxxxxxxxxxx>
> Cc: "Anna Schumaker" <anna.schumaker@xxxxxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Sent: Tuesday, 31 May, 2022 18:03:34
> Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

> On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
>> > From: Olga Kornievskaia <kolga@xxxxxxxxxx>
>> >
>> > In recent pNFS testing we've encountered an IO thread starvation
>> > problem while the server returns the LAYOUTUNAVAILABLE error to the
>> > client. When that happens, each IO request tries to get a new
>> > layout, and the pnfs_update_layout() code ensures that only one
>> > LAYOUTGET RPC is outstanding; the rest wait. When the thread that
>> > performs the LAYOUTGET wakes up the waiters, only one gets to run,
>> > and it tends to be the one added to the waiting queue last. After
>> > receiving the LAYOUTUNAVAILABLE error, the client falls back to MDS
>> > writes, and as those writes complete and new writes are issued, the
>> > new requests are added as waiters and get to run first, so the
>> > waiter that was put on the queue earliest never gets to run until
>> > the LAYOUTUNAVAILABLE condition resolves itself on the server.
>> >
>> > With the current code, if N IOs arrive asking for a layout, then N
>> > serial LAYOUTGETs will follow, each getting its own
>> > LAYOUTUNAVAILABLE error. Instead, this patch proposes to apply the
>> > error condition to ALL the waiters for the outstanding LAYOUTGET.
>> > Once the error is received, the code lets all existing N IOs fall
>> > back to the MDS, while any newly arriving IOs are queued up and one
>> > of them triggers a new LAYOUTGET.
>> >
>> > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
>> > ---
>> >  fs/nfs/pnfs.c | 14 +++++++++++++-
>> >  fs/nfs/pnfs.h |  2 ++
>> >  2 files changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> > index 68a87be3e6f9..5b7a679e32c8 100644
>> > --- a/fs/nfs/pnfs.c
>> > +++ b/fs/nfs/pnfs.c
>> > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
>> >  	if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
>> >  	    atomic_read(&lo->plh_outstanding) != 0) {
>> >  		spin_unlock(&ino->i_lock);
>> > +		atomic_inc(&lo->plh_waiting);
>> >  		lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
>> >  					!atomic_read(&lo->plh_outstanding)));
>> > -		if (IS_ERR(lseg))
>> > +		if (IS_ERR(lseg)) {
>> > +			atomic_dec(&lo->plh_waiting);
>> >  			goto out_put_layout_hdr;
>> > +		}
>> > +		if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
>> > +			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > +			lseg = NULL;
>> > +			if (atomic_dec_and_test(&lo->plh_waiting))
>> > +				clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> > +			goto out_put_layout_hdr;
>> > +		}
>> >  		pnfs_put_layout_hdr(lo);
>> >  		goto lookup_again;
>> >  	}
>> > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
>> >  	case -ERECALLCONFLICT:
>> >  	case -EAGAIN:
>> >  		break;
>> > +	case -ENODATA:
>> > +		set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> >  	default:
>> >  		if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>> >  			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> > index 07f11489e4e9..5c07da32320b 100644
>> > --- a/fs/nfs/pnfs.h
>> > +++ b/fs/nfs/pnfs.h
>> > @@ -105,6 +105,7 @@ enum {
>> >  	NFS_LAYOUT_FIRST_LAYOUTGET,	/* Serialize first layoutget */
>> >  	NFS_LAYOUT_INODE_FREEING,	/* The inode is being freed */
>> >  	NFS_LAYOUT_HASHED,		/* The layout visible */
>> > +	NFS_LAYOUT_DRAIN,
>> >  };
>> >
>> >  enum layoutdriver_policy_flags {
>> > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
>> >  struct pnfs_layout_hdr {
>> >  	refcount_t		plh_refcount;
>> >  	atomic_t		plh_outstanding; /* number of RPCs out */
>> > +	atomic_t		plh_waiting;
>> >  	struct list_head	plh_layouts;	/* other client layouts */
>> >  	struct list_head	plh_bulk_destroy;
>> >  	struct list_head	plh_segs;	/* layout segments list */
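To make the change easier to follow, here is the waiter path above
rewritten as a plain (non-diff) sketch. This is a simplified paraphrase
of the patch, not the in-tree fs/nfs/pnfs.c code: the function name is
invented, the locking, refcounting and the lookup_again retry loop
around it are elided, and for clarity the sketch balances plh_waiting
on every exit path.

/*
 * Simplified sketch of the waiter logic the patch adds to
 * pnfs_update_layout().  NOT the in-tree code: pnfs_wait_layoutget()
 * is an invented name.  A NULL return means "no layout, send the I/O
 * through the MDS"; -EAGAIN means "retry the layout lookup".
 */
static struct pnfs_layout_segment *
pnfs_wait_layoutget(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
{
	long err;

	/* Another thread has a LAYOUTGET in flight: register as a waiter. */
	atomic_inc(&lo->plh_waiting);
	err = wait_var_event_killable(&lo->plh_outstanding,
				      !atomic_read(&lo->plh_outstanding));
	if (err) {
		atomic_dec(&lo->plh_waiting);
		return ERR_PTR(err);	/* fatal signal while waiting */
	}

	/*
	 * The outstanding LAYOUTGET failed with LAYOUTUNAVAILABLE
	 * (-ENODATA) and set NFS_LAYOUT_DRAIN.  Every thread that was
	 * already waiting falls back to the MDS instead of sending its
	 * own doomed LAYOUTGET; the last waiter out clears the flag so
	 * that newly arriving I/O may try a fresh LAYOUTGET.
	 */
	if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
		pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
		if (atomic_dec_and_test(&lo->plh_waiting))
			clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
		return NULL;		/* NULL lseg => MDS I/O */
	}

	atomic_dec(&lo->plh_waiting);
	return ERR_PTR(-EAGAIN);	/* take the lookup_again path */
}

Note that -ENODATA is how the client maps NFS4ERR_LAYOUTUNAVAILABLE,
which is why the second pnfs.c hunk keys NFS_LAYOUT_DRAIN off that
value and then falls through into the existing default case.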
>>
>> According to the spec, the correct behaviour for handling
>> NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
>> and to fall back to doing I/O through the MDS. The error describes a
>> more or less permanent state of the server being unable to hand out a
>> layout for this file.
>> If the server wanted the clients to retry after a delay, it should be
>> returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.
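The two errors therefore call for two different client-side policies.
A rough, hypothetical sketch of the distinction (NOT the current client
code; the helper name is invented, although pnfs_layout_set_fail_bit()
and the per-iomode fail bits do exist in fs/nfs/pnfs.c):

/*
 * Hypothetical sketch only: layoutget_apply_server_hint() is an
 * invented name, used here just to illustrate the policy split.
 */
static void layoutget_apply_server_hint(struct pnfs_layout_hdr *lo,
					enum pnfs_iomode iomode, int error)
{
	switch (error) {
	case -ENODATA:	/* NFS4ERR_LAYOUTUNAVAILABLE */
		/*
		 * More or less permanent: mark this iomode as failed so
		 * pnfs_update_layout() stops issuing LAYOUTGETs and
		 * routes I/O through the MDS.  (The in-tree fail bits
		 * expire after PNFS_LAYOUTGET_RETRY_TIMEOUT, so
		 * "permanent" in practice means "until re-probed".)
		 */
		pnfs_layout_set_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
		break;
	default:
		/*
		 * NFS4ERR_LAYOUTTRYLATER and other transient errors:
		 * fall back to the MDS for this request only, leaving
		 * the layout usable so a later request may retry.
		 */
		break;
	}
}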
>
> To clarify, can you confirm whether LAYOUTUNAVAILABLE turns off pNFS
> for the inode permanently, or only for a period of time?
>
> It looks to me that for LAYOUTTRYLATER, the client would face the
> same starvation problem in the same situation. I don't see anything
> marking the segment failed for such an error? I believe the client
> returns no layout for that error and falls back to the MDS, but keeps
> allowing layout requests for a period of time, having again the queue
> of waiters

Your assumption doesn't match our observation. For files that are
off-line (DS down, or the file is on tape) we return LAYOUTTRYLATER.
Usually, the client keeps retrying LAYOUTGET until the file is
available again. We use the flexfile layout, as nfs4_file has less
predictable behavior.

For files that should be served by the MDS we return LAYOUTUNAVAILABLE.
Typically, those files are quite small and are served with a single
READ request, so I haven't observed repetitive LAYOUTGET calls.

Best regards,
Tigran.

> that gets manipulated in such a way that it favors the last added.
>
>
>> Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
>> just plain wrong: we just return no layout, and then let the next
>> caller of pnfs_update_layout() immediately try again.
>>
>> My problem with this patch is that it just falls back to doing I/O
>> through the MDS for the writes that are already queued in
>> pnfs_update_layout(). It perpetuates the current bad behaviour of
>> unnecessarily pounding the server with LAYOUTGET requests that are
>> going to fail with the exact same error.
>>
>> I'd therefore prefer to see us fix the real bug (i.e. the handling of
>> NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating the
>> issues with the queuing. I already have 2 patches to deal with this.
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> trond.myklebust@xxxxxxxxxxxxxxx
>>
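For completeness, the server-side policy Tigran describes amounts to a
small decision table. A hypothetical sketch in C (dCache itself is
implemented in Java; every name below is invented for illustration):

#include <stdbool.h>

/* Invented, minimal stand-in for the server's per-file metadata. */
struct file_state {
	bool served_by_mds;	/* small file, cheaper to serve via the MDS */
	bool offline;		/* data server down, or file is on tape */
};

enum layoutget_result {
	LAYOUT_GRANTED,		/* hand out a (flexfile) layout */
	LAYOUT_TRY_LATER,	/* NFS4ERR_LAYOUTTRYLATER: client retries */
	LAYOUT_UNAVAILABLE,	/* NFS4ERR_LAYOUTUNAVAILABLE: use the MDS */
};

static enum layoutget_result layoutget_policy(const struct file_state *f)
{
	if (f->served_by_mds)
		return LAYOUT_UNAVAILABLE;	/* MDS-served for good */
	if (f->offline)
		return LAYOUT_TRY_LATER;	/* transient condition */
	return LAYOUT_GRANTED;
}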