Re: handle_async_copy calling kzalloc under spinlock

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Fri, 16 Nov 2018 14:58:44 -0500

On Fri, Nov 16, 2018 at 02:49:00PM -0500, Olga Kornievskaia wrote:
> On Fri, Nov 16, 2018 at 2:30 PM J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> >
> > On Fri, Nov 16, 2018 at 01:52:29PM -0500, Olga Kornievskaia wrote:
> > > On Fri, Nov 16, 2018 at 1:30 PM Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> > > > Then how does the copy knows not to go wait for the callback? Copy
> > > > checks the pending_callback list to see if received a callback. If
> > > > not, it puts itself on the copy list and goes to sleep. The callback,
> > > > checks the copy list and if it finds a copy signals it, if not it puts
> > > > itself on the pending_callback list. a lock is held over checking one
> > > > list and putting yourself on the other.
> >
> > OK, apologies, I don't really understand those data structures yet, but
> > something seems wrong to me.
> >
> > Under what circumstances could we recieve a CB_OFFLOAD without having
> > started the corresponding copy already?
> 
> It can receive a CB_OFFLOAD before it receives a reply to the COPY.
> It's possible and I can trigger it during testing when doing a really
> short copy. The copy is done and callback thread sends a reply.
> CB_OFFLOAD call and COPY reply can be switched on the server or on the
> processing on the client.

That race is discussed in
https://tools.ietf.org/html/rfc5661#section-2.10.6.3 and is supposed to
be dealt with by using referring triples and/or returning DELAY.

> > And shouldn't CB_OFFLOAD be returning bad_stateid in the case it doesn't
> > recognize the given stateid?
> 
> It could but what should the server do in this case. I would imagine
> it wouldn't do anything. There is nothing it can do. So now we have a
> copy that send the call and is going to wait on the reply which will
> never come as the 1st one came and we rejected it and now copy will
> wait forever.
> 
> Please describe what "is wrong" with the current implementation.  I
> believe it provide a reasonable solution to the race condition.

Looks like a server that sends bad stateids in callbacks could cause you
to allocate something that will never get freed.

--b.

> > It looks like the allocation failure is
> > the *only* way we'll return an error on CB_OFFLOAD, and that seems
> > wrong.
> 
> Yes it is the only error we currently return. Do you see any other
> useful errors that a client should return (and would be useful to
> handle on the server). I don't see any need for any more
> complications.
> 
> > > > > I also wonder if SERVERFAULT is really the best error for a memory
> > > > > allocation failure there.
> > > >
> > > > I guess EIO or ENOMEM might be better. But I don't think this error
> > > > gets returned anywhere to the main process.
> > > >
> > >
> > > Wait. It is returning SERVERFAULT because it's the callback server replying
> > > back to the server's CB_RECALL call and I believe SERVERFAULT is the
> > > appropriate error here. NFS doesn't have ENOMEM error.
> >
> > We could return DELAY if we think it might be worth the server trying
> > the CB_RECALL again.  (That's what nfsd usually returns on allocation
> > failures.  I don't know if that's really ideal.)
> 
> If the client had any smarts to say correct this error that would be
> useful to return but this is not the case. I don't believe there is a