Re: handle_async_copy calling kzalloc under spinlock

Olga Kornievskaia <aglo@xxxxxxxxx> · Fri, 16 Nov 2018 14:49:00 -0500

On Fri, Nov 16, 2018 at 2:30 PM J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Fri, Nov 16, 2018 at 01:52:29PM -0500, Olga Kornievskaia wrote:
> > On Fri, Nov 16, 2018 at 1:30 PM Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> > > Then how does the copy knows not to go wait for the callback? Copy
> > > checks the pending_callback list to see if received a callback. If
> > > not, it puts itself on the copy list and goes to sleep. The callback,
> > > checks the copy list and if it finds a copy signals it, if not it puts
> > > itself on the pending_callback list. a lock is held over checking one
> > > list and putting yourself on the other.
>
> OK, apologies, I don't really understand those data structures yet, but
> something seems wrong to me.
>
> Under what circumstances could we recieve a CB_OFFLOAD without having
> started the corresponding copy already?

It can receive a CB_OFFLOAD before it receives a reply to the COPY.
It's possible and I can trigger it during testing when doing a really
short copy. The copy is done and callback thread sends a reply.
CB_OFFLOAD call and COPY reply can be switched on the server or on the
processing on the client.

>
> And shouldn't CB_OFFLOAD be returning bad_stateid in the case it doesn't
> recognize the given stateid?

It could but what should the server do in this case. I would imagine
it wouldn't do anything. There is nothing it can do. So now we have a
copy that send the call and is going to wait on the reply which will
never come as the 1st one came and we rejected it and now copy will
wait forever.

Please describe what "is wrong" with the current implementation.  I
believe it provide a reasonable solution to the race condition.

> It looks like the allocation failure is
> the *only* way we'll return an error on CB_OFFLOAD, and that seems
> wrong.

Yes it is the only error we currently return. Do you see any other
useful errors that a client should return (and would be useful to
handle on the server). I don't see any need for any more
complications.

> > > > I also wonder if SERVERFAULT is really the best error for a memory
> > > > allocation failure there.
> > >
> > > I guess EIO or ENOMEM might be better. But I don't think this error
> > > gets returned anywhere to the main process.
> > >
> >
> > Wait. It is returning SERVERFAULT because it's the callback server replying
> > back to the server's CB_RECALL call and I believe SERVERFAULT is the
> > appropriate error here. NFS doesn't have ENOMEM error.
>
> We could return DELAY if we think it might be worth the server trying
> the CB_RECALL again.  (That's what nfsd usually returns on allocation
> failures.  I don't know if that's really ideal.)

If the client had any smarts to say correct this error that would be
useful to return but this is not the case. I don't believe there is a