Re: [PATCH] fuse: add optional kernel-enforced timeout for fuse requests

Josef Bacik <josef@xxxxxxxxxxxxxx> · Fri, 19 Jul 2024 09:06:49 -0400

On Thu, Jul 18, 2024 at 05:37:06PM -0700, Joanne Koong wrote:
> On Thu, Jul 18, 2024 at 3:11 PM Bernd Schubert
> <bs_lists@xxxxxxxxxxxxxxxxx> wrote:
> >
> >
> >
> > On 7/18/24 07:24, Joanne Koong wrote:
> > > On Wed, Jul 17, 2024 at 3:23 PM Bernd Schubert
> > > <bernd.schubert@xxxxxxxxxxx> wrote:
> > >>
> > >> Hi Joanne,
> > >>
> > >> On 7/17/24 23:34, Joanne Koong wrote:
> > >>> There are situations where fuse servers can become unresponsive or take
> > >>> too long to reply to a request. Currently there is no upper bound on
> > >>> how long a request may take, which may be frustrating to users who get
> > >>> stuck waiting for a request to complete.
> > >>>
> > >>> This commit adds a daemon timeout option (in seconds) for fuse requests.
> > >>> If the timeout elapses before the request is replied to, the request will
> > >>> fail with -ETIME.
> > >>>
> > >>> There are 3 possibilities for a request that times out:
> > >>> a) The request times out before the request has been sent to userspace
> > >>> b) The request times out after the request has been sent to userspace
> > >>> and before it receives a reply from the server
> > >>> c) The request times out after the request has been sent to userspace
> > >>> and the server replies while the kernel is timing out the request
> > >>>
> > >>> Proper synchronization must be added to ensure that the request is
> > >>> handled correctly in all of these cases. To this effect, there is a new
> > >>> FR_PROCESSING bit added to the request flags, which is set atomically by
> > >>> either the timeout handler (see fuse_request_timeout()) which is invoked
> > >>> after the request timeout elapses or set by the request reply handler
> > >>> (see dev_do_write()), whichever gets there first.
> > >>>
> > >>> If the reply handler and the timeout handler are executing simultaneously
> > >>> and the reply handler sets FR_PROCESSING before the timeout handler, then
> > >>> the request is re-queued onto the waitqueue and the kernel will process the
> > >>> reply as though the timeout did not elapse. If the timeout handler sets
> > >>> FR_PROCESSING before the reply handler, then the request will fail with
> > >>> -ETIME and the request will be cleaned up.
> > >>>
> > >>> Proper acquires on the request reference must be added to ensure that the
> > >>> timeout handler does not drop the last refcount on the request while the
> > >>> reply handler (dev_do_write()) or forwarder handler (dev_do_read()) is
> > >>> still accessing the request. (By "forwarder handler", this is the handler
> > >>> that forwards the request to userspace).
> > >>>
> > >>> Currently, this is the lifecycle of the request refcount:
> > >>>
> > >>> Request is created:
> > >>> fuse_simple_request -> allocates request, sets refcount to 1
> > >>>   __fuse_request_send -> acquires refcount
> > >>>     queues request and waits for reply...
> > >>> fuse_simple_request -> drops refcount
> > >>>
> > >>> Request is freed:
> > >>> fuse_dev_do_write
> > >>>   fuse_request_end -> drops refcount on request
> > >>>
> > >>> The timeout handler drops the refcount on the request so that the
> > >>> request is properly cleaned up if a reply is never received. Because of
> > >>> this, both the forwarder handler and the reply handler must acquire a refcount
> > >>> on the request while it accesses the request, and the refcount must be
> > >>> acquired while the lock of the list the request is on is held.
> > >>>
> > >>> There is a potential race if the request is being forwarded to
> > >>> userspace while the timeout handler is executing (eg FR_PENDING has
> > >>> already been cleared but dev_do_read() hasn't finished executing). This
> > >>> is a problem because this would free the request but the request has not
> > >>> been removed from the fpq list it's on. To prevent this, dev_do_read()
> > >>> must check FR_PROCESSING at the end of its logic and remove the request
> > >>> from the fpq list if the timeout occurred.
> > >>>
> > >>> There is also the case where the connection may be aborted or the
> > >>> device may be released while the timeout handler is running. To protect
> > >>> against an extra refcount drop on the request, the timeout handler
> > >>> checks the connected state of the list and lets the abort handler drop the
> > >>> last reference if the abort is running simultaneously. Similarly, the
> > >>> timeout handler also needs to check if the req->out.h.error is set to
> > >>> -ESTALE, which indicates that the device release is cleaning up the
> > >>> request. In both these cases, the timeout handler will return without
> > >>> dropping the refcount.
> > >>>
> > >>> Please also note that background requests are not applicable for timeouts
> > >>> since they are asynchronous.
> > >>
> > >>
> > >> This and that thread here actually make me wonder if this is the right
> > >> approach
> > >>
> > >> https://lore.kernel.org/lkml/20240613040147.329220-1-haifeng.xu@xxxxxxxxxx/T/
> > >>
> > >>
> > >> In  th3 thread above a request got interrupted, but fuse-server still
> > >> does not manage stop it. From my point of view, interrupting a request
> > >> suggests to add a rather short kernel lifetime for it. With that one
> > >
> > > Hi Bernd,
> > >
> > > I believe this solution fixes the problem outlined in that thread
> > > (namely, that the process gets stuck waiting for a reply). If the
> > > request is interrupted before it times out, the kernel will wait with
> > > a timeout again on the request (timeout would start over, but the
> > > request will still eventually sooner or later time out). I'm not sure
> > > I agree that we want to cancel the request altogether if it's
> > > interrupted. For example, if the user uses the user-defined signal
> > > SIGUSR1, it can be unexpected and arbitrary behavior for the request
> > > to be aborted by the kernel. I also don't think this can be consistent
> > > for what the fuse server will see since some requests may have already
> > > been forwarded to userspace when the request is aborted and some
> > > requests may not have.
> > >
> > > I think if we were to enforce that the request should be aborted when
> > > it's interrupted regardless of whether a timeout is specified or not,
> > > then we should do it similarly to how the timeout handler logic
> > > handles it in this patch,rather than the implementation in the thread
> > > linked above (namely, that the request should be explicitly cleaned up
> > > immediately instead of when the interrupt request sends a reply); I
> > > don't believe the implementation in the link handles the case where
> > > for example the fuse server is in a deadlock and does not reply to the
> > > interrupt request. Also, as I understand it, it is optional for
> > > servers to reply or not to the interrupt request.
> >
> > Hi Joanne,
> >
> > yeah, the solution in the link above is definitely not ideal and I think
> > a timout based solution would be better. But I think your patch wouldn't
> > work either right now, unless server side sets a request timeout.
> > Btw, I would rename the variable 'daemon_timeout' to somethink like
> > req_timeout.
> >
> Hi Bernd,
> 
> I think we need to figure out if we indeed want the kernel to abort
> interrupted requests if no request timeout was explicitly set by the
> server. I'm leaning towards no, for the reasons in my previous reply;
> in addition to that I'm also not sure if we would be potentially
> breaking existing filesystems if we introduced this new behavior.
> Curious to hear your and others' thoughts on this.
> 
> (Btw, if we did want to add this in, i think the change would be
> actually pretty simple. We could pretty much just reuse all the logic
> that's implemented in the timeout handling code. It's very much the
> same scenario (request getting aborted and needing to protect against
> races with different handlers))
> 
> I will rename daemon_timeout to req_timeout in v2. Thanks for the suggestion.
> 
> > >
> > >> either needs to wake up in intervals and check if request timeout got
> > >> exceeded or it needs to be an async kernel thread. I think that async
> > >> thread would also allow to give a timeout to background requests.
> > >
> > > in my opinion, background requests do not need timeouts. As I
> > > understand it, background requests are used only for direct i/o async
> > > read/writes, writing back dirty pages,and readahead requests generated
> > > by the kernel. I don't think fuse servers would have a need for timing
> > > out background requests.
> >
> > There is another discussion here, where timeouts are a possible although
> > ugly solution to avoid page copies
> >
> > https://lore.kernel.org/linux-kernel/233a9fdf-13ea-488b-a593-5566fc9f5d92@xxxxxxxxxxx/T/
> >
> Thanks for the link, it's an interesting read.
> 
> >
> > That is the bg writeback code path.
> >
> > >
> > >>
> > >> Or we add an async timeout to bg and interupted requests additionally?
> > >
> > > The interrupted request will already have a timeout on it since it
> > > waits with a timeout again for the reply after it's interrupted.
> >
> > If daemon side configures timeouts. And interrupted requests might want
> > to have a different timeout. I will check when I'm back if we can update
> > your patch a bit for that.
> >
> > Your patch hooks in quite nicely and basically without overhead into fg
> > (sync) requests. Timing out bg requests will have a bit overhead (unless
> > I miss something), so maybe we need two solutions here. And if we want
> > to go that route at all, to avoid these extra fuse page copies.
> >
> Agreed, I think if we do decide to go down this route, it seems
> cleaner to me to have the background request timeouts handled
> separately. Maybe something like having a timer per batch (where
> "batch" is the group of requests that get flushed at the same time)?
> That seems to me like the approach with the least overhead.
> 

I don't want to have a bunch of different timeouts, we should just have one and
have consistent behavior across all classes of requests.

I think the only thing we should have that is "separate" is a way to set request
timeouts that aren't set by the daemon itself.  Administrators should be able to
set a per-mount timeout via sysfs/algo in order to have some sort of control
over possibly malicious FUSE file systems.

But that should just tie into whatever mechanism you come up with, and
everything should all behave consistently with that timeout.  Thanks,

Josef