Re: [PATCH v2 0/2] fuse: add timeout option for requests

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 31, 2024 at 7:47 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Thu, Aug 1, 2024 at 2:46 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > On Wed, Jul 31, 2024 at 10:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jul 30, 2024 at 7:14 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jul 31, 2024 at 2:16 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Jul 29, 2024 at 11:00 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Jul 30, 2024 at 8:28 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > There are situations where fuse servers can become unresponsive or take
> > > > > > > too long to reply to a request. Currently there is no upper bound on
> > > > > > > how long a request may take, which may be frustrating to users who get
> > > > > > > stuck waiting for a request to complete.
> > > > > > >
> > > > > > > This patchset adds a timeout option for requests and two dynamically
> > > > > > > configurable fuse sysctls "default_request_timeout" and "max_request_timeout"
> > > > > > > for controlling/enforcing timeout behavior system-wide.
> > > > > > >
> > > > > > > Existing fuse servers will not be affected unless they explicitly opt into the
> > > > > > > timeout.
> > > > > > >
> > > > > > > v1: https://lore.kernel.org/linux-fsdevel/20240717213458.1613347-1-joannelkoong@xxxxxxxxx/
> > > > > > > Changes from v1:
> > > > > > > - Add timeout for background requests
> > > > > > > - Handle resend race condition
> > > > > > > - Add sysctls
> > > > > > >
> > > > > > > Joanne Koong (2):
> > > > > > >   fuse: add optional kernel-enforced timeout for requests
> > > > > > >   fuse: add default_request_timeout and max_request_timeout sysctls
> > > > > > >
> > > > > > >  Documentation/admin-guide/sysctl/fs.rst |  17 +++
> > > > > > >  fs/fuse/Makefile                        |   2 +-
> > > > > > >  fs/fuse/dev.c                           | 187 +++++++++++++++++++++++-
> > > > > > >  fs/fuse/fuse_i.h                        |  30 ++++
> > > > > > >  fs/fuse/inode.c                         |  24 +++
> > > > > > >  fs/fuse/sysctl.c                        |  42 ++++++
> > > > > > >  6 files changed, 293 insertions(+), 9 deletions(-)
> > > > > > >  create mode 100644 fs/fuse/sysctl.c
> > > > > > >
> > > > > > > --
> > > > > > > 2.43.0
> > > > > > >
> > > > > >
> > > > > > Hello Joanne,
> > > > > >
> > > > > > Thanks for your update.
> > > > > >
> > > > > > I have tested your patches using my test case, which is similar to the
> > > > > > hello-fuse [0] example, with an additional change as follows:
> > > > > >
> > > > > > @@ -125,6 +125,8 @@ static int hello_read(const char *path, char *buf,
> > > > > > size_t size, off_t offset,
> > > > > >         } else
> > > > > >                 size = 0;
> > > > > >
> > > > > > +       // TO trigger timeout
> > > > > > +       sleep(60);
> > > > > >         return size;
> > > > > >  }
> > > > > >
> > > > > > [0] https://github.com/libfuse/libfuse/blob/master/example/hello.c
> > > > > >
> > > > > > However, it triggered a crash with the following setup:
> > > > > >
> > > > > > 1. Set FUSE timeout:
> > > > > >   sysctl -w fs.fuse.default_request_timeout=10
> > > > > >   sysctl -w fs.fuse.max_request_timeout = 20
> > > > > >
> > > > > > 2. Start FUSE daemon:
> > > > > >   ./hello /tmp/fuse
> > > > > >
> > > > > > 3. Read from FUSE:
> > > > > >   cat /tmp/fuse/hello
> > > > > >
> > > > > > 4. Kill the process within 10 seconds (to avoid the timeout being triggered).
> > > > > >    Then the crash will be triggered.
> > > > >
> > > > > Hi Yafang,
> > > > >
> > > > > Thanks for trying this out on your use case!
> > > > >
> > > > > How consistently are you able to repro this?
> > > >
> > > > It triggers the crash every time.
> > > >
> > > > > I tried reproing using
> > > > > your instructions above but I'm not able to get the crash.
> > > >
> > > > Please note that it is the `cat /tmp/fuse/hello` process that was
> > > > killed, not the fuse daemon.
> > > > The crash seems to occur when the fuse daemon wakes up after
> > > > sleep(60). Please ensure that the fuse daemon can be woken up.
> > > >
> > >
> > > I'm still not able to trigger the crash by killing the `cat
> > > /tmp/fuse/hello` process. This is how I'm repro-ing
> > >
> > > 1) Add sleep to test code in
> > > https://github.com/libfuse/libfuse/blob/master/example/hello.c
> > > @@ -125,6 +126,9 @@ static int hello_read(const char *path, char *buf,
> > > size_t size, off_t offset,
> > >         } else
> > >                 size = 0;
> > >
> > > +       sleep(60);
> > > +       printf("hello_read woke up from sleep\n");
> > > +
> > >         return size;
> > >  }
> > >
> > > 2)  Set fuse timeout to 10 seconds
> > > sysctl -w fs.fuse.default_request_timeout=10
> > >
> > > 3) Start fuse daemon
> > > ./example/hello ./tmp/fuse
> > >
> > > 4) Read from fuse
> > > cat /tmp/fuse/hello
> > >
> > > 5) Get pid of cat process
> > > top -b | grep cat
> > >
> > > 6) Kill cat process (within 10 seconds)
> > >  sudo kill -9 <cat-pid>
> > >
> > > 7) Wait 60 seconds for fuse's read request to complete
> > >
> > > From what it sounds like, this is exactly what you are doing as well?
> > >
> > > I added some kernel-side logs and I'm seeing that the read request is
> > > timing out after ~10 seconds and handled by the timeout handler
> > > successfully.
> > >
> > > On the fuse daemon side, these are the logs I'm seeing from the above repro:
> > > ./example/hello /tmp/fuse -f -d
> > >
> > > FUSE library version: 3.17.0
> > > nullpath_ok: 0
> > > unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
> > > INIT: 7.40
> > > flags=0x73fffffb
> > > max_readahead=0x00020000
> > >    INIT: 7.40
> > >    flags=0x4040f039
> > >    max_readahead=0x00020000
> > >    max_write=0x00100000
> > >    max_background=0
> > >    congestion_threshold=0
> > >    time_gran=1
> > >    unique: 2, success, outsize: 80
> > > unique: 4, opcode: LOOKUP (1), nodeid: 1, insize: 46, pid: 673
> > > LOOKUP /hello
> > > getattr[NULL] /hello
> > >    NODEID: 2
> > >    unique: 4, success, outsize: 144
> > > unique: 6, opcode: OPEN (14), nodeid: 2, insize: 48, pid: 673
> > > open flags: 0x8000 /hello
> > >    open[0] flags: 0x8000 /hello
> > >    unique: 6, success, outsize: 32
> > > unique: 8, opcode: READ (15), nodeid: 2, insize: 80, pid: 673
> > > read[0] 4096 bytes from 0 flags: 0x8000
> > > unique: 10, opcode: FLUSH (25), nodeid: 2, insize: 64, pid: 673
> > >    unique: 10, error: -38 (Function not implemented), outsize: 16
> > > unique: 11, opcode: INTERRUPT (36), nodeid: 0, insize: 48, pid: 0
> > > FUSE_INTERRUPT: reply to kernel to disable interrupt
> > >    unique: 11, error: -38 (Function not implemented), outsize: 16
> > >
> > > unique: 12, opcode: RELEASE (18), nodeid: 2, insize: 64, pid: 0
> > >    unique: 12, success, outsize: 16
> > >
> > > hello_read woke up from sleep
> > >    read[0] 13 bytes from 0
> > >    unique: 8, success, outsize: 29
> > >
> > >
> > > Are these the debug logs you are seeing from the daemon side as well?
> > >
> > > Thanks,
> > > Joanne
> > > > >
> > > > > From the crash logs you provided below, it looks like what's happening
> > > > > is that if the process gets killed, the timer isn't getting deleted.
> >
> > When I looked at this log previously, I thought you were repro-ing by
> > killing the fuse daemon process, not the cat process. When we kill the
> > cat process, the timer shouldn't be getting deleted. (if the daemon
> > itself is killed, the timers get deleted)
> >
> > > > > I'll look more into what happens in fuse when a process is killed and
> > > > > get back to you on this.
> >
> > This is the flow of what is happening on the kernel side (verified by
> > local printks) -
> >
> > `cat /tmp/fuse/hello`:
> > Issues a FUSE_READ background request (via fuse_send_readpages(),
> > fm->fc->async_read). This request will have a timeout of 10 seconds on
> > it
> >
> > The cat process is killed:
> > This does not clean up the request. The request is still on the fpq
> > processing list.
> >
> > Timeout on request expires:
> > The timeout handler runs and properly cleans up / frees the request.
> >
> > Fuse daemon wakes from sleep and replies to the request:
> > In dev_do_write(), the kernel won't be able to find this request
> > (since it timed out and was removed from the fpq processing list) and
> > return with -ENOENT
>
> Thank you for your explanation.
> I will verify if there are any issues with my test environment.
>
Hi Yafang,

Would you mind adding these printks to your kernel when you run the
repro and pasting what they show?

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -287,6 +287,9 @@ static void do_fuse_request_end(struct fuse_req
*req, bool from_timer_callback)
        struct fuse_conn *fc = fm->fc;
        struct fuse_iqueue *fiq = &fc->iq;

+       printk("do_fuse_request_end: req=%p, from_timer=%d,
req->timer.func=%d\n",
+              req, from_timer_callback, req->timer.function != NULL);
+
        if (from_timer_callback)
                req->out.h.error = -ETIME;

@@ -415,6 +418,8 @@ static void fuse_request_timeout(struct timer_list *timer)
 {
        struct fuse_req *req = container_of(timer, struct fuse_req, timer);

+       printk("fuse_request_timeout: req=%p\n", req);
+
        /*
         * Request reply is being finished by the kernel right now.
         * No need to time out the request.
@@ -612,6 +617,7 @@ ssize_t fuse_simple_request(struct fuse_mount *fm,
struct fuse_args *args)

        if (!args->noreply)
                __set_bit(FR_ISREPLY, &req->flags);
+       printk("fuse_simple_request: req=%p, op=%u\n", req, args->opcode);
        __fuse_request_send(req);
        ret = req->out.h.error;
        if (!ret && args->out_argvar) {
@@ -673,6 +679,7 @@ int fuse_simple_background(struct fuse_mount *fm,
struct fuse_args *args,

        fuse_args_to_req(req, args);

+       printk("fuse_background_request: req=%p, op=%u\n", req, args->opcode);
        if (!fuse_request_queue_background(req)) {
                fuse_put_request(req);


When I run it on my side, I see

[   68.117740] fuse_background_request: req=00000000874e2f14, op=26
[   68.131440] do_fuse_request_end: req=00000000874e2f14,
from_timer=0, req->timer.func=1
[   71.558538] fuse_simple_request: req=00000000cf643ace, op=1
[   71.559651] do_fuse_request_end: req=00000000cf643ace,
from_timer=0, req->timer.func=1
[   71.561044] fuse_simple_request: req=00000000f2c001f0, op=14
[   71.562524] do_fuse_request_end: req=00000000f2c001f0,
from_timer=0, req->timer.func=1
[   71.563820] fuse_background_request: req=00000000584f2cc3, op=15
[   78.580035] fuse_simple_request: req=00000000ecbee970, op=25
[   78.582614] do_fuse_request_end: req=00000000ecbee970,
from_timer=0, req->timer.func=1
[   81.624722] fuse_request_timeout: req=00000000584f2cc3
[   81.625443] do_fuse_request_end: req=00000000584f2cc3,
from_timer=1, req->timer.func=1
[   81.626377] fuse_background_request: req=00000000b2d792ed, op=18
[   81.627623] do_fuse_request_end: req=00000000b2d792ed,
from_timer=0, req->timer.func=1

I'm seeing only one timer get called, on the read request (opcode=15),
and I'm not seeing do_fuse_request_end having been called on that
request before the timer is invoked.
I'm curious to compare this against the logs on your end.

Thanks!!

> --
> Regards
> Yafang





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux