On Wed, Jul 31, 2024 at 10:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Tue, Jul 30, 2024 at 7:14 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > On Wed, Jul 31, 2024 at 2:16 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > > > On Mon, Jul 29, 2024 at 11:00 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > > > On Tue, Jul 30, 2024 at 8:28 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > > > > > > > There are situations where fuse servers can become unresponsive or take > > > > > too long to reply to a request. Currently there is no upper bound on > > > > > how long a request may take, which may be frustrating to users who get > > > > > stuck waiting for a request to complete. > > > > > > > > > > This patchset adds a timeout option for requests and two dynamically > > > > > configurable fuse sysctls "default_request_timeout" and "max_request_timeout" > > > > > for controlling/enforcing timeout behavior system-wide. > > > > > > > > > > Existing fuse servers will not be affected unless they explicitly opt into the > > > > > timeout. > > > > > > > > > > v1: https://lore.kernel.org/linux-fsdevel/20240717213458.1613347-1-joannelkoong@xxxxxxxxx/ > > > > > Changes from v1: > > > > > - Add timeout for background requests > > > > > - Handle resend race condition > > > > > - Add sysctls > > > > > > > > > > Joanne Koong (2): > > > > > fuse: add optional kernel-enforced timeout for requests > > > > > fuse: add default_request_timeout and max_request_timeout sysctls > > > > > > > > > > Documentation/admin-guide/sysctl/fs.rst | 17 +++ > > > > > fs/fuse/Makefile | 2 +- > > > > > fs/fuse/dev.c | 187 +++++++++++++++++++++++- > > > > > fs/fuse/fuse_i.h | 30 ++++ > > > > > fs/fuse/inode.c | 24 +++ > > > > > fs/fuse/sysctl.c | 42 ++++++ > > > > > 6 files changed, 293 insertions(+), 9 deletions(-) > > > > > create mode 100644 fs/fuse/sysctl.c > > > > > > > > > > -- > > > > > 2.43.0 > > > > > > > > > > > > > Hello Joanne, > > > > > > > > Thanks for your update. > > > > > > > > I have tested your patches using my test case, which is similar to the > > > > hello-fuse [0] example, with an additional change as follows: > > > > > > > > @@ -125,6 +125,8 @@ static int hello_read(const char *path, char *buf, > > > > size_t size, off_t offset, > > > > } else > > > > size = 0; > > > > > > > > + // TO trigger timeout > > > > + sleep(60); > > > > return size; > > > > } > > > > > > > > [0] https://github.com/libfuse/libfuse/blob/master/example/hello.c > > > > > > > > However, it triggered a crash with the following setup: > > > > > > > > 1. Set FUSE timeout: > > > > sysctl -w fs.fuse.default_request_timeout=10 > > > > sysctl -w fs.fuse.max_request_timeout = 20 > > > > > > > > 2. Start FUSE daemon: > > > > ./hello /tmp/fuse > > > > > > > > 3. Read from FUSE: > > > > cat /tmp/fuse/hello > > > > > > > > 4. Kill the process within 10 seconds (to avoid the timeout being triggered). > > > > Then the crash will be triggered. > > > > > > Hi Yafang, > > > > > > Thanks for trying this out on your use case! > > > > > > How consistently are you able to repro this? > > > > It triggers the crash every time. > > > > > I tried reproing using > > > your instructions above but I'm not able to get the crash. > > > > Please note that it is the `cat /tmp/fuse/hello` process that was > > killed, not the fuse daemon. > > The crash seems to occur when the fuse daemon wakes up after > > sleep(60). Please ensure that the fuse daemon can be woken up. > > > > I'm still not able to trigger the crash by killing the `cat > /tmp/fuse/hello` process. This is how I'm repro-ing > > 1) Add sleep to test code in > https://github.com/libfuse/libfuse/blob/master/example/hello.c > @@ -125,6 +126,9 @@ static int hello_read(const char *path, char *buf, > size_t size, off_t offset, > } else > size = 0; > > + sleep(60); > + printf("hello_read woke up from sleep\n"); > + > return size; > } > > 2) Set fuse timeout to 10 seconds > sysctl -w fs.fuse.default_request_timeout=10 > > 3) Start fuse daemon > ./example/hello ./tmp/fuse > > 4) Read from fuse > cat /tmp/fuse/hello > > 5) Get pid of cat process > top -b | grep cat > > 6) Kill cat process (within 10 seconds) > sudo kill -9 <cat-pid> > > 7) Wait 60 seconds for fuse's read request to complete > > From what it sounds like, this is exactly what you are doing as well? > > I added some kernel-side logs and I'm seeing that the read request is > timing out after ~10 seconds and handled by the timeout handler > successfully. > > On the fuse daemon side, these are the logs I'm seeing from the above repro: > ./example/hello /tmp/fuse -f -d > > FUSE library version: 3.17.0 > nullpath_ok: 0 > unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0 > INIT: 7.40 > flags=0x73fffffb > max_readahead=0x00020000 > INIT: 7.40 > flags=0x4040f039 > max_readahead=0x00020000 > max_write=0x00100000 > max_background=0 > congestion_threshold=0 > time_gran=1 > unique: 2, success, outsize: 80 > unique: 4, opcode: LOOKUP (1), nodeid: 1, insize: 46, pid: 673 > LOOKUP /hello > getattr[NULL] /hello > NODEID: 2 > unique: 4, success, outsize: 144 > unique: 6, opcode: OPEN (14), nodeid: 2, insize: 48, pid: 673 > open flags: 0x8000 /hello > open[0] flags: 0x8000 /hello > unique: 6, success, outsize: 32 > unique: 8, opcode: READ (15), nodeid: 2, insize: 80, pid: 673 > read[0] 4096 bytes from 0 flags: 0x8000 > unique: 10, opcode: FLUSH (25), nodeid: 2, insize: 64, pid: 673 > unique: 10, error: -38 (Function not implemented), outsize: 16 > unique: 11, opcode: INTERRUPT (36), nodeid: 0, insize: 48, pid: 0 > FUSE_INTERRUPT: reply to kernel to disable interrupt > unique: 11, error: -38 (Function not implemented), outsize: 16 > > unique: 12, opcode: RELEASE (18), nodeid: 2, insize: 64, pid: 0 > unique: 12, success, outsize: 16 > > hello_read woke up from sleep > read[0] 13 bytes from 0 > unique: 8, success, outsize: 29 > > > Are these the debug logs you are seeing from the daemon side as well? > > Thanks, > Joanne > > > > > > From the crash logs you provided below, it looks like what's happening > > > is that if the process gets killed, the timer isn't getting deleted. When I looked at this log previously, I thought you were repro-ing by killing the fuse daemon process, not the cat process. When we kill the cat process, the timer shouldn't be getting deleted. (if the daemon itself is killed, the timers get deleted) > > > I'll look more into what happens in fuse when a process is killed and > > > get back to you on this. This is the flow of what is happening on the kernel side (verified by local printks) - `cat /tmp/fuse/hello`: Issues a FUSE_READ background request (via fuse_send_readpages(), fm->fc->async_read). This request will have a timeout of 10 seconds on it The cat process is killed: This does not clean up the request. The request is still on the fpq processing list. Timeout on request expires: The timeout handler runs and properly cleans up / frees the request. Fuse daemon wakes from sleep and replies to the request: In dev_do_write(), the kernel won't be able to find this request (since it timed out and was removed from the fpq processing list) and return with -ENOENT > > > > Thanks > > > > -- > > Regards > > Yafang