On Thu, Aug 1, 2024 at 2:46 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Wed, Jul 31, 2024 at 10:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > On Tue, Jul 30, 2024 at 7:14 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > On Wed, Jul 31, 2024 at 2:16 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > > > > > On Mon, Jul 29, 2024 at 11:00 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > > > > > On Tue, Jul 30, 2024 at 8:28 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > > > > > > > > > There are situations where fuse servers can become unresponsive or take > > > > > > too long to reply to a request. Currently there is no upper bound on > > > > > > how long a request may take, which may be frustrating to users who get > > > > > > stuck waiting for a request to complete. > > > > > > > > > > > > This patchset adds a timeout option for requests and two dynamically > > > > > > configurable fuse sysctls "default_request_timeout" and "max_request_timeout" > > > > > > for controlling/enforcing timeout behavior system-wide. > > > > > > > > > > > > Existing fuse servers will not be affected unless they explicitly opt into the > > > > > > timeout. > > > > > > > > > > > > v1: https://lore.kernel.org/linux-fsdevel/20240717213458.1613347-1-joannelkoong@xxxxxxxxx/ > > > > > > Changes from v1: > > > > > > - Add timeout for background requests > > > > > > - Handle resend race condition > > > > > > - Add sysctls > > > > > > > > > > > > Joanne Koong (2): > > > > > > fuse: add optional kernel-enforced timeout for requests > > > > > > fuse: add default_request_timeout and max_request_timeout sysctls > > > > > > > > > > > > Documentation/admin-guide/sysctl/fs.rst | 17 +++ > > > > > > fs/fuse/Makefile | 2 +- > > > > > > fs/fuse/dev.c | 187 +++++++++++++++++++++++- > > > > > > fs/fuse/fuse_i.h | 30 ++++ > > > > > > fs/fuse/inode.c | 24 +++ > > > > > > fs/fuse/sysctl.c | 42 ++++++ > > > > > > 6 files changed, 293 insertions(+), 9 deletions(-) > > > > > > create mode 100644 fs/fuse/sysctl.c > > > > > > > > > > > > -- > > > > > > 2.43.0 > > > > > > > > > > > > > > > > Hello Joanne, > > > > > > > > > > Thanks for your update. > > > > > > > > > > I have tested your patches using my test case, which is similar to the > > > > > hello-fuse [0] example, with an additional change as follows: > > > > > > > > > > @@ -125,6 +125,8 @@ static int hello_read(const char *path, char *buf, > > > > > size_t size, off_t offset, > > > > > } else > > > > > size = 0; > > > > > > > > > > + // TO trigger timeout > > > > > + sleep(60); > > > > > return size; > > > > > } > > > > > > > > > > [0] https://github.com/libfuse/libfuse/blob/master/example/hello.c > > > > > > > > > > However, it triggered a crash with the following setup: > > > > > > > > > > 1. Set FUSE timeout: > > > > > sysctl -w fs.fuse.default_request_timeout=10 > > > > > sysctl -w fs.fuse.max_request_timeout = 20 > > > > > > > > > > 2. Start FUSE daemon: > > > > > ./hello /tmp/fuse > > > > > > > > > > 3. Read from FUSE: > > > > > cat /tmp/fuse/hello > > > > > > > > > > 4. Kill the process within 10 seconds (to avoid the timeout being triggered). > > > > > Then the crash will be triggered. > > > > > > > > Hi Yafang, > > > > > > > > Thanks for trying this out on your use case! > > > > > > > > How consistently are you able to repro this? > > > > > > It triggers the crash every time. > > > > > > > I tried reproing using > > > > your instructions above but I'm not able to get the crash. > > > > > > Please note that it is the `cat /tmp/fuse/hello` process that was > > > killed, not the fuse daemon. > > > The crash seems to occur when the fuse daemon wakes up after > > > sleep(60). Please ensure that the fuse daemon can be woken up. > > > > > > > I'm still not able to trigger the crash by killing the `cat > > /tmp/fuse/hello` process. This is how I'm repro-ing > > > > 1) Add sleep to test code in > > https://github.com/libfuse/libfuse/blob/master/example/hello.c > > @@ -125,6 +126,9 @@ static int hello_read(const char *path, char *buf, > > size_t size, off_t offset, > > } else > > size = 0; > > > > + sleep(60); > > + printf("hello_read woke up from sleep\n"); > > + > > return size; > > } > > > > 2) Set fuse timeout to 10 seconds > > sysctl -w fs.fuse.default_request_timeout=10 > > > > 3) Start fuse daemon > > ./example/hello ./tmp/fuse > > > > 4) Read from fuse > > cat /tmp/fuse/hello > > > > 5) Get pid of cat process > > top -b | grep cat > > > > 6) Kill cat process (within 10 seconds) > > sudo kill -9 <cat-pid> > > > > 7) Wait 60 seconds for fuse's read request to complete > > > > From what it sounds like, this is exactly what you are doing as well? > > > > I added some kernel-side logs and I'm seeing that the read request is > > timing out after ~10 seconds and handled by the timeout handler > > successfully. > > > > On the fuse daemon side, these are the logs I'm seeing from the above repro: > > ./example/hello /tmp/fuse -f -d > > > > FUSE library version: 3.17.0 > > nullpath_ok: 0 > > unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0 > > INIT: 7.40 > > flags=0x73fffffb > > max_readahead=0x00020000 > > INIT: 7.40 > > flags=0x4040f039 > > max_readahead=0x00020000 > > max_write=0x00100000 > > max_background=0 > > congestion_threshold=0 > > time_gran=1 > > unique: 2, success, outsize: 80 > > unique: 4, opcode: LOOKUP (1), nodeid: 1, insize: 46, pid: 673 > > LOOKUP /hello > > getattr[NULL] /hello > > NODEID: 2 > > unique: 4, success, outsize: 144 > > unique: 6, opcode: OPEN (14), nodeid: 2, insize: 48, pid: 673 > > open flags: 0x8000 /hello > > open[0] flags: 0x8000 /hello > > unique: 6, success, outsize: 32 > > unique: 8, opcode: READ (15), nodeid: 2, insize: 80, pid: 673 > > read[0] 4096 bytes from 0 flags: 0x8000 > > unique: 10, opcode: FLUSH (25), nodeid: 2, insize: 64, pid: 673 > > unique: 10, error: -38 (Function not implemented), outsize: 16 > > unique: 11, opcode: INTERRUPT (36), nodeid: 0, insize: 48, pid: 0 > > FUSE_INTERRUPT: reply to kernel to disable interrupt > > unique: 11, error: -38 (Function not implemented), outsize: 16 > > > > unique: 12, opcode: RELEASE (18), nodeid: 2, insize: 64, pid: 0 > > unique: 12, success, outsize: 16 > > > > hello_read woke up from sleep > > read[0] 13 bytes from 0 > > unique: 8, success, outsize: 29 > > > > > > Are these the debug logs you are seeing from the daemon side as well? > > > > Thanks, > > Joanne > > > > > > > > From the crash logs you provided below, it looks like what's happening > > > > is that if the process gets killed, the timer isn't getting deleted. > > When I looked at this log previously, I thought you were repro-ing by > killing the fuse daemon process, not the cat process. When we kill the > cat process, the timer shouldn't be getting deleted. (if the daemon > itself is killed, the timers get deleted) > > > > > I'll look more into what happens in fuse when a process is killed and > > > > get back to you on this. > > This is the flow of what is happening on the kernel side (verified by > local printks) - > > `cat /tmp/fuse/hello`: > Issues a FUSE_READ background request (via fuse_send_readpages(), > fm->fc->async_read). This request will have a timeout of 10 seconds on > it > > The cat process is killed: > This does not clean up the request. The request is still on the fpq > processing list. > > Timeout on request expires: > The timeout handler runs and properly cleans up / frees the request. > > Fuse daemon wakes from sleep and replies to the request: > In dev_do_write(), the kernel won't be able to find this request > (since it timed out and was removed from the fpq processing list) and > return with -ENOENT Thank you for your explanation. I will verify if there are any issues with my test environment. -- Regards Yafang