Hi,

Tetsuo Handa wrote on Thu, Aug 11, 2022 at 03:01:23PM +0900:
> https://syzkaller.appspot.com/text?tag=CrashReport&x=154869fd080000
> suggests that p9_client_rpc() is trapped at infinite retry loop

Would be far from the first one, Dmitry brought this up years ago...

> But why does p9 think that Flush operation worth retrying forever?

I can't answer much more than "it's how it was done"; I started
implementing asynchronous flush back when this was first discussed but
my implementation introduced a regression somewhere and I never had
time to debug it; the main "problem" is that we (currently) have no way
of freeing up resources associated with that request if we leave the
thread.
The first step was adding refcounting to requests and this is somewhat
holding up, so all that's left now would be to properly clean things up
if we leave this call.

You can find inspiration in my old patches[1] if you'd like to give it a
try:
[1] https://lore.kernel.org/all/20181217110111.GB17466@nautica/T/

Note that there is one point that wasn't discussed back then, but
according to the 9p man page for flush[2], the request should be
considered successful if the original request's reply comes before the
flush reply.
This might be important e.g. with caching enabled and mkdir, create or
unlink, as the 9p client has no notion of cache coherency... So even if
the caller itself will be busy dealing with a signal, at least the cache
should be kept coherent, somehow.
I don't see any way of doing that with the current 9pfs/9pnet layering,
9pnet cannot call back into the vfs.

[2] https://9fans.github.io/plan9port/man/man9/flush.html

> The peer side should be able to detect close of file descriptor on local
> side due to process termination via SIGKILL, and the peer side should be
> able to perform appropriate recovery operation even if local side cannot
> receive response for Flush operation.

The peer side (= server in my vocabulary) has no idea about processes or
file descriptors, it's the 9p client's job to do any such cleanup.
The vfs takes care of calling the proper close functions that'll end up
in clunk for fids; there was a report of a fid leak recently but these
are rare enough...

The problem isn't open fids though, but really resources associated with
the request itself; it shouldn't be too hard to do (ignoring any cache
coherency issue), but...

> Thus, why not to give up upon SIGKILL?

... "Nobody has done it yet".

Last year I'd probably have answered that I'm open to funding, but
frankly I don't have the time anyway; I'll be happy to review and
lightly test anything sent my way in my meager free time though.

(And yes, I agree ignoring sigkill is bad user experience)

--
Dominique