Re: INFO: task hung in iterate_supers

Dominique Martinet <asmadeus@xxxxxxxxxxxxx> · Thu, 11 Aug 2022 15:53:14 +0900

Hi,

Tetsuo Handa wrote on Thu, Aug 11, 2022 at 03:01:23PM +0900:
> https://syzkaller.appspot.com/text?tag=CrashReport&x=154869fd080000
> suggests that p9_client_rpc() is trapped at infinite retry loop

It would be far from the first one; Dmitry brought this up years ago...

> But why does p9 think that Flush operation worth retrying forever?

I can't answer much more than "it's how it was done"; I started
implementing asynchronous flush back when this was first discussed but
my implementation introduced a regression somewhere and I never had time
to debug it; the main "problem" is that we (currently) have no way of
freeing up resources associated with that request if we leave the
thread.
The first step was adding refcounting to requests, and this is somewhat
holding up, so all that's left now would be to properly clean things up
if we leave this call.
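
To illustrate the refcounting idea above, here is a minimal userspace
sketch (not the actual 9pnet code; every name in it is hypothetical).
It assumes a request holds one reference for the calling thread and one
for the transport, so the caller can leave early and whoever drops the
last reference frees the request:

```c
/* Userspace sketch only: all names are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_req {
	atomic_int refcount;	/* one ref per holder: caller, transport */
	int tag;		/* request tag, for illustration only */
};

/* Count of freed requests, only here to make the sketch observable. */
static int sketch_freed;

static struct sketch_req *sketch_req_alloc(int tag)
{
	struct sketch_req *req = malloc(sizeof(*req));

	if (!req)
		return NULL;
	atomic_init(&req->refcount, 2);	/* assumed: caller + transport */
	req->tag = tag;
	return req;
}

/* Drop one reference; returns true if this call freed the request. */
static bool sketch_req_put(struct sketch_req *req)
{
	int old = atomic_fetch_sub(&req->refcount, 1);

	assert(old > 0);
	if (old == 1) {
		free(req);
		sketch_freed++;
		return true;
	}
	return false;
}
```

With this, a caller interrupted by a signal would call sketch_req_put()
and return, and the transport would drop the final reference whenever
the late reply (or the flush reply) eventually shows up. What the sketch
does not cover is the part that is actually missing: cleaning up
everything else tied to the request.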

You can find inspiration in my old patches[1] if you'd like to give it a
try:
[1] https://lore.kernel.org/all/20181217110111.GB17466@nautica/T/

Note that there is one point that wasn't discussed back then: according
to the 9p man page for flush[2], the request should be considered
successful if the original request's reply comes before the flush
reply.
This might be important e.g. for mkdir, create or unlink with caching
enabled, as the 9p client has no notion of cache coherency... So even if
the caller itself will be busy dealing with a signal, at least the cache
should be kept coherent, somehow.
I don't see any way of doing that with the current 9pfs/9pnet layering:
9pnet cannot call back into the vfs.

[2] https://9fans.github.io/plan9port/man/man9/flush.html
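
The flush rule described above can be sketched as a small decision
function. This is only an illustration under my reading of [2]; the
enum and function names are made up and nothing like this exists in
9pnet as-is:

```c
/* Userspace sketch only: all names are hypothetical. */
#include <assert.h>
#include <stddef.h>

enum sketch_event {
	SKETCH_OTHER,		/* reply for some unrelated tag */
	SKETCH_ORIG_REPLY,	/* reply to the request being flushed */
	SKETCH_RFLUSH,		/* reply to the flush itself */
};

enum sketch_outcome {
	SKETCH_PENDING,		/* neither reply seen yet */
	SKETCH_HONOR_REPLY,	/* original reply won: treat as completed */
	SKETCH_FLUSHED,		/* Rflush came first: request was cancelled */
};

/* Decide the fate of a flushed request from replies in arrival order. */
static enum sketch_outcome
sketch_flush_outcome(const enum sketch_event *events, size_t n)
{
	assert(events != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (events[i] == SKETCH_ORIG_REPLY)
			return SKETCH_HONOR_REPLY;
		if (events[i] == SKETCH_RFLUSH)
			return SKETCH_FLUSHED;
	}
	return SKETCH_PENDING;
}
```

The SKETCH_HONOR_REPLY case is the awkward one: the server-side state
did change, so something would still have to update the client cache
even though the original caller is gone.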

> The peer side should be able to detect close of file descriptor on local
> side due to process termination via SIGKILL, and the peer side should be
> able to perform appropriate recovery operation even if local side cannot
> receive response for Flush operation.

The peer side (= server in my vocabulary) has no idea about processes or
file descriptors, it's the 9p client's job to do any such cleanup.

The vfs takes care of calling the proper close functions, which end up
clunking the fids properly; there was a report of a fid leak recently,
but these are rare enough...

The problem isn't open fids though, but really resources associated with
the request itself; it shouldn't be too hard to do (ignoring any cache
coherency issue), but...

> Thus, why not to give up upon SIGKILL?

... "Nobody has done it yet".

Last year I'd probably have answered that I'm open to funding, but
frankly I don't have the time anyway; I'll be happy to review and
lightly test anything sent my way in my meager free time though.

(And yes, I agree ignoring SIGKILL is a bad user experience)

-- 
Dominique