Re: 9p: fscache duplicate cookie

Dominique Martinet <asmadeus@xxxxxxxxxxxxx> · Sat, 8 May 2021 09:47:08 +0900

Luis Henriques wrote on Fri, May 07, 2021 at 05:36:29PM +0100:
> Ok, I spent some more time on this issue today.  I've hacked a bit of code
> to keep track of new inodes' qids and I'm convinced there are no
> duplicates when this issue happens.

Ok, that's good.
Just to make sure: what did you look at aside from the qid to check that
it's unique? i_ino comes straight from qid->path too, so we don't have
any great key available, which is why I hadn't suggested building a
debug table.
(... well, actually that means we'll never try to allocate two inodes
with the same inode number because of how v9fs_qid_iget_dotl() works, so
if there is a collision in qid paths we wouldn't see it through a cookie
collision in the first place. I'm not sure that's good? But at least
that clears up that theory, sorry for the bad suggestion)
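
For what it's worth, here is that argument as a toy userspace model. It
is not the real v9fs code: toy_iget(), struct toy_inode and the
fixed-size table are made up for illustration, and the real
v9fs_qid_iget_dotl() path through iget5_locked() is far more involved.
The only point is that when the lookup is keyed on a number derived from
qid->path, a second qid with the same path lands on the existing inode,
so no second cookie is ever requested for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_INODES 16

/* Hypothetical stand-in for an in-core inode; not struct inode. */
struct toy_inode {
	uint64_t ino;	/* derived from the qid path, like i_ino */
	int in_use;
};

static struct toy_inode toy_table[TOY_MAX_INODES];
static int toy_cookie_acquires;	/* times a cookie would be requested */

/*
 * Look up by inode number; only allocate (and count a cookie
 * acquire) when nothing with that number is cached yet.
 */
static struct toy_inode *toy_iget(uint64_t qid_path)
{
	struct toy_inode *free_slot = NULL;

	for (size_t i = 0; i < TOY_MAX_INODES; i++) {
		if (toy_table[i].in_use && toy_table[i].ino == qid_path)
			return &toy_table[i];	/* existing inode, no new cookie */
		if (!toy_table[i].in_use && !free_slot)
			free_slot = &toy_table[i];
	}
	if (!free_slot)
		return NULL;			/* toy table is full */

	free_slot->ino = qid_path;
	free_slot->in_use = 1;
	toy_cookie_acquires++;			/* new inode, one acquire */
	return free_slot;
}

/* Two lookups with the same path share one inode and one acquire. */
static void toy_demo(void)
{
	struct toy_inode *a = toy_iget(42);
	struct toy_inode *b = toy_iget(42);

	assert(a && a == b);
	assert(toy_cookie_acquires == 1);
}
```

Calling toy_demo() on a fresh table passes both assertions, i.e. in this
model a qid path collision is silently merged into one inode rather than
showing up as a duplicate cookie.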

> OTOH, I've done another quick test: in v9fs_cache_inode_get_cookie(), I do
> an fscache_acquire_cookie() retry when it fails (due to the dup error),
> and this retry *does* succeed.  Which means, I guess, there's a race going
> on.  I didn't manage to look too deep yet, but my current theory is that
> the inode is being evicted while an open is triggered.  A new inode is
> allocated but the old inode fscache cookie hasn't yet been relinquished.
> Does this make any sense?

Hm, if the retry goes through I guess that'd make sense: if both cookies
were in use in parallel, the second call should fail all the same, so
that's definitely a likely explanation.

It wouldn't hurt to check in v9fs_evict_inode() whether that's correct...
There definitely is a window between the point where the inode is no
longer findable (thus leading to allocation of a new one) and the call to
fscache_relinquish_cookie() at eviction, but looking at e.g. afs, they
are doing exactly the same as 9p here (iget5_locked(), and if that gets a
new inode then call fscache_acquire_cookie(); fscache_relinquish_cookie()
in the evict op), so I'm not sure what we're missing.
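
To make the suspected interleaving concrete, here is a rough
single-threaded replay of it in userspace C. Everything in it is a
stand-in I made up (struct sim_state and the sim_*() helpers); it does
not call or reproduce the real fscache or v9fs functions, and it assumes
eviction can be split into "inode unhashed" and "cookie relinquished"
with an open landing in between, which is exactly the part that still
needs proof.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single key: one hashed inode, one registered cookie. */
struct sim_state {
	bool inode_hashed;	/* inode is findable by lookup */
	bool cookie_live;	/* a cookie is registered for the key */
};

/* Returns false on a "duplicate cookie" failure. */
static bool sim_acquire_cookie(struct sim_state *s)
{
	if (s->cookie_live)
		return false;
	s->cookie_live = true;
	return true;
}

static void sim_relinquish_cookie(struct sim_state *s)
{
	s->cookie_live = false;
}

/* Reuse a hashed inode, else allocate one and acquire a cookie for it. */
static bool sim_iget(struct sim_state *s)
{
	if (s->inode_hashed)
		return true;
	s->inode_hashed = true;
	return sim_acquire_cookie(s);
}

/* Eviction split in two steps to expose the window. */
static void sim_evict_unhash(struct sim_state *s) { s->inode_hashed = false; }
static void sim_evict_finish(struct sim_state *s) { sim_relinquish_cookie(s); }

/* Replays the suspected order of events; true if it matches the report. */
static bool sim_replay_race(void)
{
	struct sim_state s = { false, false };
	bool first, dup, early_retry, late_retry;

	first = sim_iget(&s);			/* first open: cookie acquired */
	sim_evict_unhash(&s);			/* eviction starts */
	assert(!s.inode_hashed && s.cookie_live);	/* the window */
	dup = sim_iget(&s);			/* open in the window: dup error */
	early_retry = sim_acquire_cookie(&s);	/* old cookie still live: fails */
	sim_evict_finish(&s);			/* old cookie relinquished */
	late_retry = sim_acquire_cookie(&s);	/* retry now goes through */

	return first && !dup && !early_retry && late_retry;
}
```

sim_replay_race() returns true in this model: the acquire fails while
the old cookie is still live and succeeds once it has been relinquished,
which is at least consistent with the retry experiment. It says nothing
about whether the real code can actually hit that ordering.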

David, do you have an idea?

> If this theory is correct, I'm not sure what's the best way to close this
> race because the v9inode->fscache_lock can't really be used.  But I still
> need to prove this is really what's happening.

Yes, I think we're going to need proof before thinking of a solution; I
can't think of anything simple either.

Thanks again for looking into it,
-- 
Dominique