Luis Henriques <lhenriques@xxxxxxx> writes: > Dominique Martinet <asmadeus@xxxxxxxxxxxxx> writes: > >> Luis Henriques wrote on Fri, May 07, 2021 at 05:36:29PM +0100: >>> Ok, I spent some more time on this issue today. I've hacked a bit of code >>> to keep track of new inodes' qids and I'm convinced there are no >>> duplicates when this issue happens. >> >> Ok, that's good. >> Just to make sure what did you look at aside of the qid to make sure >> it's unique? i_ino comes straight from qid->path too so we don't have >> any great key available which is why I hadn't suggesting building a >> debug table. >> (... well, actually that means we'll never try to allocate two inodes >> with the same inode number because of how v9fs_qid_iget_dotl() works, so >> if there is a collision in qid paths we wouldn't see it through cookies >> collision in the first place. I'm not sure that's good? But at least >> that clears up that theory, sorry for the bad suggestion) >> > > Ok, I should probably have added some more details in my email. So, > here's what I did: > > I've created a list (actually a tree...) of objects that keep track of > each new inode in v9fs_qid_iget_dotl(). These objects contain the inode > number, and the qid (type, version, path). These are then removed from > the list in v9fs_evict_inode(). > > Whenever there's an error in v9fs_cache_inode_get_cookie(), i.e. when > v9inode->fscache == NULL, I'll search this list to see if that inode > number was there (or if I can find the qid.path and qid.version). > > I have never seen a hit in this search, hence my claim of not having a > duplicate qids involved. Of course my hack can be buggy and I completely > miss it ;-) > >>> OTOH, I've done another quick test: in v9fs_cache_inode_get_cookie(), I do >>> an fscache_acquire_cookie() retry when it fails (due to the dup error), >>> and this retry *does* succeed. Which means, I guess, there's a race going >>> on. I didn't managed to look too deep yet, but my current theory is that >>> the inode is being evicted while an open is triggered. A new inode is >>> allocated but the old inode fscache cookie hasn't yet been relinquished. >>> Does this make any sense? >> >> hm, if the retry goes through I guess that'd make sense; if they both >> were used in parallel the second call should fail all the same so that's >> definitely a likely explanation. >> >> It wouldn't hurt to check with v9fs_evict_inode if that's correct... >> There definitely is a window where inode is no longer findable (thus >> leading to allocation of a new one) and the call to the >> fscache_relinquish_cookie() at eviction, but looking at e.g. afs they >> are doing exactly the same as 9p here (iget5_locked, if that gets a new >> inode then call fscache_acquire_cookie // fscache_relinquish_cookie in >> evict op) so I'm not sure what we're missing. >> >> >> David, do you have an idea? > > I've just done a quick experiment: moving the call to function > v9fs_cache_inode_put_cookie() in v9fs_evict_inode() to the beginning > (before truncate_inode_pages_final()) and it seems to, at least, narrow > the window -- I'm not able to reproduce the issue anymore. But I'll need > to look closer. > And right after sending this email I decided to try a different experiment. Here's the code I had in v9fs_evict_inode(): void v9fs_evict_inode(struct inode *inode) { struct v9fs_inode *v9inode = V9FS_I(inode); v9fs_debug_remove(inode->i_ino); /* <------------------------- */ truncate_inode_pages_final(&inode->i_data); clear_inode(inode); filemap_fdatawrite(&inode->i_data); v9fs_cache_inode_put_cookie(inode); /* clunk the fid stashed in writeback_fid */ if (v9inode->writeback_fid) { p9_client_clunk(v9inode->writeback_fid); v9inode->writeback_fid = NULL; } } v9fs_debug_remove() is the function that would remove the inode from my debug tree. In my new experiment, I changed this with: void v9fs_evict_inode(struct inode *inode) { struct v9fs_inode *v9inode = V9FS_I(inode); v9fs_debug_tag(inode->i_ino); (...) v9fs_debug_remove(inode->i_ino); } So, I effectively "tag" the inode for removing it from the list but only remove it in the end. The result is that I actually started seeing this inode tagged for removing in v9fs_cache_inode_get_cookie() when fscache detects the duplicate cookie! I'm attaching the debug patch I'm using. Obviously, I'm not really proud of this code and is not for merging (it's *really* hacky!), but maybe it helps clarifying what I tried to explain above. I.e. that fscache_relinquish_cookie() should probably be invoked early when evicting an inode. FTR, with this patch applied I occasionally (but not always!) see the following: [DEBUG] inode: 24187397 quid: 0.1711203.633864d4 [DEBUG] found in tree qid: 0.1711203.633864d4 removing: 1 [DEBUG] found dup ino: 24187397 0.1711203.633864d4 removing: 1 (Sometimes I do not see the duplicate being found, which probably means I didn't hit the race.) David, does this make sense? Cheers -- Luis
diff --git a/fs/9p/cache.c b/fs/9p/cache.c index eb2151fb6049..59b59291609c 100644 --- a/fs/9p/cache.c +++ b/fs/9p/cache.c @@ -122,6 +122,30 @@ void v9fs_cache_inode_get_cookie(struct inode *inode) i_size_read(&v9inode->vfs_inode), true); + if (!v9inode->fscache) { + struct v9fs_debug *db, *next; + + mutex_lock(&v9fs_debug_tree_mutex); + db = v9fs_debug_search(inode->i_ino); + printk("[DEBUG] inode: %ld quid: %x.%llx.%x\n", + inode->i_ino, + v9inode->qid.type, + (unsigned long long)v9inode->qid.path, + v9inode->qid.version); + if (db) + printk("[DEBUG] found in tree qid: %x.%llx.%x removing: %d\n", + db->type, (unsigned long long)db->path, + db->version, db->removing); + rbtree_postorder_for_each_entry_safe(db, next, &v9fs_debug_tree, node) { + if (db->path == v9inode->qid.path && + db->version == v9inode->qid.version) + printk("[DEBUG] found dup ino: %ld %x.%llx.%x removing: %d\n", + db->inode, + db->type, (unsigned long long)db->path, + db->version, db->removing); + } + mutex_unlock(&v9fs_debug_tree_mutex); + } p9_debug(P9_DEBUG_FSC, "inode %p get cookie %p\n", inode, v9inode->fscache); } diff --git a/fs/9p/v9fs.h b/fs/9p/v9fs.h index 4ca56c5dd637..7935ab56453c 100644 --- a/fs/9p/v9fs.h +++ b/fs/9p/v9fs.h @@ -119,6 +119,22 @@ struct v9fs_inode { struct inode vfs_inode; }; +struct v9fs_debug { + struct rb_node node; + unsigned long inode; + u8 type; + u32 version; + u64 path; + bool removing; +}; + +extern struct rb_root v9fs_debug_tree; +extern struct mutex v9fs_debug_tree_mutex; +extern void v9fs_debug_remove(unsigned long inode); +extern void v9fs_debug_tag(unsigned long inode); +extern struct v9fs_debug *v9fs_debug_search(unsigned long inode); +extern void v9fs_debug_insert(unsigned long inode, u8 type, u32 version, u64 path); + static inline struct v9fs_inode *V9FS_I(const struct inode *inode) { return container_of(inode, struct v9fs_inode, vfs_inode); diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 795706520b5e..e5ebec787c28 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -377,6 +377,7 @@ void v9fs_evict_inode(struct inode *inode) { struct v9fs_inode *v9inode = V9FS_I(inode); + v9fs_debug_tag(inode->i_ino); truncate_inode_pages_final(&inode->i_data); clear_inode(inode); filemap_fdatawrite(&inode->i_data); @@ -387,6 +388,7 @@ void v9fs_evict_inode(struct inode *inode) p9_client_clunk(v9inode->writeback_fid); v9inode->writeback_fid = NULL; } + v9fs_debug_remove(inode->i_ino); } static int v9fs_test_inode(struct inode *inode, void *data) diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c index e1c0240b51c0..56a7bc0c627b 100644 --- a/fs/9p/vfs_inode_dotl.c +++ b/fs/9p/vfs_inode_dotl.c @@ -32,6 +32,9 @@ #include "xattr.h" #include "acl.h" +struct rb_root v9fs_debug_tree = RB_ROOT; +DEFINE_MUTEX(v9fs_debug_tree_mutex); + static int v9fs_vfs_mknod_dotl(struct user_namespace *mnt_userns, struct inode *dir, struct dentry *dentry, umode_t omode, dev_t rdev); @@ -94,6 +97,94 @@ static int v9fs_set_inode_dotl(struct inode *inode, void *data) return 0; } +struct v9fs_debug *v9fs_debug_search(unsigned long inode) +{ + struct rb_node *node = v9fs_debug_tree.rb_node; + + while (node) { + struct v9fs_debug *data = container_of(node, struct v9fs_debug, node); + + if (data->inode < inode) + node = node->rb_left; + else if (data->inode > inode) + node = node->rb_right; + else + return data; + } + return NULL; +} + +void v9fs_debug_tag(unsigned long inode) +{ + struct v9fs_debug *node; + + mutex_lock(&v9fs_debug_tree_mutex); + node = v9fs_debug_search(inode); + if (node) + node->removing = true; + else + printk("[DEBUG] can't find %ld for tagging\n", inode); + mutex_unlock(&v9fs_debug_tree_mutex); +} + +void v9fs_debug_remove(unsigned long inode) +{ + struct v9fs_debug *node; + + mutex_lock(&v9fs_debug_tree_mutex); + node = v9fs_debug_search(inode); + if (node) { + rb_erase(&node->node, &v9fs_debug_tree); + kfree(node); + } else + printk("[DEBUG] can't find %ld\n", inode); + mutex_unlock(&v9fs_debug_tree_mutex); +} + +void v9fs_debug_insert(unsigned long inode, u8 type, u32 version, u64 path) +{ + struct v9fs_debug *db = NULL; + struct rb_node **node, *parent = NULL; + + mutex_lock(&v9fs_debug_tree_mutex); + node = &(v9fs_debug_tree.rb_node); + while (*node) { + parent = *node; + db = container_of(*node, struct v9fs_debug, node); + + if (db->inode < inode) + node = &((*node)->rb_left); + else if (db->inode > inode) + node = &((*node)->rb_right); + else + break; + } + if (!db || (db->inode != inode)) { + db = kmalloc(sizeof(*db), GFP_KERNEL); + if (db) { + db->inode = inode; + db->type = type; + db->version = version; + db->path = path; + db->removing = false; + rb_link_node(&db->node, parent, node); + rb_insert_color(&db->node, &v9fs_debug_tree); + } else + printk("[DEBUG] Failed to alloc memory!\n"); + } + if (db) { + if (type != db->type || + version != db->version || + path != db->path) { + db->inode = inode; + db->type = type; + db->version = version; + db->path = path; + } + } + mutex_unlock(&v9fs_debug_tree_mutex); +} + static struct inode *v9fs_qid_iget_dotl(struct super_block *sb, struct p9_qid *qid, struct p9_fid *fid, @@ -134,6 +225,7 @@ static struct inode *v9fs_qid_iget_dotl(struct super_block *sb, if (retval) goto error; + v9fs_debug_insert(inode->i_ino, qid->type, qid->version, qid->path); unlock_new_inode(inode); return inode; error:
> And yeah, I had checked other filesystems too and they seem to follow this > pattern. So either I'm looking at the wrong place or this is something > that is much more likely to be triggered by 9p. > > Cheers, > -- > Luis > >>> If this theory is correct, I'm not sure what's the best way to close this >>> race because the v9inode->fscache_lock can't really be used. But I still >>> need to proof this is really what's happening. >> >> Yes, I think we're going to need proof before thinking of a solution, I >> can't think of anything simple either. >> >> >> Thanks again for looking into it, >> -- >> Dominique