Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs

Mateusz Guzik <mjguzik@xxxxxxxxx> · Sat, 20 Jul 2024 12:42:15 +0200

On Fri, Jul 19, 2024 at 11:21:51AM +0800, Zhihao Cheng wrote:
> 在 2024/7/18 21:40, Jan Kara 写道:
> > I'm pondering about the best way to fix this. Maybe we could handle the
> > need for inode pinning in inode_lru_isolate() in a similar way as in
> > writeback code so that last iput() cannot happen from inode_lru_isolate().
> > In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> > flag to clear. I'll probably sleep to it and if I won't find it too
> > disgusting to live tomorrow, I can code it.
> > 
> 
> I guess that you may modify like this:
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..5b1a9b23f53f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);
> 
>  static void __inode_add_lru(struct inode *inode, bool rotate)
>  {
> -       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING |
> I_WILL_FREE))
> +       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE
> | I_PINING))
>                 return;
>         if (atomic_read(&inode->i_count))
>                 return;
> @@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
>          * be under pressure before the cache inside the highmem zone.
>          */
>         if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> -               __iget(inode);
> +               inode->i_state |= I_PINING;
>                 spin_unlock(&inode->i_lock);
>                 spin_unlock(lru_lock);
>                 if (remove_inode_buffers(inode)) {
> @@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
>                                 __count_vm_events(PGINODESTEAL, reap);
>                         mm_account_reclaimed_pages(reap);
>                 }
> -               iput(inode);
> +               spin_lock(&inode->i_lock);
> +               inode->i_state &= ~I_PINING;
> +               wake_up_bit(&inode->i_state, __I_PINING);
> +               spin_unlock(&inode->i_lock);
>                 spin_lock(lru_lock);
>                 return LRU_RETRY;
>         }
> @@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
>                 return;
>         }
> 
> +       inode_wait_for_pining(inode);
>         state = inode->i_state;
>         if (!drop) {
>                 WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..daf094fff5fe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb,
> struct kiocb *kiocb_src,
>  #define I_DONTCACHE            (1 << 16)
>  #define I_SYNC_QUEUED          (1 << 17)
>  #define I_PINNING_NETFS_WB     (1 << 18)
> +#define __I_PINING             19
> +#define I_PINING               (1 << __I_PINING)
> 
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> 
> , which means that we will import a new inode state to solve the problem.
> 

My non-maintainer $0,03 is as follows:

1. I_PINING is too generic of a name. I_LRU_PINNED or something else
indicating what this is for would be prudent
2. while not specific to this patch, the handling of i_state is too
accidental-breakage friendly. a full blown solution is way out of the
scope here, but something can be done to future-proof this work anyway.

To that end I would suggest:
1. inode_lru_pin() which appart from setting the flag includes:
	BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE)
2. inode_lru_unpin() which apart from unsetting the flag + wakeup includes:
	BUG_ON(!(inode->i_state & I_LRU_PINNED))
3. inode_lru_wait_for_pinned() 

However, a non-cosmetic remark is that at the spot inode_wait_for_pining
gets invoked none of the of the pinning-blocking flags may be set (to my
reading anyway). This is not the end of the world, but it does mean the
waiting routine will have to check stuff in a loop.

Names are not that important, the key is to keep the logic and
dependencies close by code-wise.