On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote: > On 05.04.2019 5:12, Hugh Dickins wrote: > > Hi Alex, could you please give the patch below a try? It fixes a > > problem, but I'm not sure that it's your problem - please let us know. > > > > I've not yet written up the commit description, and this should end up > > as 4/4 in a series fixing several new swapoff issues: I'll wait to post > > the finished series until heard back from you. > > > > I did first try following the suggestion Konstantin had made back then, > > for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active). > > > > But it turned out to be difficult to get right in shmem_unuse(), because > > of the way that relies on the inode as a cursor in the list - problem > > when you've acquired an s_active reference, but fail to acquire inode > > reference, and cannot safely release the s_active reference while still > > holding the swaplist mutex. > > > > If VFS offered an isgrab(inode), like igrab() but acquiring s_active > > reference while holding i_lock, that would drop very easily into the > > current shmem_unuse() as a replacement there for igrab(). But the rest > > of the world has managed without that for years, so I'm disinclined to > > add it just for this. And the patch below seems good enough without it. > > > > Thanks, > > Hugh > > > > --- > > > > include/linux/shmem_fs.h | 1 + > > mm/shmem.c | 39 ++++++++++++++++++--------------------- > > 2 files changed, 19 insertions(+), 21 deletions(-) > > > > --- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700 > > +++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700 > > @@ -21,6 +21,7 @@ struct shmem_inode_info { > > struct list_head swaplist; /* chain of maybes on swap */ > > struct shared_policy policy; /* NUMA memory alloc policy > > */ > > struct simple_xattrs xattrs; /* list of xattrs */ > > + atomic_t stop_eviction; /* hold when working on inode > > */ > > struct inode vfs_inode; > > }; > > --- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700 > > +++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700 > > @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino > > } > > spin_unlock(&sbinfo->shrinklist_lock); > > } > > - if (!list_empty(&info->swaplist)) { > > + while (!list_empty(&info->swaplist)) { > > + /* Wait while shmem_unuse() is scanning this inode... > > */ > > + wait_var_event(&info->stop_eviction, > > + !atomic_read(&info->stop_eviction)); > > mutex_lock(&shmem_swaplist_mutex); > > list_del_init(&info->swaplist); > > + /* ...but beware of the race if we peeked too early > > */ > > + if (!atomic_read(&info->stop_eviction)) > > + list_del_init(&info->swaplist); > > mutex_unlock(&shmem_swaplist_mutex); > > } > > } > > @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool > > unsigned long *fs_pages_to_unuse) > > { > > struct shmem_inode_info *info, *next; > > - struct inode *inode; > > - struct inode *prev_inode = NULL; > > int error = 0; > > if (list_empty(&shmem_swaplist)) > > return 0; > > mutex_lock(&shmem_swaplist_mutex); > > - > > - /* > > - * The extra refcount on the inode is necessary to safely dereference > > - * p->next after re-acquiring the lock. New shmem inodes with swap > > - * get added to the end of the list and we will scan them all. > > - */ > > list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) { > > if (!info->swapped) { > > list_del_init(&info->swaplist); > > continue; > > } > > - > > - inode = igrab(&info->vfs_inode); > > - if (!inode) > > - continue; > > - > > + /* > > + * Drop the swaplist mutex while searching the inode for > > swap; > > + * but before doing so, make sure shmem_evict_inode() will > > not > > + * remove placeholder inode from swaplist, nor let it be > > freed > > + * (igrab() would protect from unlink, but not from unmount). > > + */ > > + atomic_inc(&info->stop_eviction); > > mutex_unlock(&shmem_swaplist_mutex); > > - if (prev_inode) > > - iput(prev_inode); > > - prev_inode = inode; > This seems too ad hoc solution. I see what you mean by "ad hoc", but disagree with "too" ad hoc: it's an appropriate solution, and a general one - I didn't invent it for this, but for the huge tmpfs recoveries work items four years ago; just changed the name from "info->recoveries" to "info->stop_eviction" to let it be generalized to this swapoff case. I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has the nice (but admittedly not essential) property of letting swapoff proceed without delay and without unnecessary locking on unmounting filesystems and evicting inodes. (Would I prefer to use the s_umount technique for my recoveries case? I think not.) But yours should work too, with a slight change - see comments below, where I've inlined yours. I'd better get on and post my four fixes tomorrow, whether or not they fix Alex's case; then if people prefer yours to my 4/4, yours can be swapped in instead. > shmem: fix race between shmem_unuse and umount > > From: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> > > Function shmem_unuse could race with generic_shutdown_super. > Inode reference is not enough for preventing umount and freeing superblock. > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> > --- > mm/shmem.c | 24 +++++++++++++++++++++++- > 1 file changed, 23 insertions(+), 1 deletion(-) > > diff --git a/mm/shmem.c b/mm/shmem.c > index b3db3779a30a..2018a9a96bb7 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type, > return ret; > } > > +static void shmem_synchronize_umount(struct super_block *sb, void *arg) > +{ > +} > + I think this can go away, see below. > /* > * Read all the shared memory data that resides in the swap > * device 'type' back into memory, so the swap device can be > @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap, > struct shmem_inode_info *info, *next; > struct inode *inode; > struct inode *prev_inode = NULL; > + struct super_block *sb; > int error = 0; > > if (list_empty(&shmem_swaplist)) > @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap, > continue; > } > > + /* > + * Lock superblock to prevent umount and freeing it under us. > + * If umount in progress it will free swap enties. > + * > + * Must be done before grabbing inode reference, otherwise > + * generic_shutdown_super() will complain about busy inodes. > + */ > + sb = info->vfs_inode.i_sb; > + if (!trylock_super(sb)) Right, trylock important there. > + continue; > + > inode = igrab(&info->vfs_inode); > - if (!inode) > + if (!inode) { > + up_read(&sb->s_umount); Yes, that indeed avoids the difficulty I had with when to call deactivate_super(), that put me off trying to use s_active. > continue; > + } > > mutex_unlock(&shmem_swaplist_mutex); > if (prev_inode) > @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap, > > error = shmem_unuse_inode(inode, type, frontswap, > fs_pages_to_unuse); > + up_read(&sb->s_umount); No, not here. I think you have to note prev_sb, and then only up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise there's still a risk of "Self-destruct in 5 seconds", isn't there? > cond_resched(); > > mutex_lock(&shmem_swaplist_mutex); > @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap, > if (prev_inode) > iput(prev_inode); > > + /* Wait for umounts, this grabs s_umount for each superblock. */ > + iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL); > + I guess that's an attempt to compensate for the somewhat unsatisfactory trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove that in my 2/4). Nice idea, and if it had the effect of never needing to retry shmem_unuse(), I'd say yes; but since you're still passing over un-igrab()-able inodes without an equivalent synchronization, I think this odd iterate_supers_type() just delays swapoff without buying any guarantee: better just deleted to keep your patch simpler. > return error; > } > Thanks, Hugh