On Fri, 22 Apr 2011, Konstantin Khlebnikov wrote: > Andrew Morton wrote: > > On Thu, 21 Apr 2011 10:41:50 +0400 > > Konstantin Khlebnikov<khlebnikov@xxxxxxxxxx> wrote: > > > > > shmem_writepage() call igrab() on the inode for the page which is came > > > from > > > reclaimer to add it later into shmem_swaplist for swap-unuse operation. > > > > > > This igrab() can race with super-block deactivating process: > > > > > > shrink_inactive_list() deactivate_super() > > > pageout() tmpfs_fs_type->kill_sb() > > > shmem_writepage() kill_litter_super() > > > generic_shutdown_super() > > > evict_inodes() > > > igrab() > > > atomic_read(&inode->i_count) > > > skip-inode > > > iput() > > > if (!list_empty(&sb->s_inodes)) > > > printk("VFS: Busy inodes after... > > > > > > This igrap-iput pair was added in commit 1b1b32f2c6f6bb3253 > > > based on incorrect assumptions: Konstantin, many thanks for discovering this issue, and please accept my apology for being so slow to respond. I have to find enough "cool" time to think my way back into all the races which may occur here. I am disappointed with igrab! It appeared to be the tool I needed when I added that in, when I was concerned with deletion racing with writepage and swapoff. Clearly I didn't research igrab enough: its interface seemed right, and I never imagined it gave no equivalent protection against unmount - well, it does protect the inode, but you have to pay for that with a "Self-destruct in 5 seconds" message. I'm surprised that nothing else has such a problem with igrab, but must accept I got it wrong, and I'm using it where it's not usable. And it got more obviously unusable with 2.6.37's 63997e98a3be which removed the second "safety" call to invalidate/evict_inodes() from generic_shutdown_super(). > > > > > > : Ah, I'd never suspected it, but shmem_writepage's swaplist > > > manipulation > > > : is unsafe: though still hold page lock, which would hold off inode > > > : deletion if the page were i pagecache, it doesn't hold off once it's > > > in > > > : swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we > > > : could put the the inode on swaplist earlier, but then > > > shmem_unuse_inode > > > : could never prune unswapped inodes. > > > > > > Attached locked page actually protect inode from deletion because > > > truncate_inode_pages_range() will sleep on this, so igrab not required. > > > This patch actually revert last hunk from that commit. > > > > > > > hm, is that last paragraph true? Let's look at the resulting code. > > > > > > : if (swap.val&& add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) { > > : delete_from_page_cache(page); > > > > Here, the page is removed from inode->i_mapping. So > > truncate_inode_pages() won't see that page and will not block on its > > lock. > > Oops, right. Sorry. It produce use-after-free race, but it is quiet and > small. > My test is using too few files to catch it in a reasonable time, > and I ran it without slab poisoning. > > So, v1 patch is correct but little ugly, while v2 -- broken. Yes. But even if the v1 patch is correct so far as it goes (and I've not spent much time considering it, being disapppointed enough with igrab that I'd rather get away from it and solve the races internally than rely further upon VFS constructs here), it must be incomplete since the same problem will apply to the igrab in shmem_unuse_inode too. > > > > > : shmem_swp_set(info, entry, swap.val); > > : shmem_swp_unmap(entry); > > : spin_unlock(&info->lock); > > : if (list_empty(&info->swaplist)) { > > : mutex_lock(&shmem_swaplist_mutex); > > : /* move instead of add in case we're racing */ > > : list_move_tail(&info->swaplist,&shmem_swaplist); > > : mutex_unlock(&shmem_swaplist_mutex); > > : } > > > > Here, the code plays with `info', which points at storage which is > > embedded within the inode's filesystem-private part. > > > > But because the inode now has no attached locked page, a concurrent > > umount can free the inode while this code is using it. > > I guess we can try to put delete_from_page_cache(page); right before > swap_writepage > but it move it outside info->lock... You're right to be wary, we do need to do all the swizzling within info->lock. > > > > > : swap_shmem_alloc(swap); > > : BUG_ON(page_mapped(page)); > > : swap_writepage(page, wbc); > > : return 0; > > : } > > > > However, I assume that you reran your testcase with the v2 patch and > > that things ran OK. How come? Either my analysis is wrong or the > > testcase doesn't trigger races in this code path? Here's the patch I was testing last night, but I do want to test it some more (I've not even tried your unmounting case yet), and I do want to make some changes to it (some comments, and see if I can move the mem_cgroup_cache_charge outside of the mutex, making it GFP_KERNEL rather than GFP_NOFS - at the time that mem_cgroup charging went in, we did not know here if it was actually a shmem swap page, whereas nowadays we can be sure, since that's noted in the swap_map). In shmem_unuse_inode I'm widening the shmem_swaplist_mutex to protect against shmem_evict_inode; and in shmem_writepage adding to the list earlier, while holding lock on page still in pagecache to protect it. But testing last night showed corruption on this laptop (no problem on other machines): I'm guessing it's unrelated, but I can't be sure of that without more extended testing. Hugh --- 2.6.39-rc5/mm/shmem.c 2011-04-28 09:52:49.066135001 -0700 +++ linux/mm/shmem.c 2011-05-02 21:02:21.745633214 -0700 @@ -852,7 +852,7 @@ static inline int shmem_find_swp(swp_ent static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page) { - struct inode *inode; + struct address_space *mapping; unsigned long idx; unsigned long size; unsigned long limit; @@ -928,7 +928,7 @@ lost2: return 0; found: idx += offset; - inode = igrab(&info->vfs_inode); + mapping = info->vfs_inode.i_mapping; spin_unlock(&info->lock); /* @@ -940,20 +940,16 @@ found: */ if (shmem_swaplist.next != &info->swaplist) list_move_tail(&shmem_swaplist, &info->swaplist); - mutex_unlock(&shmem_swaplist_mutex); - error = 1; - if (!inode) - goto out; /* - * Charge page using GFP_KERNEL while we can wait. + * Charge page using GFP_NOFS while we can wait. * Charged back to the user(not to caller) when swap account is used. * add_to_page_cache() will be called with GFP_NOWAIT. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_NOFS); if (error) goto out; - error = radix_tree_preload(GFP_KERNEL); + error = radix_tree_preload(GFP_NOFS); if (error) { mem_cgroup_uncharge_cache_page(page); goto out; @@ -963,14 +959,14 @@ found: spin_lock(&info->lock); ptr = shmem_swp_entry(info, idx, NULL); if (ptr && ptr->val == entry.val) { - error = add_to_page_cache_locked(page, inode->i_mapping, + error = add_to_page_cache_locked(page, mapping, idx, GFP_NOWAIT); /* does mem_cgroup_uncharge_cache_page on error */ } else /* we must compensate for our precharge above */ mem_cgroup_uncharge_cache_page(page); if (error == -EEXIST) { - struct page *filepage = find_get_page(inode->i_mapping, idx); + struct page *filepage = find_get_page(mapping, idx); error = 1; if (filepage) { /* @@ -995,9 +991,6 @@ found: spin_unlock(&info->lock); radix_tree_preload_end(); out: - unlock_page(page); - page_cache_release(page); - iput(inode); /* allows for NULL */ return error; } @@ -1016,7 +1009,7 @@ int shmem_unuse(swp_entry_t entry, struc found = shmem_unuse_inode(info, entry, page); cond_resched(); if (found) - goto out; + break; } mutex_unlock(&shmem_swaplist_mutex); /* @@ -1025,7 +1018,6 @@ int shmem_unuse(swp_entry_t entry, struc */ unlock_page(page); page_cache_release(page); -out: return (found < 0) ? found : 0; } @@ -1039,6 +1031,7 @@ static int shmem_writepage(struct page * struct address_space *mapping; unsigned long index; struct inode *inode; + bool unlock_mutex = false; BUG_ON(!PageLocked(page)); mapping = page->mapping; @@ -1064,7 +1057,17 @@ static int shmem_writepage(struct page * else swap.val = 0; + if (swap.val && list_empty(&info->swaplist)) { + mutex_lock(&shmem_swaplist_mutex); + /* move instead of add in case we're racing */ + list_move_tail(&info->swaplist, &shmem_swaplist); + unlock_mutex = true; + } + spin_lock(&info->lock); + if (unlock_mutex) + mutex_unlock(&shmem_swaplist_mutex); + if (index >= info->next_index) { BUG_ON(!(info->flags & SHMEM_TRUNCATE)); goto unlock; @@ -1084,21 +1087,10 @@ static int shmem_writepage(struct page * delete_from_page_cache(page); shmem_swp_set(info, entry, swap.val); shmem_swp_unmap(entry); - if (list_empty(&info->swaplist)) - inode = igrab(inode); - else - inode = NULL; spin_unlock(&info->lock); swap_shmem_alloc(swap); BUG_ON(page_mapped(page)); swap_writepage(page, wbc); - if (inode) { - mutex_lock(&shmem_swaplist_mutex); - /* move instead of add in case we're racing */ - list_move_tail(&info->swaplist, &shmem_swaplist); - mutex_unlock(&shmem_swaplist_mutex); - iput(inode); - } return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>