[PATCH for v4.9] fs: don't scan the inode cache before SB_BORN is set

Aaron Lu <aaron.lu@xxxxxxxxxxxxxxxxx> · Mon, 28 Jan 2019 21:20:45 +0800

One of our servers recently hit a kernel crash and the callstack is:

[6469391.997662] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[6469392.005693] IP: [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.012412] PGD 1000c21067
[6469392.015203] PUD ffc306067
[6469392.018089] PMD 0
[6469392.018627]
[6469392.020303] Oops: 0000 [#1] SMP
[6469392.023621] Modules linked in: kpatch_6iljwh9b(OE) memcg_force_swapin(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd auth_rpcgss nfs_acl [last unloaded: memcg_force_swapin]
[6469392.040177] CPU: 2 PID: 89058 Comm: ilogtail Tainted: G           OE K 4.9.93-010.ali3000.alios7.x86_64 #1
[6469392.049996] Hardware name: Inventec     K900-1G                         /B900G2-1G       , BIOS A2.32 10/09/2014
[6469392.060334] task: ffff8802217b1800 task.stack: ffffc9004ea88000
[6469392.066418] RIP: 0010:[<ffffffff811cad80>]  [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.075563] RSP: 0018:ffffc9004ea8b6c0  EFLAGS: 00010282
[6469392.081041] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000001
[6469392.088339] RDX: 0000000000000001 RSI: ffffc9004ea8b780 RDI: ffff881749bd2000
[6469392.095635] RBP: ffffc9004ea8b6c0 R08: 28f5c28f5c28f5c3 R09: ffff88173bf3fce0
[6469392.102934] R10: ffff88207ffd4000 R11: 0000000000000000 R12: ffff881749bd24c0
[6469392.110233] R13: ffffc9004ea8b780 R14: 0000000000000000 R15: ffff88207ffd4000
[6469392.117533] FS:  00007fe260420700(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
[6469392.125792] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6469392.131703] CR2: 0000000000000070 CR3: 00000005bb46d000 CR4: 00000000001606f0
[6469392.138999] Stack:
[6469392.141185]  ffffc9004ea8b6f0 ffffffff81247bee 0000000000000020 0000000000000400
[6469392.148811]  ffff881749bd24c0 0000000000000000 ffffc9004ea8b7d0 ffffffff811c431c
[6469392.156436]  0000000000000020 0000000000000000 ffff88207b82c000 0000000000000001
[6469392.164063] Call Trace:
[6469392.166692]  [<ffffffff81247bee>] super_cache_count+0x3e/0xe0
[6469392.172607]  [<ffffffff811c431c>] shrink_slab.part.38+0x11c/0x420
[6469392.178875]  [<ffffffff811c4649>] shrink_slab+0x29/0x30
[6469392.184273]  [<ffffffff811c93cf>] shrink_node+0xff/0x300
[6469392.189756]  [<ffffffff811c96dd>] do_try_to_free_pages+0x10d/0x330
[6469392.196104]  [<ffffffff811c9b65>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[6469392.203063]  [<ffffffff81230b5d>] try_charge+0x14d/0x720
[6469392.208551]  [<ffffffff8121b8e3>] ? kmem_cache_alloc+0xd3/0x1a0
[6469392.214642]  [<ffffffff811b14e5>] ? mempool_alloc_slab+0x15/0x20
[6469392.220825]  [<ffffffff81235b4e>] mem_cgroup_try_charge+0x6e/0x1b0
[6469392.227177]  [<ffffffff811ae174>] __add_to_page_cache_locked+0x64/0x220
[6469392.233961]  [<ffffffff811ae39e>] add_to_page_cache_lru+0x4e/0xe0
[6469392.240242]  [<ffffffffa03ce2d1>] ext4_mpage_readpages+0x151/0x980 [ext4]
[6469392.247211]  [<ffffffffa037edb5>] ext4_readpages+0x35/0x40 [ext4]
[6469392.253474]  [<ffffffff811be9e7>] __do_page_cache_readahead+0x197/0x240
[6469392.260260]  [<ffffffff811ae45c>] ? pagecache_get_page+0x2c/0x2a0
[6469392.266523]  [<ffffffff811b0f4b>] filemap_fault+0x4db/0x590
[6469392.272282]  [<ffffffffa0388fd6>] ext4_filemap_fault+0x36/0x50 [ext4]
[6469392.278896]  [<ffffffff811e4a90>] __do_fault+0x80/0x170
[6469392.284292]  [<ffffffff811e87b2>] do_fault+0x4c2/0x720
[6469392.289603]  [<ffffffff8111513f>] ? futex_wait_queue_me+0x9f/0x120
[6469392.295954]  [<ffffffff811e9162>] handle_mm_fault+0x512/0xc90
[6469392.301874]  [<ffffffff8106eb8b>] __do_page_fault+0x24b/0x4d0
[6469392.307796]  [<ffffffff811184c5>] ? SyS_futex+0x85/0x170
[6469392.313280]  [<ffffffff8106ee40>] do_page_fault+0x30/0x80
[6469392.318850]  [<ffffffff81003bf4>] ? do_syscall_64+0x74/0x180
[6469392.324679]  [<ffffffff81722b68>] page_fault+0x28/0x30
[6469392.329986] Code: 00 48 83 43 38 01 4c 89 e7 c6 <48> 8b 40 70 5d c3 66 2e 0f 1f 84
[6469392.338183] RIP  [<ffffffff811cad80>] shmem_unused_huge_count+0x10/0x20
[6469392.344990]  RSP <ffffc9004ea8b6c0>
[6469392.348656] CR2: 0000000000000070

A search turned up Dave Chinner's fix, and I think it is the right fix
for our problem. The crash is hard to reproduce in our production
environment, so I have not been able to confirm this.

Unfortunately, this commit was only backported to the v4.14 and v4.16
stable kernels, not to v4.9 stable, presumably because MS_BORN was
renamed to SB_BORN starting from v4.14. To make this patch work on v4.9,
I have made one minor change to Dave's commit: it keeps using MS_BORN.
I think this is correct, but since I know very little about fs code,
please kindly review. Thanks a lot for your time.

From 5cdf1679c9120a173a2bc9dff214332e99f741bc Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri, 11 May 2018 11:20:57 +1000
Subject: [PATCH] fs: don't scan the inode cache before SB_BORN is set

commit 79f546a696bff2590169fb5684e23d65f4d9f591 upstream.

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.  It produces
an oops down this path during the failed mount:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up, i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does its setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.

To fix this, add a check for the SB_BORN flag in super_cache_count().
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will prevent the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.

Cc: stable@xxxxxxxxxx
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxxxxxxxxxx>
---
 fs/super.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 7e9beab77259..abe2541fb28c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -119,13 +119,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*
-	 * Don't call trylock_super as it is a potential
-	 * scalability bottleneck. The counts could get updated
-	 * between super_cache_count and super_cache_scan anyway.
-	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_shrink_count() and
-	 * s_op->nr_cached_objects().
+	 * We don't call trylock_super() here as it is a scalability bottleneck,
+	 * so we're exposed to partial setup state. The shrinker rwsem does not
+	 * protect filesystem operations backing list_lru_shrink_count() or
+	 * s_op->nr_cached_objects(). Counts can change between
+	 * super_cache_count and super_cache_scan, so we really don't need locks
+	 * here.
+	 *
+	 * However, if we are currently mounting the superblock, the underlying
+	 * filesystem might be in a state of partial construction and hence it
+	 * is dangerous to access it.  trylock_super() uses a MS_BORN check to
+	 * avoid this situation, so do the same here. The memory barrier is
+	 * matched with the one in mount_fs() as we don't hold locks here.
 	 */
+	if (!(sb->s_flags & MS_BORN))
+		return 0;
+	smp_rmb();
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
@@ -1193,6 +1203,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	sb = root->d_sb;
 	BUG_ON(!sb);
 	WARN_ON(!sb->s_bdi);
+
+	/*
+	 * Write barrier is for super_cache_count(). We place it before setting
+	 * MS_BORN as the data dependency between the two functions is the
+	 * superblock structure contents that we just set up, not the MS_BORN
+	 * flag.
+	 */
+	smp_wmb();
 	sb->s_flags |= MS_BORN;
 
 	error = security_sb_kern_mount(sb, flags, secdata);
-- 
2.19.1.3.ge56e4f7