+ mm-fix-list-corruptions-on-shmem-shrinklist.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Fri, 04 Aug 2017 16:19:14 -0700

The patch titled
     Subject: mm: fix list corruptions on shmem shrinklist
has been added to the -mm tree.  Its filename is
     mm-fix-list-corruptions-on-shmem-shrinklist.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-fix-list-corruptions-on-shmem-shrinklist.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-fix-list-corruptions-on-shmem-shrinklist.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Cong Wang <xiyou.wangcong@xxxxxxxxx>
Subject: mm: fix list corruptions on shmem shrinklist

We saw many list corruption warnings on shmem shrinklist:

 [45480.300911] ------------[ cut here ]------------
 [45480.305558] WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
 [45480.313622] list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
 [45480.322435] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
 [45480.357776] CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
 [45480.365679] Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
 [45480.373416]  ffffb13c03ccbaf8 ffffffff9e36bc87 ffffb13c03ccbb48 0000000000000000
 [45480.380940]  ffffb13c03ccbb38 ffffffff9e08511b 0000003b7fffc000 0000000000000002
 [45480.388392]  ffff9ae5699ba960 ffffb13c03ccbbe8 ffffb13c03ccbbf8 ffff9ae5694b82d8
 [45480.395893] Call Trace:
 [45480.398214]  [<ffffffff9e36bc87>] dump_stack+0x4d/0x66
 [45480.403481]  [<ffffffff9e08511b>] __warn+0xcb/0xf0
 [45480.408322]  [<ffffffff9e08518f>] warn_slowpath_fmt+0x4f/0x60
 [45480.414095]  [<ffffffff9e38a6fe>] __list_del_entry+0x9e/0xc0
 [45480.419831]  [<ffffffff9e1a33aa>] shmem_unused_huge_shrink+0xfa/0x2e0
 [45480.426269]  [<ffffffff9e1a35b0>] shmem_unused_huge_scan+0x20/0x30
 [45480.432382]  [<ffffffff9e20a0d3>] super_cache_scan+0x193/0x1a0
 [45480.438238]  [<ffffffff9e19a9c3>] shrink_slab.part.41+0x1e3/0x3f0
 [45480.444370]  [<ffffffff9e19abf9>] shrink_slab+0x29/0x30
 [45480.449610]  [<ffffffff9e19ed39>] shrink_node+0xf9/0x2f0
 [45480.454858]  [<ffffffff9e19fbd8>] kswapd+0x2d8/0x6c0
 [45480.459896]  [<ffffffff9e19f900>] ? mem_cgroup_shrink_node+0x140/0x140
 [45480.466337]  [<ffffffff9e0a3b87>] kthread+0xd7/0xf0
 [45480.471231]  [<ffffffff9e0b519e>] ? vtime_account_idle+0xe/0x50
 [45480.477282]  [<ffffffff9e0a3ab0>] ? kthread_park+0x60/0x60
 [45480.482820]  [<ffffffff9e6d4c52>] ret_from_fork+0x22/0x30
 [45480.488234] ---[ end trace 66841eda03a967a0 ]---
 [45480.492834] ------------[ cut here ]------------
 [45480.497432] WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
 [45480.505020] list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
 [45480.516716] Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
 [45480.551020] CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
 [45480.560706] Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
 [45480.568299]  ffffb13c04913b30 ffffffff9e36bc87 ffffb13c04913b80 0000000000000000
 [45480.575628]  ffffb13c04913b70 ffffffff9e08511b 00000021699ba900 ffff9ae5694b82d8
 [45480.583080]  ffff9ae5694b82d8 ffff9ae5699ba960 ffff9ae5699ba900 0000000000000000
 [45480.590560] Call Trace:
 [45480.592937]  [<ffffffff9e36bc87>] dump_stack+0x4d/0x66
 [45480.598144]  [<ffffffff9e08511b>] __warn+0xcb/0xf0
 [45480.602978]  [<ffffffff9e08518f>] warn_slowpath_fmt+0x4f/0x60
 [45480.608718]  [<ffffffff9e38a639>] __list_add+0x89/0xb0
 [45480.613785]  [<ffffffff9e1a55d4>] shmem_setattr+0x204/0x230
 [45480.619340]  [<ffffffff9e2232ef>] notify_change+0x2ef/0x440
 [45480.624929]  [<ffffffff9e203bad>] do_truncate+0x5d/0x90
 [45480.630184]  [<ffffffff9e20393a>] ? do_dentry_open+0x27a/0x310
 [45480.635974]  [<ffffffff9e214101>] path_openat+0x331/0x1190
 [45480.641549]  [<ffffffff9e21680e>] do_filp_open+0x7e/0xe0
 [45480.646791]  [<ffffffff9e1bb3f4>] ? handle_mm_fault+0xa54/0x1340
 [45480.652888]  [<ffffffff9e1e6703>] ? kmem_cache_alloc+0xd3/0x1a0
 [45480.658778]  [<ffffffff9e215927>] ? getname_flags+0x37/0x190
 [45480.664527]  [<ffffffff9e22423f>] ? __alloc_fd+0x3f/0x170
 [45480.669918]  [<ffffffff9e205023>] do_sys_open+0x123/0x200
 [45480.675339]  [<ffffffff9e20511e>] SyS_open+0x1e/0x20
 [45480.680216]  [<ffffffff9e002aa1>] do_syscall_64+0x61/0x170
 [45480.685805]  [<ffffffff9e6d4ac6>] entry_SYSCALL64_slow_path+0x25/0x25
 [45480.692255] ---[ end trace 66841eda03a967a1 ]---
 [45480.696823] ------------[ cut here ]------------

The problem is that shmem_unused_huge_shrink() moves entries from the
global sbinfo->shrinklist to its local lists and then releases the
spinlock.  However, a parallel shmem_setattr() could access one of these
entries directly and add it back to the global shrinklist if it is
removed, with the spinlock held.

The logic itself looks solid since an entry could be either in a local
list or the global list, otherwise it is removed from one of them by
list_del_init().  So probably the race condition is that, one CPU is in
the middle of INIT_LIST_HEAD() but the other CPU calls list_empty() which
returns true too early then the following list_add_tail() sees a corrupted
entry.

list_empty_careful() is designed to fix this situation.

Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@xxxxxxxxx
Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Cong Wang <xiyou.wangcong@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/shmem.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN mm/shmem.c~mm-fix-list-corruptions-on-shmem-shrinklist mm/shmem.c

--- a/mm/shmem.c~mm-fix-list-corruptions-on-shmem-shrinklist
+++ a/mm/shmem.c
@@ -1022,7 +1022,7 @@ static int shmem_setattr(struct dentry *
 			 */
 			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
 				spin_lock(&sbinfo->shrinklist_lock);
-				if (list_empty(&info->shrinklist)) {
+				if (list_empty_careful(&info->shrinklist)) {
 					list_add_tail(&info->shrinklist,
 							&sbinfo->shrinklist);
 					sbinfo->shrinklist_len++;
@@ -1817,7 +1817,7 @@ alloc_nohuge:		page = shmem_alloc_and_ac
 			 * to shrink under memory pressure.
 			 */
 			spin_lock(&sbinfo->shrinklist_lock);
-			if (list_empty(&info->shrinklist)) {
+			if (list_empty_careful(&info->shrinklist)) {
 				list_add_tail(&info->shrinklist,
 						&sbinfo->shrinklist);
 				sbinfo->shrinklist_len++;
_

Patches currently in -mm which might be from xiyou.wangcong@xxxxxxxxx are

mm-fix-list-corruptions-on-shmem-shrinklist.patch