On Fri, Aug 10, 2012 at 11:20:07AM -0600, Jim Schutt wrote: > On 08/10/2012 05:02 AM, Mel Gorman wrote: > >On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote: > > >>> > >>>Ok, this is an untested hack and I expect it would drop allocation success > >>>rates again under load (but not as much). Can you test again and see what > >>>effect, if any, it has please? > >>> > >>>---8<--- > >>>mm: compaction: back out if contended > >>> > >>>--- > >> > >><snip> > >> > >>Initial testing with this patch looks very good from > >>my perspective; CPU utilization stays reasonable, > >>write-out rate stays high, no signs of stress. > >>Here's an example after ~10 minutes under my test load: > >> > > Hmmm, I wonder if I should have tested this patch longer, > in view of the trouble I ran into testing the new patch? > See below. > The two patches are quite different in what they do. I think it's unlikely they would share a common bug. > > <SNIP> > >---8<--- > >mm: compaction: Abort async compaction if locks are contended or taking too long > > > Hmmm, while testing this patch, a couple of my servers got > stuck after ~30 minutes or so, like this: > > [ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds. > [ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 2515.884447] ceph-osd D 0000000000000000 0 30375 1 0x00000000 > [ 2515.891531] ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8 > [ 2515.899013] ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000 > [ 2515.906482] ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500 > [ 2515.913968] Call Trace: > [ 2515.916433] [<ffffffff8147fded>] schedule+0x5d/0x60 > [ 2515.921417] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140 > [ 2515.927938] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20 > [ 2515.934195] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20 > [ 2515.940934] [<ffffffff8147edc5>] ? down_write+0x45/0x50 > [ 2515.946244] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240 > [ 2515.951640] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b > <SNIP> > > I tried to capture a perf trace while this was going on, but it > never completed. "ps" on this system reports lots of kernel threads > and some user-space stuff, but hangs part way through - no ceph > executables in the output, oddly. > ps is probably locking up because it's trying to access a proc file for a process that is not releasing the mmap_sem. > I can retest your earlier patch for a longer period, to > see if it does the same thing, or I can do some other thing > if you tell me what it is. > > Also, FWIW I sorted a little through SysRq-T output from such > a system; these bits looked interesting: > > [ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17} (t=60000 jiffies) > [ 3663.685099] sending NMI to all CPUs: > [ 3663.685101] NMI backtrace for cpu 0 > [ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan] > [ 3663.685138] > [ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH > [ 3663.685142] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60 > [ 3663.685148] RSP: 0018:ffff880a08191898 EFLAGS: 00000012 > [ 3663.685149] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c5 > [ 3663.685149] RDX: 00000000000000bf RSI: 000000000000015a RDI: ffff88063fffcb00 > [ 3663.685150] RBP: ffff880a081918a8 R08: 0000000000000000 R09: 0000000000000000 > [ 3663.685151] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246 > [ 3663.685152] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680 > [ 3663.685153] FS: 00007fff90ae0700(0000) GS:ffff880627c00000(0000) knlGS:0000000000000000 > [ 3663.685154] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 3663.685155] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007f0 > [ 3663.685156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 3663.685157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 3663.685158] Process ceph-osd (pid: 100027, threadinfo ffff880a08190000, task ffff880a9a29ae00) > [ 3663.685158] Stack: > [ 3663.685159] 000000000000130a 0000000000000000 ffff880a08191948 ffffffff8111a760 > [ 3663.685162] ffffffff81a13420 0000000000000009 ffffea000004c240 0000000000000000 > [ 3663.685165] ffff88063fffcba0 000000003fffcb98 ffff880a08191a18 0000000000001600 > [ 3663.685168] Call Trace: > [ 3663.685169] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0 > [ 3663.685173] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330 > [ 3663.685175] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0 > [ 3663.685178] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100 > [ 3663.685180] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110 > [ 3663.685182] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250 > [ 3663.685187] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750 > [ 3663.685190] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0 > [ 3663.685192] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150 > [ 3663.685195] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0 > [ 3663.685199] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320 > [ 3663.685202] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30 > [ 3663.685205] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0 > [ 3663.685208] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c > [ 3663.685211] [<ffffffff8148154f>] page_fault+0x1f/0x30 I went through the patch again but only found the following which is a weak candidate. Still, can you retest with the following patch on top and CONFIG_PROVE_LOCKING set please? ---8<--- diff --git a/mm/compaction.c b/mm/compaction.c index 1827d9a..d4a51c6 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -64,7 +64,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, { if (need_resched() || spin_is_contended(lock)) { if (locked) { - spin_unlock_irq(lock); + spin_unlock_irqrestore(lock, *flags); locked = false; } @@ -276,8 +276,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc) list_for_each_entry(page, &cc->migratepages, lru) count[!!page_is_file_cache(page)]++; - __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]); - __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]); + mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]); + mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]); } /* Similar to reclaim, but different enough that they don't share logic */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>