On Fri, Aug 05, 2011 at 08:30:44PM +0800, Xiaotian Feng wrote:
> On Fri, Aug 5, 2011 at 8:09 PM, Xiaotian Feng <xtfeng@xxxxxxxxx> wrote:
> > On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@xxxxxxx> wrote:
> >> (Adding patch author to cc)
> >>
> >> On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
> >>> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@xxxxxxxxx> wrote:
> >>> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@xxxxxxx> wrote:
> >>> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> >>> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@xxxxxxxxx> wrote:
> >>> >>> >
> >>> >>> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >>> >>> >> was trying to build my kernel. The photo of crash screen and my config
> >>> >>> >> is attached.
> >>> >>> >
> >>> >>> > hm, now why has that started happening?
> >>> >>> >
> >>> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >>> >>> >
> >>> >>>
> >>> >>> I will try it then, but it isn't very reproducible :(
> >>> >>> But my system hung after some list corruption warnings... I hit the
> >>> >>> corruption 4 times...
> >>> >>>
> >>> >>
> >>> >> That is very unexpected but if lists are being corrupted, it could
> >>> >> explain the previously reported bug as that bug looked like an active
> >>> >> page on an inactive list.
> >>> >>
> >>> >> What was the last working kernel? Can you bisect?
> >>> >>
> >>> >>> [ 1220.468089] ------------[ cut here ]------------
> >>> >>> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
> >>> >>> [ 1220.468102] Hardware name: 42424XC
> >>> >>> [ 1220.468104] list_del corruption. next->prev should be
> >>> >>> ffffea0000e069a0, but was ffff880100216c78
> >>> >>> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> >>> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> >>> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> >>> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> >>> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> >>> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> >>> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> >>> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> >>> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> >>> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> >>> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> >>> >>> sdhci_pci sdhci crc_itu_t
> >>> >>> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
> >>> >>> [ 1220.468188] Call Trace:
> >>> >>> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
> >>> >>> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
> >>> >>> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
> >>> >>> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
> >>> >>> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
> >>> >>> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
> >>> >>> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> >>> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
> >>> >>> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
> >>> >>> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
> >>> >>> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
> >>> >>>
> >>> >>
> >>> >
> >>> > I'm hitting this again today, when I'm trying to rebuild my kernel....
> >>> > Looking it a bit
> >>> >
> >>> > list_del corruption. next->prev should be ffffea0000e069a0, but was
> >>> > ffff880100216c78
> >>> >
> >>> > I find something interesting from my syslog:
> >>> >
> >>> > PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
> >>> >
> >>> >> This warning and the page reclaim warning are on paths that are
> >>> >> commonly used and I would expect to see multiple reports. I wonder
> >>> >> what is happening on your machine that is so unusual.
> >>> >>
> >>> >> Have you run memtest on this machine for a few hours and badblocks
> >>> >> on the disk to ensure this is not hardware trouble?
> >>> >>
> >>> >>> So is it possible that my previous BUG is triggered by slab list corruption?
> >>> >>
> >>> >> Not directly, but clearly there is something very wrong.
> >>> >>
> >>> >> If slub corruption reports are very common and kernel 3.0 is fine, my
> >>> >> strongest candidate for the corruption would be the SLUB lockless
> >>> >> patches. Try
> >>> >>
> >>> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
> >>> >>
> >>> >
> >>>
> >>> Here's a update for the results:
> >>>
> >>> 3.0.0-rc7: running for hours without a crash
> >>> upstream kernel: list corruption happened while building kernel within
> >>> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
> >>> upstream kernel with above revert: running for hours without a crash
> >>>
> >>> Trying to bisect but rebuild is slow ....
> >>>
> >>
> >> If you have not done so already, I strongly suggest your bisection
> >> starts within that range of patches to isolate which one is at fault.
> >> It'll cut down on the number of builds you need to do. Thanks for
> >> testing.
> >>
> >
> > This is interesting, I just change as following:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index eb5a8f9..616b78e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> > gfp_t gfpflags, int node,
> >                                                 "__slab_alloc"));
> >
> >         if (unlikely(!object)) {
> > -               c->page = NULL;
> > +               //c->page = NULL;
> >                 stat(s, DEACTIVATE_BYPASS);
> > +               deactivate_slab(s, c);
> >                 goto new_slab;
> >         }
> >
> > Then my system doesn't print any list corruption warnings and my build
> > success then. So this means revert of 03e404af2 could cure this.
> > I'll do more test next week to see if the list corruption still exist, thanks.
> >
>
> Sorry, please ignore it... My system corrupted before I went to leave ....
>

Please continue the bisection in that case and establish for sure if
the problem is in that series or not. Thanks.

-- 
Mel Gorman
SUSE Labs
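
A note on the PERCPU line quoted in the thread: the arithmetic behind that
observation can be sketched in a few lines of C. This is only an illustration
under stated assumptions, not kernel code. It reads the boot-log fields as the
base address (@) followed by the static (s), reserved (r), dynamic (d) and
unit (u) sizes in bytes, and it assumes that the first per-cpu unit starts at
the base address, which the log itself does not confirm. The names
percpu_layout, percpu_region, classify and thread_layout are hypothetical
helpers introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types for illustration; these are not kernel APIs. */
enum percpu_region {
	REGION_OUTSIDE,   /* address is not inside the unit at all */
	REGION_STATIC,    /* within the first 's' bytes */
	REGION_RESERVED,  /* within the next 'r' bytes */
	REGION_DYNAMIC,   /* within the next 'd' bytes */
	REGION_PADDING    /* inside the unit but past s + r + d */
};

struct percpu_layout {
	uint64_t base;        /* '@' value from the boot log */
	uint64_t static_sz;   /* 's' */
	uint64_t reserved_sz; /* 'r' */
	uint64_t dyn_sz;      /* 'd' */
	uint64_t unit_sz;     /* 'u' */
};

/* Values copied from the PERCPU line quoted in the thread. */
static const struct percpu_layout thread_layout = {
	0xffff880100200000ULL, 83456, 8192, 23040, 262144
};

/*
 * Classify an address against the first per-cpu unit.
 * Assumption: unit 0 begins exactly at layout->base.
 */
static enum percpu_region classify(const struct percpu_layout *l, uint64_t addr)
{
	uint64_t off;

	/* The three areas must fit inside one unit. */
	assert(l->static_sz + l->reserved_sz + l->dyn_sz <= l->unit_sz);

	if (addr < l->base || addr - l->base >= l->unit_sz)
		return REGION_OUTSIDE;

	off = addr - l->base;
	if (off < l->static_sz)
		return REGION_STATIC;
	if (off < l->static_sz + l->reserved_sz)
		return REGION_RESERVED;
	if (off < l->static_sz + l->reserved_sz + l->dyn_sz)
		return REGION_DYNAMIC;
	return REGION_PADDING;
}
```

With the values from the thread, classify(&thread_layout,
0xffff880100216c78ULL) returns REGION_DYNAMIC: the offset is 0x16c78 (93304
bytes), which is past s + r = 91648 and below s + r + d = 114688. Under the
assumptions above, the unexpected next->prev value would therefore point into
the dynamically allocated part of the per-cpu area. The thread does not
establish which per-cpu object lives at that offset.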