On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote: > On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote: > > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote: > > > Hi, > > > > > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote: > > > >> Thanks Dave, > > > >> > > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI > > > >> modules loaded (virtio block) so there's something else going on in the > > > >> current merge window. I'll keep an eye on it and make sure there's > > > >> nothing iSCSI needs fixing for. > > > > > > > > OK, so before this slips through the cracks..... > > > > > > > > Linus - your tree as of a few minutes ago still panics immediately > > > > when starting xfstests on iscsi devices. It appears to be a > > > > scatterlist corruption and not an iscsi problem, so the iscsi guys > > > > seem to have bounced it and no-one is looking at it. > > > > > > Hmm. There's not much to go by. > > > > > > Can somebody in iscsi-land please try to just bisect it - I'm not > > > seeing a lot of clues to where this comes from otherwise. > > > > Yeah, my hopes of this being quickly resolved by someone else didn't > > work out and whatever is going on in that test VM is looking like a > > different kind of odd. I'm saving that off for later, and seeing if I > > can't be a bisect on the iSCSI issue. > > There may be deeper issues. I just started running scalability tests > (e.g. 16-way fsmark create tests) and about a minute in I got a > directory corruption reported - something I hadn't seen in the dev > cycle at all. I unmounted the fs, mkfs'd it again, ran the > workload again and about a minute in this fired: > > [628867.607417] ------------[ cut here ]------------ > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220 > [628867.610702] Modules linked in: > [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18 > [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 > [628867.616179] Workqueue: events rht_deferred_worker > [628867.632422] Call Trace: > [628867.634691] dump_stack+0x63/0x83 > [628867.637937] __warn+0xcb/0xf0 > [628867.641359] warn_slowpath_null+0x1d/0x20 > [628867.643362] shadow_lru_isolate+0x171/0x220 > [628867.644627] __list_lru_walk_one.isra.11+0x79/0x110 > [628867.645780] ? __list_lru_init+0x70/0x70 > [628867.646628] list_lru_walk_one+0x17/0x20 > [628867.647488] scan_shadow_nodes+0x34/0x50 > [628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410 > [628867.649506] shrink_node+0x57/0x90 > [628867.650233] do_try_to_free_pages+0xdd/0x230 > [628867.651157] try_to_free_pages+0xce/0x1a0 > [628867.652342] __alloc_pages_slowpath+0x2df/0x960 > [628867.653332] ? __might_sleep+0x4a/0x80 > [628867.654148] __alloc_pages_nodemask+0x24b/0x290 > [628867.655237] kmalloc_order+0x21/0x50 > [628867.656016] kmalloc_order_trace+0x24/0xc0 > [628867.656878] __kmalloc+0x17d/0x1d0 > [628867.657644] bucket_table_alloc+0x195/0x1d0 > [628867.658564] ? __might_sleep+0x4a/0x80 > [628867.659449] rht_deferred_worker+0x287/0x3c0 > [628867.660366] ? _raw_spin_unlock_irq+0xe/0x30 > [628867.661294] process_one_work+0x1de/0x4d0 > [628867.662208] worker_thread+0x4b/0x4f0 > [628867.662990] kthread+0x10c/0x140 > [628867.663687] ? process_one_work+0x4d0/0x4d0 > [628867.664564] ? kthread_create_on_node+0x40/0x40 > [628867.665523] ret_from_fork+0x25/0x30 > [628867.666317] ---[ end trace 7c38634006a9955e ]--- > > Now, this workload does not touch the page cache at all - it's > entirely an XFS metadata workload, so it should not really be > affecting the working set code. The system back up, and I haven't reproduced this problem yet. However, benchmark results are way off where they should be, and at times the performance is utterly abysmal. The XFS for-next tree based on the 4.9 kernel shows none of these problems, so I don't think there's an XFS problem here. Workload is the same 16-way fsmark workload that I've been using for years as a performance regression test. The workload normally averages around 230k files/s - i'm seeing and average of ~175k files/s on you current kernel. And there are periods where performance just completely tanks: # ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15 # Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016 # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) # Files info: size 0 bytes, written with an IO size of 16384 bytes per write # App overhead is time in microseconds spent in the test not doing file writing related system calls. FSUse% Count Size Files/sec App Overhead 0 1600000 0 256964.5 10696017 0 3200000 0 244151.9 12129380 0 4800000 0 239929.1 11724413 0 6400000 0 235646.5 11998954 0 8000000 0 162537.3 15027053 0 9600000 0 186597.4 17957243 ..... 0 43200000 0 184972.6 27911543 0 44800000 0 141642.2 62862142 0 46400000 0 137700.8 39018008 0 48000000 0 92155.0 38277234 0 49600000 0 57053.8 27810215 0 51200000 0 178941.6 33321543 This sort of thing is normally indicative of a memory reclaim or lock contention problem. Profile showed unusual spinlock contention, but then I realised there was only one kswapd thread running. Yup, sure enough, it's caused by a major change in memory reclaim behaviour: [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] [ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] [ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff] [ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff] [ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff] [ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff] [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff] the numa=fake=4 CLI option is broken. And, just to pile another one in - there's a massive xfs_repair performance regression, too. It normally takes 4m30s to run over the 50 million inode filesystem created by fsmark. on 4.10: XFS_REPAIR Summary Thu Dec 22 16:43:58 2016 Phase Start End Duration Phase 1: 12/22 16:35:36 12/22 16:35:38 2 seconds Phase 2: 12/22 16:35:38 12/22 16:35:40 2 seconds Phase 3: 12/22 16:35:40 12/22 16:39:00 3 minutes, 20 seconds Phase 4: 12/22 16:39:00 12/22 16:40:40 1 minute, 40 seconds Phase 5: 12/22 16:40:40 12/22 16:40:40 Phase 6: 12/22 16:40:40 12/22 16:43:57 3 minutes, 17 seconds Phase 7: 12/22 16:43:57 12/22 16:43:57 Total run time: 8 minutes, 21 seconds So, yeah, there's lots of bad stuff I'm tripping over right now. This is all now somebody els's problem to deal with - by this time tomorrow I'm outta here and won't be back for several months.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html