On Tue, Feb 06, 2018 at 08:47:55AM +0500, mikhail wrote:
> On Wed, 2018-01-31 at 13:22 +1100, Dave Chinner wrote:
> > 
> > Could be a disk that is slow, or could be many other
> > things. More information required:
> > 
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> > 
> > And everything else is backed up behind it trying to allocate
> > inodes. There could be many, many reasons for that, and that's why
> > we need more information to begin to isolate the cause.
> > 
> > Cheers,
> > 
> > Dave.
> > 
> > # trace-cmd record -e xfs\*
> > 
> > before the problem occurs, and once it has occurred, kill the
> > trace-cmd with ctrl-C, and then run:
> > 
> > # trace-cmd report > trace_report.txt
> 
> https://dumps.sy24.ru/trace_report.txt.bz2 (860MB)
> 
> Ok, I collected needed information.

You collected a trace of something, but didn't supply any of the
other storage and fs config stuff that was mentioned in that link.
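For reference, gathering the basics that page asks for only takes a
minute or two - something like the following, where the mount point
is just an example (substitute your own filesystem):

# uname -a
# xfs_repair -V
# xfs_info /mnt/scratch
# cat /proc/partitions
# cat /proc/mounts
# cat /proc/meminfo
# dmesg > dmesg.txt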
> Cant you now look into in?

I don't see a filesystem problem in the log you've posted. I see
some slow IO a minute after boot, then a lockdep false positive
about 40 minutes in, but the system then reports GPU memory
allocation problems and hardware MCEs for the next 4-5 hours before
the GPU appears to stop working.

[....]

> [    4.687255] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

I'm guessing that you have an SSD with ext4 and two 4TB drives, and
that ext4 is on the SSD?

> [    5.778628] EXT4-fs (sda1): re-mounted. Opts: (null)

.....

> [    7.918812] XFS (sdb): Mounting V5 Filesystem
> [    8.123854] XFS (sdb): Starting recovery (logdev: internal)

And there's an XFS filesystem on one drive...

> [   77.459679] sysrq: SysRq : Show Blocked State
> [   77.459693]   task                        PC stack   pid father
> [   77.459947] tracker-store   D12296  2469   1847 0x00000000
> [   77.459957] Call Trace:
> [   77.459963]  __schedule+0x2dc/0xba0
> [   77.459966]  ? _raw_spin_unlock_irq+0x2c/0x40
> [   77.459970]  schedule+0x33/0x90
> [   77.459974]  io_schedule+0x16/0x40
> [   77.459978]  generic_file_read_iter+0x3b8/0xe10
> [   77.459986]  ? page_cache_tree_insert+0x140/0x140
> [   77.460026]  xfs_file_buffered_aio_read+0x6e/0x1a0 [xfs]
> [   77.460054]  xfs_file_read_iter+0x68/0xc0 [xfs]
> [   77.460058]  __vfs_read+0xf1/0x160
> [   77.460065]  vfs_read+0xa3/0x150
> [   77.460069]  SyS_pread64+0x98/0xc0
> [   77.460074]  entry_SYSCALL_64_fastpath+0x1f/0x96

That's a task waiting on a read IO - no indication of anything
being wrong here....

> [ 2095.241660] TaskSchedulerFo (16168) used greatest stack depth: 10232 bytes left
> 
> [ 2173.204790] ======================================================
> [ 2173.204791] WARNING: possible circular locking dependency detected
> [ 2173.204793] 4.15.0-rc4-amd-vega+ #8 Not tainted
> [ 2173.204794] ------------------------------------------------------
> [ 2173.204795] gnome-shell/1971 is trying to acquire lock:
> [ 2173.204796]  (sb_internal){.+.+}, at: [<00000000221fd49d>] xfs_trans_alloc+0xec/0x130 [xfs]
> [ 2173.204832] 
>                but task is already holding lock:
> [ 2173.204833]  (fs_reclaim){+.+.}, at: [<00000000bdc32871>] fs_reclaim_acquire.part.74+0x5/0x30
> [ 2173.204837] 
>                which lock already depends on the new lock.

And here we go again on another lockdep memory-reclaim false
positive whack-a-mole game.

> [ 2173.204838] 
>                the existing dependency chain (in reverse order) is:
> [ 2173.204839] 
> -> #1 (fs_reclaim){+.+.}:
> [ 2173.204843]        fs_reclaim_acquire.part.74+0x29/0x30
> [ 2173.204844]        fs_reclaim_acquire+0x19/0x20
> [ 2173.204846]        kmem_cache_alloc+0x33/0x300
> [ 2173.204870]        kmem_zone_alloc+0x6c/0xf0 [xfs]
> [ 2173.204891]        xfs_trans_alloc+0x6b/0x130 [xfs]
> [ 2173.204912]        xfs_efi_recover+0x11c/0x1c0 [xfs]
> [ 2173.204932]        xlog_recover_process_efi+0x41/0x60 [xfs]
> [ 2173.204951]        xlog_recover_process_intents.isra.40+0x138/0x270 [xfs]
> [ 2173.204969]        xlog_recover_finish+0x23/0xb0 [xfs]
> [ 2173.204987]        xfs_log_mount_finish+0x61/0xe0 [xfs]
> [ 2173.205005]        xfs_mountfs+0x657/0xa60 [xfs]
> [ 2173.205022]        xfs_fs_fill_super+0x4aa/0x630 [xfs]
> [ 2173.205024]        mount_bdev+0x184/0x1c0
> [ 2173.205042]        xfs_fs_mount+0x15/0x20 [xfs]
> [ 2173.205043]        mount_fs+0x32/0x150
> [ 2173.205045]        vfs_kern_mount.part.25+0x5d/0x160
> [ 2173.205046]        do_mount+0x65d/0xde0
> [ 2173.205047]        SyS_mount+0x98/0xe0
> [ 2173.205049]        do_syscall_64+0x6c/0x220
> [ 2173.205052]        return_from_SYSCALL_64+0x0/0x75
> [ 2173.205053] 
> -> #0 (sb_internal){.+.+}:
> [ 2173.205056]        lock_acquire+0xa3/0x1f0
> [ 2173.205058]        __sb_start_write+0x11c/0x190
> [ 2173.205075]        xfs_trans_alloc+0xec/0x130 [xfs]
> [ 2173.205091]        xfs_free_eofblocks+0x12a/0x1e0 [xfs]
> [ 2173.205108]        xfs_inactive+0xf0/0x110 [xfs]
> [ 2173.205125]        xfs_fs_destroy_inode+0xbb/0x2d0 [xfs]
> [ 2173.205127]        destroy_inode+0x3b/0x60
> [ 2173.205128]        evict+0x13e/0x1a0
> [ 2173.205129]        dispose_list+0x56/0x80
> [ 2173.205131]        prune_icache_sb+0x5a/0x80
> [ 2173.205132]        super_cache_scan+0x137/0x1b0
> [ 2173.205134]        shrink_slab.part.47+0x1fb/0x590
> [ 2173.205135]        shrink_slab+0x29/0x30
> [ 2173.205136]        shrink_node+0x11e/0x2f0
> [ 2173.205137]        do_try_to_free_pages+0xd0/0x350
> [ 2173.205138]        try_to_free_pages+0x136/0x340
> [ 2173.205140]        __alloc_pages_slowpath+0x487/0x1150
> [ 2173.205141]        __alloc_pages_nodemask+0x3a8/0x430
> [ 2173.205143]        dma_generic_alloc_coherent+0x91/0x160
> [ 2173.205146]        x86_swiotlb_alloc_coherent+0x25/0x50
> [ 2173.205150]        ttm_dma_pool_get_pages+0x230/0x630 [ttm]

OK, a new symptom of the age-old problem of using lockdep to
annotate things that are not locks. In this case, it's the memory
reclaim and filesystem freeze annotations colliding with an XFS
function that can be called both above and below memory allocation,
producing a false positive.

i.e. it's perfectly safe for us to call xfs_trans_alloc() the way
we do from memory reclaim, because we're not in a GFP_NOFS or
PF_MEMALLOC_NOFS context. And it's also perfectly safe for us to
call xfs_trans_alloc() from log recovery at mount time like we do,
because the filesystem cannot be frozen before the mount is
complete, and hence sb_internal ordering is completely irrelevant
at that point.

So it's a false positive, and I don't think there's anything we can
do to prevent it, because using __GFP_NOLOCKDEP in
xfs_trans_alloc() would mean lockdep will not warn when we have a
real deadlock due to transaction nesting in memory reclaim
contexts.....
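To make that concrete, here's a minimal sketch of the two
mechanisms in play. This is purely illustrative - it is not the
actual XFS code, and the function names are made up:

#include <linux/gfp.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Transaction-style context: mark the task PF_MEMALLOC_NOFS so that
 * current_gfp_context() strips __GFP_FS from every allocation made
 * below this point, meaning the allocator can never recurse into
 * filesystem reclaim from here. Lockdep's fs_reclaim tracking
 * understands this marking, so no warning is emitted.
 */
static void *sketch_alloc_nofs(size_t size)
{
	unsigned int nofs_flags = memalloc_nofs_save();
	void *p = kmalloc(size, GFP_KERNEL);

	memalloc_nofs_restore(nofs_flags);
	return p;
}

/*
 * The "fix" that doesn't work: __GFP_NOLOCKDEP opts this single
 * allocation site out of lockdep's fs_reclaim tracking entirely.
 * That would silence the false positive above, but it would also
 * hide any real reclaim deadlock through the same site - which is
 * exactly why we don't want it in xfs_trans_alloc().
 */
static void *sketch_alloc_nolockdep(size_t size)
{
	return kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);
}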
From here, there's nothing filesystem related in the logs:

[.....]

> [ 2229.274826] swiotlb: coherent allocation failed for device 0000:07:00.0 size=2097152

You are getting gpu memory allocation failures....

> [ 2234.832320] amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes)
> [ 2234.832325] swiotlb: coherent allocation failed for device 0000:07:00.0 size=2097152

repeatedly, until....

> [ 2938.815747] mce: [Hardware Error]: Machine check events logged

your hardware starts throwing errors at the CPU.

> [ 2999.259697] kworker/dying (16220) used greatest stack depth: 9808 bytes left
> [ 3151.714448] perf: interrupt took too long (2521 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
> [ 5331.990934] TCP: request_sock_TCP: Possible SYN flooding on port 8201. Sending cookies. Check SNMP counters.
> [ 5331.991837] TCP: request_sock_TCP: Possible SYN flooding on port 9208. Sending cookies. Check SNMP counters.
> [ 5334.781978] TCP: request_sock_TCP: Possible SYN flooding on port 7171. Sending cookies. Check SNMP counters.

Other bad things are happening to your machine....

> [ 5354.636542] sbis3plugin[29294]: segfault at 8 ip 000000001c84acaf sp 0000000038851b8c error 4 in libQt5Core.so[7f87ee665000+5a3000]
> [ 5794.612947] perf: interrupt took too long (3152 > 3151), lowering kernel.perf_event_max_sample_rate to 63000
> [ 6242.114852] amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes)
> [ 6242.114857] swiotlb: coherent allocation failed for device 0000:07:00.0 size=2097152

Random userspace segfaults and more gpu memory allocation failures.

> [ 9663.267767] mce: [Hardware Error]: Machine check events logged
> [10322.649619] mce: [Hardware Error]: Machine check events logged
> [10557.294312] amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes)
> [10557.294317] swiotlb: coherent allocation failed for device 0000:07:00.0 size=2097152

More hardware errors and gpu memory allocation failures....

[more gpu memalloc failures]

> [13609.734065] mce: [Hardware Error]: Machine check events logged
> [13920.399283] mce: [Hardware Error]: Machine check events logged
> [14116.872461] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1653418, last emitted seq=1653420
> [14116.872466] [drm] No hardware hang detected. Did some blocks stall?

And finally, after 4+ hours of hardware errors, the GPU times out
and drm is confused....

So there doesn't appear to be any filesystem problem here, just a
heavily loaded system under memory pressure....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx