Hi Dave,
First, to answer your questions:
- this filesystem is accessed only through NFS, so I guess we can say this
is a "dedicated" NFS server
- yes, both NFSv3 and NFSv4 are used (by different servers), and definitely
they may attempt to access the same file
I did some further debugging by patching XFS to get some more info on how
much memory is requested, and who requests it. I added a kmem_realloc
variant that is called only by xfs_iext_realloc_indirect. This new function
(kmem_realloc_xfs_iext_realloc_indirect) also prints the inode number, which
needs the memory. This is after some fiddling to check that "ifp" is the one
embedded in xfs_inode and not the one allocated through kmem_cache (attr
fork). After memory allocation fails for the first time, I also call
handle_sysrq('m'). Finally, I print after how many retries the allocation
succeeded. I also added some info to the normal kmem_alloc function.
Results are following:
- memory allocation failures happened only on the
kmem_realloc_xfs_iext_realloc_indirect path for now
- XFS hits memory re-allocation failures when it needs to allocate about
35KB. Sometimes allocation succeeds after few retries, but sometimes it
takes several thousands of retries.
- All allocation failures happened on NFSv3 paths
- Three inode numbers were reported as failing memory allocations. After
several hours, "find -inum" is still searching for these inodes...this is a
huge filesystem... Is there any other quicker (XFS-specific?) way to find
the file based on inode number?
Please see a typical allocation failure in [1]
Any recommendation how to move forward with this issue?
Additional observation that I saw in my local system: writing files to XFS
locally vs writing the same files via NFS (both 3 and 4), the amount of
extents reported by "xfs_bmap" is much higher for the NFS case. For example,
creating a new file and writing into it as follows:
- write 4KB
- skip 4KB (i.e., lseek to 4KB + 4KB)
- write 4KB
- skip 4KB
...
Create a file of say 50MB this way.
Locally it ends up with very few (1-5) extents. But same exact workload
through NFS results in several thousands of extents. The filesystem is
mounted as "sync" in both cases. This is currently just an observation,
which I will try to debug further. Are you perhaps familiar with such
behavior? Can you perhaps try and check whether it also happens in your
environment?
Thanks for your help,
Alex.
[1]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.497033] XFS: pid=12642
kmem_alloc failure inum=35646862875 size=35584 flags=0x4 lflags=0x250
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498407] Pid: 12642, comm:
nfsd Tainted: GF O 3.8.13-030813-generic #201305111843
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498410] Call Trace:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498450]
[<ffffffffa07a832e>] kmem_alloc_xfs_iext_realloc_indirect+0x16e/0x1f0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498465]
[<ffffffffa07a86b3>] kmem_realloc_xfs_iext_realloc_indirect+0x33/0x80 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498482]
[<ffffffffa07db2a0>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498498]
[<ffffffffa07db2ff>] xfs_iext_irec_new+0x3f/0x180 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498516]
[<ffffffffa07db5b9>] xfs_iext_add_indirect_multi+0x179/0x2b0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498532]
[<ffffffffa07db9be>] xfs_iext_add+0xce/0x290 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498539]
[<ffffffff81076634>] ? wake_up_worker+0x24/0x30
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498553]
[<ffffffffa07dbbd2>] xfs_iext_insert+0x52/0x100 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498569]
[<ffffffffa07b6ed3>] ? xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498583]
[<ffffffffa07b6ed3>] xfs_bmap_add_extent_hole_delay+0xd3/0x6a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498598]
[<ffffffffa07da3fc>] ? xfs_iext_bno_to_ext+0x8c/0x160 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498614]
[<ffffffffa07b7719>] xfs_bmapi_reserve_delalloc+0x279/0x2a0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498628]
[<ffffffffa07be742>] xfs_bmapi_delay+0x122/0x270 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498642]
[<ffffffffa079d703>] xfs_iomap_write_delay+0x173/0x320 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498656]
[<ffffffffa07be42c>] ? xfs_bmapi_read+0xfc/0x2f0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498660]
[<ffffffff8135d8f3>] ? call_rwsem_down_write_failed+0x13/0x20
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498665]
[<ffffffff81195bbc>] ? lookup_page_cgroup+0x4c/0x50
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498677]
[<ffffffffa078ab40>] __xfs_get_blocks+0x280/0x550 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498688]
[<ffffffffa078ae41>] xfs_get_blocks+0x11/0x20 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498695]
[<ffffffff811cf77e>] __block_write_begin+0x1ae/0x4e0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498705]
[<ffffffffa078ae30>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498708]
[<ffffffff81135fff>] ? grab_cache_page_write_begin+0x8f/0xf0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498720]
[<ffffffffa078a09f>] xfs_vm_write_begin+0x5f/0xe0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498725]
[<ffffffff8113552a>] generic_perform_write+0xca/0x210
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498729]
[<ffffffff811356cd>] generic_file_buffered_write+0x5d/0x90
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498740]
[<ffffffffa07952d5>] xfs_file_buffered_aio_write+0x115/0x1c0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498744]
[<ffffffff816159f4>] ? ip_finish_output+0x224/0x3b0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498758]
[<ffffffffa079547c>] xfs_file_aio_write+0xfc/0x1b0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498769]
[<ffffffffa0795380>] ? xfs_file_buffered_aio_write+0x1c0/0x1c0 [xfs]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498773]
[<ffffffff8119b8c3>] do_sync_readv_writev+0xa3/0xe0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498776]
[<ffffffff8119bb8d>] do_readv_writev+0xcd/0x1d0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498780]
[<ffffffff810877e0>] ? set_groups+0x40/0x60
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498805]
[<ffffffffa033a6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498808]
[<ffffffff8119bccc>] vfs_writev+0x3c/0x50
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498815]
[<ffffffffa0333dd2>] nfsd_vfs_write.isra.12+0x92/0x350 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498818]
[<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498825]
[<ffffffffa0336679>] nfsd_write+0xf9/0x110 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498834]
[<ffffffffa0340dd1>] nfsd3_proc_write+0xb1/0x140 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498841]
[<ffffffffa032fd62>] nfsd_dispatch+0x102/0x270 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498865]
[<ffffffffa0103b48>] svc_process_common+0x328/0x5e0 [sunrpc]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498876]
[<ffffffffa0104153>] svc_process+0x103/0x160 [sunrpc]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498882]
[<ffffffffa032f72f>] nfsd+0xbf/0x130 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498888]
[<ffffffffa032f670>] ? nfsd_destroy+0x80/0x80 [nfsd]
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498892]
[<ffffffff8107f050>] kthread+0xc0/0xd0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498895]
[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498900]
[<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498903]
[<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.498905] SysRq : Show Memory
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499385] Mem-Info:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499388] Node 0 DMA per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499391] CPU 0: hi: 0,
btch: 1 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499392] CPU 1: hi: 0,
btch: 1 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499394] CPU 2: hi: 0,
btch: 1 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499395] CPU 3: hi: 0,
btch: 1 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499396] Node 0 DMA32
per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499399] CPU 0: hi: 186,
btch: 31 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499400] CPU 1: hi: 186,
btch: 31 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499402] CPU 2: hi: 186,
btch: 31 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499403] CPU 3: hi: 186,
btch: 31 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499404] Node 0 Normal
per-cpu:
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499406] CPU 0: hi: 186,
btch: 31 usd: 23
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499407] CPU 1: hi: 186,
btch: 31 usd: 25
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499409] CPU 2: hi: 186,
btch: 31 usd: 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499410] CPU 3: hi: 186,
btch: 31 usd: 7
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_anon:2143
inactive_anon:44181 isolated_anon:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] active_file:913373
inactive_file:1464930 isolated_file:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] unevictable:8449
dirty:6742 writeback:115 unstable:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] free:159418
slab_reclaimable:146857 slab_unreclaimable:66681
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] mapped:5561
shmem:383 pagetables:2195 bounce:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499415] free_cma:0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499418] Node 0 DMA
free:15884kB min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:15628kB managed:15884kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable?
yes
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499423] lowmem_reserve[]: 0
3512 12080 12080
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499427] Node 0 DMA32
free:240040kB min:19624kB low:24528kB high:29436kB active_anon:8kB
inactive_anon:15196kB active_file:1503256kB inactive_file:1624328kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596532kB
managed:3546428kB mlocked:0kB dirty:14652kB writeback:0kB mapped:988kB
shmem:0kB slab_reclaimable:173044kB slab_unreclaimable:15988kB
kernel_stack:248kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499432] lowmem_reserve[]: 0
0 8568 8568
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499435] Node 0 Normal
free:381748kB min:47872kB low:59840kB high:71808kB active_anon:8564kB
inactive_anon:161528kB active_file:2150236kB inactive_file:4235392kB
unevictable:33796kB isolated(anon):0kB isolated(file):0kB present:8773632kB
managed:8715068kB mlocked:33796kB dirty:12316kB writeback:460kB
mapped:21256kB shmem:1532kB slab_reclaimable:414384kB
slab_unreclaimable:250736kB kernel_stack:8424kB pagetables:8752kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499443] lowmem_reserve[]: 0
0 0 0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499446] Node 0 DMA: 1*4kB
(U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15884kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499458] Node 0 DMA32:
27954*4kB (UEM) 12312*8kB (UEM) 1713*16kB (UEM) 17*32kB (UMR) 2*64kB (R)
2*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB =
240440kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499469] Node 0 Normal:
31971*4kB (UEM) 31254*8kB (UEM) 2*16kB (EM) 5*32kB (R) 4*64kB (R) 2*128kB
(R) 0*256kB 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 381692kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499480] 2380174 total
pagecache pages
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499482] 0 pages in swap
cache
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499483] Swap cache stats:
add 0, delete 0, find 0/0
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499485] Free swap =
522236kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.499486] Total swap
=522236kB
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3145712 pages RAM
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 69093 pages
reserved
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 3401898 pages
shared
Jun 26 14:19:08 vsa-00000003-vc-1 kernel: [91809.528135] 1067069 pages
non-shared
...
34501 Jun 26 14:20:03 vsa-00000003-vc-1 kernel: [91864.156051] XFS:
pid=12642 kmem_alloc success after 13443 retries inum=35646862875 size=35584
flags=0x4 lflags=0x250
-----Original Message-----
From: Dave Chinner
Sent: 23 June, 2015 10:18 PM
To: Alex Lyakas
Cc: xfs@xxxxxxxxxxx ; hch@xxxxxx
Subject: Re: xfs_iext_realloc_indirect and "XFS: possible memory allocation
deadlock"
On Tue, Jun 23, 2015 at 10:18:21AM +0200, Alex Lyakas wrote:
Greetings,
We are hitting an issue with XFS printing messages like “XFS:
possible memory allocation deadlock in kmem_alloc
(mode:0x250)” and stack trace like in [1]. Eventually,
hung-task panic kicks in with stack traces like [2].
We are running kernel 3.8.13. I see that in
http://oss.sgi.com/archives/xfs/2012-01/msg00341.html a similar
issue has been discussed, but no code changes followed comparing
to what we have in 3.8.13.
Any suggestion on how to move forward with this problem? For
example, does this memory has to be really allocated with kmalloc
(i.e., physically continuous) or vmalloc can be used?
We left it alone because it is relatively rare for people to hit it,
and generally it indicates a severe fragmentation problem when they
do hit it (i.e. a file with millions of extents in it). Can you
track down the file that this is occurring against and see how many
extents it has?
i.e. you may be much better off by taking measures to avoid excessive
fragmentation than removing the canary from the mine...
[109626.075483] nfsd D 0000000000000002 0 20042 2
0x00000000
Hmmm - it's also a file written by the NFS server - this is is on an
a dedicated NFS server?
[109626.075483] [<ffffffffa01be6b0>] ? nfsd_setuser+0x120/0x2b0 [nfsd]
[109626.075483] [<ffffffff8119bccc>] ? vfs_writev+0x3c/0x50
[109626.075483] [<ffffffffa01b7dd2>] ? nfsd_vfs_write.isra.12+0x92/0x350
[nfsd]
[109626.075483] [<ffffffff8119a6cb>] ? dentry_open+0x6b/0xd0
[109626.075483] [<ffffffffa01ba679>] ? nfsd_write+0xf9/0x110 [nfsd]
[109626.075483] [<ffffffffa01c4dd1>] ? nfsd3_proc_write+0xb1/0x140 [nfsd]
Interesting that this is an NFSv3 write...
[87303.976119] INFO: task nfsd:5684 blocked for more than 180 seconds.
[87303.976976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[87303.978012] nfsd D 0000000000000003 0 5684 2
0x00000000
....
[87303.978174] [<ffffffffa0269623>] nfsd_write+0xa3/0x110 [nfsd]
[87303.978182] [<ffffffffa027794c>] nfsd4_write+0x1cc/0x250 [nfsd]
[87303.978189] [<ffffffffa027746c>] nfsd4_proc_compound+0x5ac/0x7a0
[nfsd]
And that is a NFsv4 write. You have multiple clients writing to the
same file using different versions of the NFS protocol?
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs