Hi,
Enrik Berkhan wrote:
Currently we're experiencing process hangs that seem to be
ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
patches including some memory management changes for NOMMU.)
The situation is as follows:
We have two threads writing to an ext4 filesystem. After several hours,
across about 20 systems, a single hang occurs in which (reconstructed
from Alt-SysRq-W output):
1. pdflush waits in start_this_handle
2. kjournald2 waits in jbd2_journal_commit_transaction
3. thread 1 waits in start_this_handle
4. thread 2 waits in
ext4_da_write_begin
(start_this_handle succeeded)
grab_cache_page_write_begin
__alloc_pages_internal
try_to_free_pages
do_try_to_free_pages
congestion_wait
Actually, thread 2 shouldn't be completely blocked, because
congestion_wait has a timeout, if I understand the code correctly.
Unfortunately, I pressed Alt-SysRq-W only once when I had a chance to
reproduce the problem on a test system with console access.
Meanwhile, I have found that thread 2 actually isn't completely
blocked, but loops in __alloc_pages_internal:
get_page_from_freelist() doesn't return a page;
try_to_free_pages() returns did_some_progress == 0;
later, do_retry == 1 and the loop restarts via goto rebalance.
I have implemented a workaround that limits the "goto rebalance"
retries to 3 iterations and triggers a page allocation failure once
they are exhausted. I have also changed ext4_{da_,}write_begin() to
retry grab_cache_page_write_begin() 3 times, by calling
ext4_journal_stop(), sleeping via schedule_timeout_uninterruptible(1),
and doing goto retry.
With this change the system recovers from the situation on the first
grab_cache_page_write_begin() retry; I saw that happen 17 times during
a test last night.
Interestingly, the show_mem() call triggered in __alloc_pages_internal()
on page allocation failure shows plenty of free memory; see this example:
Mar 5 02:26:16 bfp53 [21804.947985] manage_daq: page allocation failure. order:0, mode:0x120052
Mar 5 02:26:16 bfp53 [21804.954607] Hardware Trace:
Mar 5 02:26:16 bfp53 [21804.957377] 0 Target : <0x000049ac> { _dump_stack + 0x0 }
Mar 5 02:26:16 bfp53 [21804.963139] Source : <0x00037618> { ___alloc_pages_internal + 0x1cc } CALL pcrel
Mar 5 02:26:16 bfp53 [21804.970996] 1 Target : <0x00037618> { ___alloc_pages_internal + 0x1cc }
Mar 5 02:26:16 bfp53 [21804.977966] Source : <0x0000fd2a> { _printk + 0x16 } RTS
Mar 5 02:26:16 bfp53 [21804.983735] 2 Target : <0x0000fd26> { _printk + 0x12 }
Mar 5 02:26:16 bfp53 [21804.989226] Source : <0x0001058e> { _vprintk + 0x116 } RTS
Mar 5 02:26:16 bfp53 [21804.995165] 3 Target : <0x00010586> { _vprintk + 0x10e }
Mar 5 02:26:16 bfp53 [21805.005705] Source : <0xffa00c76> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.013596] 4 Target : <0xffa00c14> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.022158] Source : <0xffa00a5c> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.030932] 5 Target : <0xffa00a5c> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.040640] Source : <0xffa00a32> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.048552] 6 Target : <0xffa00a04> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.057211] Source : <0xffa00c10> /* kernel dynamic memory */
Mar 5 02:26:16 bfp53 [21805.065086] 7 Target : <0xffa00c0e> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.072947] Source : <0xffa00efe> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.081598] 8 Target : <0xffa00efa> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.090706] Source : <0xffa003d2> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.098692] 9 Target : <0xffa003ca> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.104880] Source : <0x00014024> { _irq_exit + 0x50 } RTS
Mar 5 02:26:17 bfp53 [21805.110793] 10 Target : <0x0001401e> { _irq_exit + 0x4a }
Mar 5 02:26:17 bfp53 [21805.116454] Source : <0x0001402c> { _irq_exit + 0x58 } IF CC JUMP
Mar 5 02:26:17 bfp53 [21805.123013] 11 Target : <0x0001402a> { _irq_exit + 0x56 }
Mar 5 02:26:17 bfp53 [21805.128674] Source : <0x00009d8c> { _idle_cpu + 0x1c } RTS
Mar 5 02:26:17 bfp53 [21805.134643] 12 Target : <0x00009d70> { _idle_cpu + 0x0 }
Mar 5 02:26:17 bfp53 [21805.140192] Source : <0x00014026> { _irq_exit + 0x52 } CALL pcrel
Mar 5 02:26:17 bfp53 [21805.146756] 13 Target : <0x00014026> { _irq_exit + 0x52 }
Mar 5 02:26:17 bfp53 [21805.152471] Source : <0x0001401c> { _irq_exit + 0x48 } IF !CC JUMP
Mar 5 02:26:17 bfp53 [21805.159064] 14 Target : <0x00014006> { _irq_exit + 0x32 }
Mar 5 02:26:17 bfp53 [21805.164783] Source : <0x00014048> { _irq_exit + 0x74 } JUMP.S
Mar 5 02:26:17 bfp53 [21805.170932] 15 Target : <0x00014048> { _irq_exit + 0x74 }
Mar 5 02:26:17 bfp53 [21805.176592] Source : <0x00013d6e> { ___do_softirq + 0xa2 } RTS
Mar 5 02:26:17 bfp53 [21805.182864] Stack info:
Mar 5 02:26:17 bfp53 [21805.191811] SP: [0x02ab9b60] <0x02ab9b60> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.198495] FP: (0x02ab9b64)
Mar 5 02:26:17 bfp53 [21805.201403] Memory from 0x02ab9b60 to 02aba000
Mar 5 02:26:17 bfp53 [21805.205835] 02ab9b60:[02ab9bbc](02ab9bbc)<0003761c> 03bde680 00000000 00120052 03bde880 00000000
Mar 5 02:26:17 bfp53 [21805.215086] 02ab9b80: 00120052 00000002 00000042 02ab8000 02ab8000 00000000 00120052 00000000
Mar 5 02:26:17 bfp53 [21805.224375] 02ab9ba0: 02ab8000 00000010 00000002 00000000 00000000 00000000 00000000 (02ab9bf0)
Mar 5 02:26:17 bfp53 [21805.233690] 02ab9bc0:<00033300><02f4e3dc> 04a85000 <02f4e344> 00000000 00000080 00004a85 ffffff7f
Mar 5 02:26:17 bfp53 [21805.243044] 02ab9be0: 02ab9c04 <00a8f76e> 00000000 00000000 (02ab9c44)<00a8892c> 00000000 031423d8
Mar 5 02:26:17 bfp53 [21805.252333] 02ab9c00: 00000000 00000004 031423d8 000005cc 04a85000 000005cc 00000000 000005cc
Mar 5 02:26:17 bfp53 [21805.261648] 02ab9c20: 04a84a34 00000000 000005cc 000005cc 04a86000 00000000 00004a85 00000003
Mar 5 02:26:17 bfp53 [21805.271002] 02ab9c40: 00000000 (02ab9ca4)<00032b08> 02ab8000 00001000 <02f4e3dc> 000005cc 00000000
Mar 5 02:26:17 bfp53 [21805.280292] 02ab9c60: 04a85000 00001000 0306a260 <02f4e3dc> 04a85000 00000000 00001000 00000000
Mar 5 02:26:17 bfp53 [21805.289608] 02ab9c80: 02ab9ca0 02ab9c9c 00000000 00a9fc74 000005cc 00000000 00000000 00000000
Mar 5 02:26:17 bfp53 [21805.298962] 02ab9ca0: 00265100 (02ab9cf0)<00033a6c> 0306a260 <02f4e3dc><02f4e344> 02ab9ce0 00000000
Mar 5 02:26:17 bfp53 [21805.308252] 02ab9cc0: 04a84a34 00000000 0306a260 02ab9ce0 04a84a34 00000000 00000000 00a9fc74
Mar 5 02:26:17 bfp53 [21805.317566] 02ab9ce0: 02ab9e94 00000001 000005cc 00001c24 (02ab9d60)<00033e8a><02f4e344> 000021f0
Mar 5 02:26:17 bfp53 [21805.326919] 02ab9d00: 0306a260 00000000 02ab9e94 00000000 04a84a34 02ab9e10 02ab9dfc <000fda3a>
Mar 5 02:26:17 bfp53 [21805.336211] 02ab9d20: 04a84a34 00000000 02ab9e58 000021f0 00000000 00000000 <02ab9d70><02f4e3dc>
Mar 5 02:26:17 bfp53 [21805.345529] 02ab9d40: 00000000 00000001 ffffffff 00000000 00000000 00000000 00000000 000021f0
Mar 5 02:26:17 bfp53 [21805.354882] 02ab9d60:<02ab9da0><0003440a> 02ab9e10 <02f4e344> 0306a260 02f4e3b0 02ab9e94 00000000
Mar 5 02:26:17 bfp53 [21805.364168] 02ab9d80: 04a84a34 02ab9e10 02ab9e94 00000001 02ab9e58 <0001410e> 02ab9dbc <02f4e3dc>
Mar 5 02:26:17 bfp53 [21805.373488] 02ab9da0: 02ab9dd8 <00a81c28><02f4e344> 0306a260 02f4e2cc 00000000 04a84a34 00000001
Mar 5 02:26:17 bfp53 [21805.382834] 02ab9dc0: 02ab9e94 00000000 00040000 00000001 04a84a34 00000000 02ab9e9c <00042f34>
Mar 5 02:26:17 bfp53 [21805.392127] 02ab9de0: 0306a260 02ab9eec 00e00fb8 02ab9e10 02ab9e94 000021f0 00dda938 02ab9e10
Mar 5 02:26:17 bfp53 [21805.401446] 02ab9e00:<000fdc24> 00bdcd40 04a84a34 00000000 <000fdc5c> 003cf20e 00000000 00000001
Mar 5 02:26:17 bfp53 [21805.410791] 02ab9e20: ffffffff 0306a260 00000000 00000000 00000000 00000000 03bde680 00000000
Mar 5 02:26:17 bfp53 [21805.420082] 02ab9e40: 00000000 001f8c3c 03bde680 00020280 02ab9e50 02ab9e50 04a84a34 00000000
Mar 5 02:26:17 bfp53 [21805.429405] 02ab9e60:<00013d28> 001f18fc 02ab8000 001f1888 000021f0 02ab9e94 <00013d68> 001f18fc
Mar 5 02:26:17 bfp53 [21805.438755] 02ab9e80: 00000000 00000000 02ab9ea4 <00008f04> 001f63f4 02b00004 000021f0 02ab9ec4
Mar 5 02:26:17 bfp53 [21805.448041] 02ab9ea0:<00043514> 0306a260 00000004 02ab9eec 02b00004 00000000 00000000 02ab9ee0
Mar 5 02:26:17 bfp53 [21805.457364] 02ab9ec0: 02ab9eec 02ab9ef8 <000437f8> 0306a260 02b00004 000021f0 02b00004 00000141
Mar 5 02:26:17 bfp53 [21805.466714] 02ab9ee0: 00000000 ffffe000 02ab9eec 04a84a34 00000000 00000001 00000000 ffa008d0
Mar 5 02:26:17 bfp53 [21805.475999] 02ab9f00: 000437c4 00000000 ffffe000 030cf8dc 00e09388 000005a8 00000000 00dda938
Mar 5 02:26:17 bfp53 [21805.485322] 02ab9f20: 02b00004 003cf20e 00008000 00000000 00000000 02aba000 003cf20e 003cf20e
Mar 5 02:26:17 bfp53 [21805.494675] 02ab9f40:<00dcdb62> ffa00e46 02001025 003de2bb 003e03f1 003de2ba 003e03ee 00000000
Mar 5 02:26:17 bfp53 [21805.503958] 02ab9f60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mar 5 02:26:17 bfp53 [21805.513281] 02ab9f80: 00000000 00000000 00000000 00000000 00000000 039a0510 039a0508 0360af3c
Mar 5 02:26:17 bfp53 [21805.522631] 02ab9fa0: 00000020 00000000 00000000 00000000 003f7838 02ab3c7c 02ab3c88 00df09b0
Mar 5 02:26:17 bfp53 [21805.531916] 02ab9fc0: 00000017 00e00fb8 00df13f8 003cf204 00000004 00000000 000021f0 02b00004
Mar 5 02:26:17 bfp53 [21805.541236] 02ab9fe0: 00dda938 00000000 000021f0 02b00004 00000017 00000017 00000004 00000006
Mar 5 02:26:17 bfp53 [21805.550731] Return addresses in stack:
Mar 5 02:26:17 bfp53 [21805.554584] frame 1 : <0x0003761c> { ___alloc_pages_internal + 0x1d0 }
Mar 5 02:26:17 bfp53 [21805.561534] frame 2 : <0x00033300> { _grab_cache_page_write_begin + 0x50 }
Mar 5 02:26:17 bfp53 [21805.584838] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.592680] address : <0x02f4e344> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.599028] address : <0x00a8f76e> { :ext4:_ext4_journal_start_sb + 0x26 }
Mar 5 02:26:17 bfp53 [21805.606329] frame 3 : <0x00a8892c> { :ext4:_ext4_da_write_begin + 0xac }
Mar 5 02:26:17 bfp53 [21805.613392] frame 4 : <0x00032b08> { _generic_perform_write + 0x98 }
Mar 5 02:26:17 bfp53 [21805.622262] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.630250] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.636537] frame 5 : <0x00033a6c> { _generic_file_buffered_write + 0x50 }
Mar 5 02:26:17 bfp53 [21805.645740] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.653708] address : <0x02f4e344> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.659928] frame 6 : <0x00033e8a> { ___generic_file_aio_write_nolock + 0x1ba }
Mar 5 02:26:17 bfp53 [21805.669659] address : <0x02f4e344> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.675988] address : <0x000fda3a> { _sock_recvmsg + 0x96 }
Mar 5 02:26:17 bfp53 [21805.684148] address : <0x02ab9d70> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.692186] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.700105] address : <0x02ab9da0> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.706305] address : <0x0003440a> { _generic_file_aio_write + 0x52 }
Mar 5 02:26:17 bfp53 [21805.715083] address : <0x02f4e344> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.721392] address : <0x0001410e> { _irq_enter + 0xa }
Mar 5 02:26:17 bfp53 [21805.728914] address : <0x02f4e3dc> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.735283] address : <0x00a81c28> { :ext4:_ext4_file_write + 0x40 }
Mar 5 02:26:17 bfp53 [21805.744015] address : <0x02f4e344> /* kernel dynamic memory */
Mar 5 02:26:17 bfp53 [21805.750296] address : <0x00042f34> { _do_sync_write + 0x90 }
Mar 5 02:26:17 bfp53 [21805.756313] address : <0x000fdc24> { _sys_recvfrom + 0x64 }
Mar 5 02:26:17 bfp53 [21805.762262] address : <0x000fdc5c> { _sys_recvfrom + 0x9c }
Mar 5 02:26:17 bfp53 [21805.768263] address : <0x00013d28> { ___do_softirq + 0x5c }
Mar 5 02:26:17 bfp53 [21805.774254] address : <0x00013d68> { ___do_softirq + 0x9c }
Mar 5 02:26:17 bfp53 [21805.780216] address : <0x00008f04> { _bfin_handle_irq + 0x1c }
Mar 5 02:26:17 bfp53 [21805.786449] address : <0x00043514> { _vfs_write + 0x60 }
Mar 5 02:26:17 bfp53 [21805.792159] address : <0x000437f8> { _sys_write + 0x34 }
Mar 5 02:26:17 bfp53 [21805.800542] address : <0x00dcdb62> [ /opt/nge/lib/libngus.so.0.0.0 + 0xdb62 ]
Mar 5 02:26:17 bfp53 [21805.808171] Mem-Info:
Mar 5 02:26:17 bfp53 [21805.810372] DMA per-cpu:
Mar 5 02:26:17 bfp53 [21805.812915] CPU 0: hi: 18, btch: 3 usd: 15
Mar 5 02:26:17 bfp53 [21805.817749] Active_anon:0 active_file:0 inactive_anon:0
Mar 5 02:26:17 bfp53 [21805.817768] inactive_file:620 dirty:565 writeback:0 unstable:0
Mar 5 02:26:17 bfp53 [21805.817786] free:7162 slab:1112 mapped:0 pagetables:0 bounce:0
Mar 5 02:26:17 bfp53 [21805.835000] DMA free:28648kB min:984kB low:1228kB high:1476kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:2480kB present:60956kB pages_scanned:0 all_unreclaimable? no
Mar 5 02:26:17 bfp53 [21805.851771] lowmem_reserve[]: 0 0 0
Mar 5 02:26:17 bfp53 [21805.855236] DMA: 318*4kB 396*8kB 299*16kB 205*32kB 115*64kB 29*128kB 5*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 28648kB
Mar 5 02:26:17 bfp53 [21805.868869] 620 total pagecache pages
Mar 5 02:26:17 bfp53 [21805.876098] 15359 pages RAM
Mar 5 02:26:17 bfp53 [21805.878803] 673 pages reserved
Mar 5 02:26:17 bfp53 [21805.881839] 589 pages shared
Mar 5 02:26:17 bfp53 [21805.884771] 6850 pages non-shared
Mar 5 02:26:17 bfp53 [21805.888129] ext4_da_write_begin: retrying grab_cache_page_write_begin()
My workaround patch is included below for reference.
Can anybody explain this behaviour and maybe point me to the root cause?
Of course, this now looks more like a page allocation problem than an
ext4 one.
Thanks,
Enrik
---
fs/ext4/inode.c | 24 ++++++++++++++++++++++++
mm/page_alloc.c | 4 +++-
2 files changed, 27 insertions(+), 1 deletion(-)
Index: linux-2.6.28.10/fs/ext4/inode.c
===================================================================
--- linux-2.6.28.10.orig/fs/ext4/inode.c 2010-03-03 15:29:23.000000000 +0100
+++ linux-2.6.28.10/fs/ext4/inode.c 2010-03-03 15:53:20.000000000 +0100
@@ -1335,6 +1335,7 @@ static int ext4_write_begin(struct file
struct page *page;
pgoff_t index;
unsigned from, to;
+ int grab_cache_page_retries = 3;
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1354,6 +1355,17 @@ retry:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
ext4_journal_stop(handle);
+
+ /* retry to get memory because the reason for OOM could be
+ * AOP_FLAG_NOFS! */
+ if (grab_cache_page_retries) {
+ printk(KERN_DEBUG "%s: retrying "
+ "grab_cache_page_write_begin()\n", __func__);
+ grab_cache_page_retries--;
+ schedule_timeout_uninterruptible(1);
+ goto retry;
+ }
+
ret = -ENOMEM;
goto out;
}
@@ -2584,6 +2596,7 @@ static int ext4_da_write_begin(struct fi
unsigned from, to;
struct inode *inode = mapping->host;
handle_t *handle;
+ int grab_cache_page_retries = 3;
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
@@ -2614,6 +2627,17 @@ retry:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
ext4_journal_stop(handle);
+
+ /* retry to get memory because the reason for OOM could be
+ * AOP_FLAG_NOFS! */
+ if (grab_cache_page_retries) {
+ printk(KERN_DEBUG "%s: retrying "
+ "grab_cache_page_write_begin()\n", __func__);
+ grab_cache_page_retries--;
+ schedule_timeout_uninterruptible(1);
+ goto retry;
+ }
+
ret = -ENOMEM;
goto out;
}
Index: linux-2.6.28.10/mm/page_alloc.c
===================================================================
--- linux-2.6.28.10.orig/mm/page_alloc.c 2010-03-03 15:52:18.000000000 +0100
+++ linux-2.6.28.10/mm/page_alloc.c 2010-03-03 15:52:22.000000000 +0100
@@ -1476,6 +1476,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
int alloc_flags;
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;
+ int retries = 3;
might_sleep_if(wait);
@@ -1659,8 +1660,9 @@ nofail_alloc:
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
}
- if (do_retry) {
+ if (do_retry && retries) {
congestion_wait(WRITE, HZ/50);
+ retries--;
goto rebalance;
}