On Mon, Nov 24, 2014 at 05:15:19PM +0900, Joonsoo Kim wrote:
> When we debug something, we'd like to attach some information to every
> page. For this purpose, we sometimes modify struct page itself, but
> this has drawbacks. First, it requires a re-compile, which makes us
> hesitate to use this powerful debug feature, so the development
> process is slowed down. Second, it is sometimes impossible to rebuild
> the kernel due to third-party module dependencies. Third, system
> behaviour would be largely different after a re-compile, because it
> changes the size of struct page greatly, and this structure is
> accessed by every part of the kernel. Keeping struct page unchanged is
> better for reproducing an erroneous situation.
>
> This feature is intended to overcome the problems mentioned above. It
> allocates memory for extended data per page somewhere other than
> struct page itself, and this memory can be accessed through the
> accessor functions provided by this code. During the boot process, it
> checks whether allocating this huge chunk of memory is needed or not;
> if not, it avoids allocating memory at all. With this advantage, we
> can include this feature in the kernel by default and avoid rebuilds
> and the problems related to them.
>
> Until now, memcg used this technique. But memcg has since decided to
> embed its variable in struct page itself, and its code to extend
> struct page has been removed. I'd like to use this code to develop
> debug features, so this patch resurrects it.
>
> To make this work well, this patch introduces two callbacks for
> clients. One is the need callback, which is mandatory if the user
> wants to avoid useless memory allocation at boot time. The other is
> the optional init callback, which is used to do proper initialization
> after the memory is allocated. A detailed explanation of the purpose
> of these functions is in a code comment; please refer to it.
>
> Everything else is the same as the previous extension code in memcg.
>
> v3:
> minor fix for readable code
>
> v2:
> describe overall design at the top of the page extension code.
> add more description to the commit message.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>

Hello, Andrew.

Could you fold the following fix into the merged patch? It fixes the
problem on !CONFIG_SPARSEMEM reported by the 0day kernel testing robot:

https://lkml.org/lkml/2014/11/28/123

Thanks.

------->8----------
From 8436eaa754208d08f065e6317c7a16c7dfe1a766 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Date: Wed, 3 Dec 2014 09:48:16 +0900
Subject: [PATCH] mm/page_ext: reserve more space in case of unaligned node
 range

When the page allocator's buddy algorithm checks a buddy's status, the
checked page can lie outside the node's pfn range. In that case,
lookup_page_ext() returns an invalid address, and dereferencing it
crashes. For example, if node_start_pfn is 1 and the page with pfn 1 is
freed to the page allocator, page_is_buddy() checks the page with pfn 0.
In the page_ext code, the offset is calculated as pfn - node_start_pfn,
i.e. 0 - 1 = -1, which underflows. This causes the following problem,
reported by Fengguang:
[    0.480155] BUG: unable to handle kernel paging request at d26bdffc
[    0.481566] IP: [<c110bc7a>] free_one_page+0x31a/0x3e0
[    0.482801] *pdpt = 0000000001866001 *pde = 0000000012584067 *pte = 80000000126bd060
[    0.483333] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC

snip...

[    0.483333] Call Trace:
[    0.483333]  [<c110bdec>] __free_pages_ok+0xac/0xf0
[    0.483333]  [<c110c769>] __free_pages+0x19/0x30
[    0.483333]  [<c1144ca5>] kfree+0x75/0xf0
[    0.483333]  [<c111b595>] ? kvfree+0x45/0x50
[    0.483333]  [<c111b595>] kvfree+0x45/0x50
[    0.483333]  [<c134bb73>] rhashtable_expand+0x1b3/0x1e0
[    0.483333]  [<c17fc9f9>] test_rht_init+0x173/0x2e8
[    0.483333]  [<c134b750>] ? jhash2+0xe0/0xe0
[    0.483333]  [<c134b790>] ? rhashtable_hashfn+0x20/0x20
[    0.483333]  [<c134b7b0>] ? rht_grow_above_75+0x20/0x20
[    0.483333]  [<c134b7d0>] ? rht_shrink_below_30+0x20/0x20
[    0.483333]  [<c134b750>] ? jhash2+0xe0/0xe0
[    0.483333]  [<c134b790>] ? rhashtable_hashfn+0x20/0x20
[    0.483333]  [<c134b7b0>] ? rht_grow_above_75+0x20/0x20
[    0.483333]  [<c134b7d0>] ? rht_shrink_below_30+0x20/0x20
[    0.483333]  [<c17fc886>] ? test_rht_lookup+0x8f/0x8f
[    0.483333]  [<c1000486>] do_one_initcall+0xc6/0x210
[    0.483333]  [<c17fc886>] ? test_rht_lookup+0x8f/0x8f
[    0.483333]  [<c17d0505>] ? repair_env_string+0x12/0x54
[    0.483333]  [<c17d0cf3>] kernel_init_freeable+0x193/0x213
[    0.483333]  [<c1512500>] kernel_init+0x10/0xf0
[    0.483333]  [<c151c5c1>] ret_from_kernel_thread+0x21/0x30
[    0.483333]  [<c15124f0>] ? rest_init+0xb0/0xb0

snip...

[    0.483333] EIP: [<c110bc7a>] free_one_page+0x31a/0x3e0 SS:ESP 0068:c0041de0
[    0.483333] CR2: 00000000d26bdffc
[    0.483333] ---[ end trace 7648e12f817ef2ad ]---

This case is already handled for struct page by taking the alignment of
node_start_pfn into account, so this patch follows the same approach to
fix the page_ext lookup.
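To make the arithmetic concrete, here is a small userspace sketch (not
kernel code) of the lookup offset before and after the fix, using the
example above (node_start_pfn = 1, buddy check touching pfn 0). The
MAX_ORDER_NR_PAGES value and the simplified round_down() are assumptions
made for the demo; the kernel's definitions differ in detail but behave
the same for power-of-two sizes. The second hunk of the patch is what
makes the rounded-down base safe, by reserving up to one extra
MAX_ORDER_NR_PAGES worth of entries for unaligned nodes.

#include <stdio.h>

/* Assumed for the demo: MAX_ORDER = 11 with 4K pages -> 1024 pages. */
#define MAX_ORDER_NR_PAGES	1024UL

/* Simplified round_down() for power-of-two alignment. */
#define round_down(x, y)	((x) & ~((y) - 1))

int main(void)
{
	unsigned long node_start_pfn = 1;
	unsigned long pfn = 0;	/* order-0 buddy of the page at pfn 1 */

	/* Old calculation: 0 - 1 underflows to ULONG_MAX. */
	unsigned long bad_offset = pfn - node_start_pfn;

	/* Fixed calculation: align the base down first, so the offset
	 * stays non-negative and inside the enlarged page_ext table. */
	unsigned long good_offset =
		pfn - round_down(node_start_pfn, MAX_ORDER_NR_PAGES);

	printf("old offset: %lu\n", bad_offset);	/* ULONG_MAX */
	printf("new offset: %lu\n", good_offset);	/* 0 */
	return 0;
}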
Reported-by: Fengguang Wu <fengguang.wu@xxxxxxxxx>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
 mm/page_ext.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ce86485..184f3ef 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -112,7 +112,8 @@ struct page_ext *lookup_page_ext(struct page *page)
 	if (unlikely(!base))
 		return NULL;
 #endif
-	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	offset = pfn - round_down(node_start_pfn(page_to_nid(page)),
+					MAX_ORDER_NR_PAGES);
 	return base + offset;
 }
 
@@ -126,6 +127,15 @@ static int __init alloc_node_page_ext(int nid)
 	if (!nr_pages)
 		return 0;
 
+	/*
+	 * Need extra space if node range is not aligned with
+	 * MAX_ORDER_NR_PAGES. When page allocator's buddy algorithm
+	 * checks buddy's status, range could be out of exact node range.
+	 */
+	if (!IS_ALIGNED(node_start_pfn(nid), MAX_ORDER_NR_PAGES) ||
+		!IS_ALIGNED(node_end_pfn(nid), MAX_ORDER_NR_PAGES))
+		nr_pages += MAX_ORDER_NR_PAGES;
+
 	table_size = sizeof(struct page_ext) * nr_pages;
 
 	base = memblock_virt_alloc_try_nid_nopanic(
-- 
1.7.9.5
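For reference, a minimal sketch of what a page_ext client could look
like, following the need/init callback design described in the quoted
commit message above. The client name and the enable flag are
hypothetical, made up for illustration; only the two-callback shape of
struct page_ext_operations comes from the actual patch.

#include <linux/kernel.h>
#include <linux/page_ext.h>

/* Hypothetical flag; a real client might set this from a boot param. */
static bool my_debug_enabled;

static bool need_my_debug(void)
{
	/*
	 * Returning false here lets page_ext skip the boot-time
	 * allocation entirely when the feature is unused.
	 */
	return my_debug_enabled;
}

static void init_my_debug(void)
{
	/* Called once the per-page extension memory is allocated. */
	pr_info("my_debug: page_ext space reserved\n");
}

struct page_ext_operations my_debug_ops = {
	.need = need_my_debug,
	.init = init_my_debug,
};

/* A client is then listed in the page_ext_ops[] array in mm/page_ext.c. */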