Subject: + vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel.patch added to -mm tree
To: d.hatayama@xxxxxxxxxxxxxx,ebiederm@xxxxxxxxxxxx,kumagai-atsushi@xxxxxxxxxxxxxxxxx,vgoyal@xxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Wed, 11 Dec 2013 16:20:41 -0800


The patch titled
     Subject: vmcore: copy fractional pages into buffers in the kdump 2nd kernel
has been added to the -mm tree.  Its filename is
     vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: HATAYAMA Daisuke <d.hatayama@xxxxxxxxxxxxxx>
Subject: vmcore: copy fractional pages into buffers in the kdump 2nd kernel

As Vivek reported in https://lkml.org/lkml/2013/11/13/439, there are
real-world platforms that place a System RAM area and a Reserved area in
the same page.  As a result, mmap() fails at the sanity check that
compares memory cache types within a given range, causing user-land tools
to exit abnormally in the middle of crash dumping.

Although in this particular case the data in the Reserved area is ACPI
data, in general arbitrary data can share a page with a System RAM area.
If that data is, for example, MMIO, reading or writing the area could
affect the corresponding devices and hence the whole system.  We should
avoid such operations as much as possible in order to keep reliability.

To address this issue, copy fractional pages into buffers in the kdump
2nd kernel, and then serve reads of those pages from the buffers rather
than from the fractional pages in the kdump 1st kernel.  Similarly, mmap
the 2nd-kernel buffers, not the 1st-kernel pages.  This is done just as
we already do for ELF note segments.

Strictly speaking, we should avoid even mapping pages that contain a
non-System RAM area, since the mapping could trigger platform-specific
optimizations that in turn lead to some kind of prefetch of the page.
However, as long as we want to read the System RAM part of such a page,
we cannot avoid mapping it.  Therefore, the most reliable approach
available is to limit the number of reads of each fractional page to
exactly one, by buffering its System RAM part in the 2nd kernel.

To implement this, extend the vmcore structure so it can represent an
object buffered in the 2nd kernel, i.e. introduce the VMCORE_2ND_KERNEL
flag: if a vmcore object has VMCORE_2ND_KERNEL set, its data lives in a
2nd-kernel buffer pointed to by its ->buf member.

The only non-trivial case is when multiple System RAM areas are contained
in a single page.  I would like to think no such system exists, but the
issue addressed here is already odd enough that we should assume one is
likely enough to turn up.
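
To make the bookkeeping concrete, here is a minimal user-space sketch
(not part of the patch) of how a PT_LOAD segment is split into a copied
head page, a directly mapped middle chunk, and a copied tail page.  It
assumes 4 KiB pages, and its rounddown()/roundup() helpers merely stand
in for the kernel macros of the same names:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096ULL	/* assume 4 KiB pages for the example */

/* Stand-ins for the kernel's rounddown()/roundup() on power-of-two sizes. */
static uint64_t rounddown(uint64_t x, uint64_t a) { return x & ~(a - 1); }
static uint64_t roundup(uint64_t x, uint64_t a) { return (x + a - 1) & ~(a - 1); }

int main(void)
{
	/* Example PT_LOAD segment with unaligned start and end. */
	uint64_t start = 0x1234;	/* phdr->p_offset */
	uint64_t end = 0x5678;		/* p_offset + p_memsz */

	uint64_t start_down = rounddown(start, PAGE_SIZE);	/* 0x1000 */
	uint64_t start_up = roundup(start, PAGE_SIZE);		/* 0x2000 */
	uint64_t end_down = rounddown(end, PAGE_SIZE);		/* 0x5000 */

	/* Head fractional page: copied into a 2nd-kernel buffer. */
	if (start != start_down)
		printf("head: copy [%#llx, %#llx)\n",
		       (unsigned long long)start,
		       (unsigned long long)(start_up < end ? start_up : end));

	/* Page-aligned middle chunk: mapped straight from old memory. */
	if (start_up < end_down)
		printf("middle: map [%#llx, %#llx)\n",
		       (unsigned long long)start_up,
		       (unsigned long long)end_down);

	/* Tail fractional page, unless the head already covered it. */
	if (end != end_down && end > start_up)
		printf("tail: copy [%#llx, %#llx)\n",
		       (unsigned long long)end_down,
		       (unsigned long long)end);
	return 0;
}

For start = 0x1234 and end = 0x5678 this prints the head range
[0x1234, 0x2000), the middle range [0x2000, 0x5000) and the tail range
[0x5000, 0x5678); only the head and tail end up buffered in the 2nd
kernel.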
Signed-off-by: HATAYAMA Daisuke <d.hatayama@xxxxxxxxxxxxxx>
Reported-by: Vivek Goyal <vgoyal@xxxxxxxxxx>
Cc: Atsushi Kumagai <kumagai-atsushi@xxxxxxxxxxxxxxxxx>
Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 fs/proc/vmcore.c      |  271 +++++++++++++++++++++++++++++++++-------
 include/linux/kcore.h |    4 
 2 files changed, 229 insertions(+), 46 deletions(-)

diff -puN fs/proc/vmcore.c~vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel fs/proc/vmcore.c
--- a/fs/proc/vmcore.c~vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel
+++ a/fs/proc/vmcore.c
@@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffe
 
 	list_for_each_entry(m, &vmcore_list, list) {
 		if (*fpos < m->offset + m->size) {
-			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
-			start = m->paddr + *fpos - m->offset;
-			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
-			if (tmp < 0)
-				return tmp;
+			tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
+			if ((m->flags & VMCORE_2ND_KERNEL)) {
+				void *kaddr;
+
+				kaddr = m->buf + *fpos - m->offset;
+				if (copy_to(buffer, kaddr, tsz, userbuf))
+					return -EFAULT;
+			} else {
+				start = m->paddr + *fpos - m->offset;
+				tmp = read_from_oldmem(buffer, tsz, &start,
+						       userbuf);
+				if (tmp < 0)
+					return tmp;
+			}
 			buflen -= tsz;
 			*fpos += tsz;
 			buffer += tsz;
@@ -300,10 +309,10 @@ static const struct vm_operations_struct
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
+ * alloc_copy_buf - allocate buffer to copy ELF note segment or
+ *                  fractional pages in vmalloc memory
 *
- * @notes_sz: size of buffer
+ * @sz: size of buffer
 *
 * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
 * the buffer to user-space by means of remap_vmalloc_range().
@@ -311,12 +320,12 @@ static const struct vm_operations_struct
 * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
 * disabled and there's no need to allow users to mmap the buffer.
 */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *alloc_copy_buf(size_t sz)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(sz);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(sz);
 #endif
 }
 
@@ -383,14 +392,24 @@ static int mmap_vmcore(struct file *file
 
 	list_for_each_entry(m, &vmcore_list, list) {
 		if (start < m->offset + m->size) {
-			u64 paddr = 0;
-
 			tsz = min_t(size_t, m->offset + m->size - start, size);
-			paddr = m->paddr + start - m->offset;
-			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
-						   paddr >> PAGE_SHIFT, tsz,
-						   vma->vm_page_prot))
-				goto fail;
+			if ((m->flags & VMCORE_2ND_KERNEL)) {
+				unsigned long uaddr = vma->vm_start + len;
+				void *kaddr = m->buf + start - m->offset;
+
+				if (remap_vmalloc_range_partial(vma, uaddr,
+								kaddr, tsz))
+					goto fail;
+			} else {
+				u64 paddr = m->paddr + start - m->offset;
+
+				if (remap_oldmem_pfn_range(vma,
+							   vma->vm_start + len,
+							   paddr >> PAGE_SHIFT,
+							   tsz,
+							   vma->vm_page_prot))
+					goto fail;
+			}
 			size -= tsz;
 			start += tsz;
 			len += tsz;
@@ -580,7 +599,7 @@ static int __init merge_note_headers_elf
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = alloc_copy_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -760,7 +779,7 @@ static int __init merge_note_headers_elf
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = alloc_copy_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -807,7 +826,7 @@ static int __init process_ptload_program
 	Elf64_Ehdr *ehdr_ptr;
 	Elf64_Phdr *phdr_ptr;
 	loff_t vmcore_off;
-	struct vmcore *new;
+	struct vmcore *m, *new;
 
 	ehdr_ptr = (Elf64_Ehdr *)elfptr;
 	phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
@@ -816,27 +835,106 @@ static int __init process_ptload_program
 	vmcore_off = elfsz + elfnotes_sz;
 
 	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
-		u64 paddr, start, end, size;
+		u64 start, end, size, rest;
+		u64 start_up, start_down, end_up, end_down;
+		loff_t offset;
+		int rc, reuse = 0;
 
 		if (phdr_ptr->p_type != PT_LOAD)
 			continue;
 
-		paddr = phdr_ptr->p_offset;
-		start = rounddown(paddr, PAGE_SIZE);
-		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
-		size = end - start;
+		start = phdr_ptr->p_offset;
+		start_up = roundup(start, PAGE_SIZE);
+		start_down = rounddown(start, PAGE_SIZE);
+
+		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+		end_up = roundup(end, PAGE_SIZE);
+		end_down = rounddown(end, PAGE_SIZE);
+
+		size = end_up - start_down;
+		rest = phdr_ptr->p_memsz;
+
+		/* Add a head fractional page to vmcore list. */
+		if (!PAGE_ALIGNED(start)) {
+			/* Reuse the same buffer if multiple System
+			 * RAM entries show up in the same page.
			 */
+			list_for_each_entry(m, vc_list, list) {
+				if (m->paddr == start_down &&
+				    m->flags == VMCORE_2ND_KERNEL) {
+					new = m;
+					reuse = 1;
+					goto skip;
+				}
+			}
+
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = start_down;
+			list_add_tail(&new->list, vc_list);
+ skip:
+
+			offset = start;
+			rc = __read_vmcore(new->buf + (start - start_down),
+					   min(start_up, end) - start,
+					   &offset, 0);
+			if (rc < 0)
+				return rc;
+
+			rest -= min(start_up, end) - start;
+		}
 
 		/* Add this contiguous chunk of memory to vmcore list.*/
-		new = get_new_element();
-		if (!new)
-			return -ENOMEM;
-		new->paddr = start;
-		new->size = size;
-		list_add_tail(&new->list, vc_list);
+		if (rest > 0 && start_up < end_down) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->size = end_down - start_up;
+			new->paddr = start_up;
+			list_add_tail(&new->list, vc_list);
+			rest -= end_down - start_up;
+		}
+
+		/* Add a tail fractional page to vmcore list. */
+		if (rest > 0) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = end_down;
+			list_add_tail(&new->list, vc_list);
+
+			offset = end_down;
+			rc = __read_vmcore(new->buf, end - end_down, &offset,
+					   0);
+			if (rc < 0)
+				return rc;
+
+			rest -= end - end_down;
+		}
+
+		WARN_ON(rest > 0);
 
 		/* Update the program header offset. */
-		phdr_ptr->p_offset = vmcore_off + (paddr - start);
+		phdr_ptr->p_offset = vmcore_off + (start - start_down);
 		vmcore_off = vmcore_off + size;
+		if (reuse) {
+			phdr_ptr->p_offset -= PAGE_SIZE;
+			vmcore_off -= PAGE_SIZE;
+		}
 	}
 	return 0;
 }
@@ -850,7 +948,7 @@ static int __init process_ptload_program
 	Elf32_Ehdr *ehdr_ptr;
 	Elf32_Phdr *phdr_ptr;
 	loff_t vmcore_off;
-	struct vmcore *new;
+	struct vmcore *m, *new;
 
 	ehdr_ptr = (Elf32_Ehdr *)elfptr;
 	phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
@@ -859,27 +957,106 @@ static int __init process_ptload_program
 	vmcore_off = elfsz + elfnotes_sz;
 
 	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
-		u64 paddr, start, end, size;
+		u64 start, end, size, rest;
+		u64 start_up, start_down, end_up, end_down;
+		loff_t offset;
+		int rc, reuse = 0;
 
 		if (phdr_ptr->p_type != PT_LOAD)
 			continue;
 
-		paddr = phdr_ptr->p_offset;
-		start = rounddown(paddr, PAGE_SIZE);
-		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
-		size = end - start;
+		start = phdr_ptr->p_offset;
+		start_up = roundup(start, PAGE_SIZE);
+		start_down = rounddown(start, PAGE_SIZE);
+
+		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+		end_up = roundup(end, PAGE_SIZE);
+		end_down = rounddown(end, PAGE_SIZE);
+
+		size = end_up - start_down;
+		rest = phdr_ptr->p_memsz;
+
+		/* Add a head fractional page to vmcore list. */
+		if (!PAGE_ALIGNED(start)) {
+			/* Reuse the same buffer if multiple System
+			 * RAM entries show up in the same page.
			 */
+			list_for_each_entry(m, vc_list, list) {
+				if (m->paddr == start_down &&
+				    m->flags == VMCORE_2ND_KERNEL) {
+					new = m;
+					reuse = 1;
+					goto skip;
+				}
+			}
+
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->paddr = start_down;
+			new->size = PAGE_SIZE;
+			list_add_tail(&new->list, vc_list);
+ skip:
+
+			offset = start;
+			rc = __read_vmcore(new->buf + (start - start_down),
+					   min(start_up, end) - start,
+					   &offset, 0);
+			if (rc < 0)
+				return rc;
+
+			rest -= min(start_up, end) - start;
+		}
 
 		/* Add this contiguous chunk of memory to vmcore list.*/
-		new = get_new_element();
-		if (!new)
-			return -ENOMEM;
-		new->paddr = start;
-		new->size = size;
-		list_add_tail(&new->list, vc_list);
+		if (rest > 0 && start_up < end_down) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->size = end_down - start_up;
+			new->paddr = start_up;
+			list_add_tail(&new->list, vc_list);
+			rest -= end_down - start_up;
+		}
+
+		/* Add a tail fractional page to vmcore list. */
+		if (rest > 0) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = end_down;
+			list_add_tail(&new->list, vc_list);
+
+			offset = end_down;
+			rc = __read_vmcore(new->buf, end - end_down, &offset,
+					   0);
+			if (rc < 0)
+				return rc;
+
+			rest -= end - end_down;
+		}
+
+		WARN_ON(rest > 0);
 
 		/* Update the program header offset */
-		phdr_ptr->p_offset = vmcore_off + (paddr - start);
+		phdr_ptr->p_offset = vmcore_off + (start - start_down);
 		vmcore_off = vmcore_off + size;
+		if (reuse) {
+			phdr_ptr->p_offset -= PAGE_SIZE;
+			vmcore_off -= PAGE_SIZE;
+		}
 	}
 	return 0;
 }
@@ -1100,6 +1277,8 @@ void vmcore_cleanup(void)
 
 		m = list_entry(pos, struct vmcore, list);
 		list_del(&m->list);
+		if ((m->flags & VMCORE_2ND_KERNEL))
+			vfree(m->buf);
 		kfree(m);
 	}
 	free_elfcorebuf();
diff -puN include/linux/kcore.h~vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel include/linux/kcore.h
--- a/include/linux/kcore.h~vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel
+++ a/include/linux/kcore.h
@@ -19,11 +19,15 @@ struct kcore_list {
 	int type;
 };
 
+#define VMCORE_2ND_KERNEL 0x1
+
 struct vmcore {
 	struct list_head list;
 	unsigned long long paddr;
 	unsigned long long size;
 	loff_t offset;
+	char *buf;
+	unsigned long flags;
 };
 
 #ifdef CONFIG_PROC_KCORE
_

Patches currently in -mm which might be from d.hatayama@xxxxxxxxxxxxxx are

procfs-also-fix-proc_reg_get_unmapped_area-for-mmu-case.patch
procfs-also-fix-proc_reg_get_unmapped_area-for-mmu-case-fix.patch
vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel.patch
vmcore-copy-fractional-pages-into-buffers-in-the-kdump-2nd-kernel-fix.patch
--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html