On 06/05/2018 07:22 AM, Dan Williams wrote: > On Mon, Jun 4, 2018 at 8:32 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote: >> [ adding KASAN devs...] >> >> On Mon, Jun 4, 2018 at 4:40 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote: >>> On Sun, Jun 3, 2018 at 6:48 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote: >>>> On Sun, Jun 3, 2018 at 5:25 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>> On Mon, Jun 04, 2018 at 08:20:38AM +1000, Dave Chinner wrote: >>>>>> On Thu, May 31, 2018 at 09:02:52PM -0700, Dan Williams wrote: >>>>>>> On Thu, May 31, 2018 at 7:24 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>>>> On Thu, May 31, 2018 at 06:57:33PM -0700, Dan Williams wrote: >>>>>>>>>> FWIW, XFS+DAX used to just work on this setup (I hadn't even >>>>>>>>>> installed ndctl until this morning!) but after changing the kernel >>>>>>>>>> it no longer works. That would make it a regression, yes? >>>>>> >>>>>> [....] >>>>>> >>>>>>>>> I suspect your kernel does not have CONFIG_ZONE_DEVICE enabled which >>>>>>>>> has the following dependencies: >>>>>>>>> >>>>>>>>> depends on MEMORY_HOTPLUG >>>>>>>>> depends on MEMORY_HOTREMOVE >>>>>>>>> depends on SPARSEMEM_VMEMMAP >>>>>>>> >>>>>>>> Filesystem DAX now has a dependency on memory hotplug? >>>>>> >>>>>> [....] >>>>>> >>>>>>>> OK, works now I've found the magic config incantantions to turn >>>>>>>> everything I now need on. >>>>>> >>>>>> By enabling these options, my test VM now has a ~30s pause in the >>>>>> boot very soon after the nvdimm subsystem is initialised. >>>>>> >>>>>> [ 1.523718] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled >>>>>> [ 1.550353] 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A >>>>>> [ 1.552175] Non-volatile memory driver v1.3 >>>>>> [ 2.332045] tsc: Refined TSC clocksource calibration: 2199.909 MHz >>>>>> [ 2.333280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb5dcd4620, max_idle_ns: 440795264143 ns >>>>>> [ 37.217453] brd: module loaded >>>>>> [ 37.225423] loop: module loaded >>>>>> [ 37.228441] virtio_blk virtio2: [vda] 10485760 512-byte logical blocks (5.37 GB/5.00 GiB) >>>>>> [ 37.245418] virtio_blk virtio3: [vdb] 146800640 512-byte logical blocks (75.2 GB/70.0 GiB) >>>>>> [ 37.255794] virtio_blk virtio4: [vdc] 1073741824000 512-byte logical blocks (550 TB/500 TiB) >>>>>> [ 37.265403] nd_pmem namespace1.0: unable to guarantee persistence of writes >>>>>> [ 37.265618] nd_pmem namespace0.0: unable to guarantee persistence of writes >>>>>> >>>>>> The system does not appear to be consuming CPU, but it is blocking >>>>>> NMIs so I can't get a CPU trace. For a VM that I rely on booting in >>>>>> a few seconds because I reboot it tens of times a day, this is a >>>>>> problem.... >>>>> >>>>> And when I turn on KASAN, the kernel fails to boot to a login prompt >>>>> because: >>>> >>>> What's your qemu and kernel command line? I'll take look at this first >>>> thing tomorrow. >>> >>> I was able to reproduce this crash by just turning on KASAN... >>> investigating. It would still help to have your config for our own >>> regression testing purposes it makes sense for us to prioritize >>> "Dave's test config", similar to the priority of not breaking Linus' >>> laptop. >> >> I believe this is a bug in KASAN, or a bug in devm_memremap_pages(), >> depends on your point of view. At the very least it is a mismatch of >> assumptions. KASAN learns of hot added memory via the memory hotplug >> notifier. However, the devm_memremap_pages() implementation is >> intentionally limited to the "first half" of the memory hotplug >> procedure. I.e. it does just enough to setup the linear map for >> pfn_to_page() and initialize the "struct page" memmap, but then stops >> short of onlining the pages. This is why we are getting a NULL ptr >> deref and not a KASAN report, because KASAN has no shadow area setup >> for the linearly mapped pmem range. >> >> In terms of solving it we could refactor kasan_mem_notifier() so that >> devm_memremap_pages() can call it outside of the notifier... I'll give >> this a shot. > > Well, the attached patch got me slightly further, but only slightly... > > [ 14.998394] BUG: KASAN: unknown-crash in pmem_do_bvec+0x19e/0x790 [nd_pmem] > [ 15.000006] Read of size 4096 at addr ffff880200000000 by task > systemd-udevd/915 > [ 15.001991] > [ 15.002590] CPU: 15 PID: 915 Comm: systemd-udevd Tainted: G > OE 4.17.0-rc5+ #1 > 982 > [ 15.004783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.11.1-0-g0551a > 4be2c-prebuilt.qemu-project.org 04/01/2014 > [ 15.007652] Call Trace: > [ 15.008339] dump_stack+0x9a/0xeb > [ 15.009344] print_address_description+0x73/0x280 > [ 15.010524] kasan_report+0x258/0x380 > [ 15.011528] ? pmem_do_bvec+0x19e/0x790 [nd_pmem] > [ 15.012747] memcpy+0x1f/0x50 > [ 15.013659] pmem_do_bvec+0x19e/0x790 [nd_pmem] > > ...I've exhausted my limited kasan internals knowledge, any ideas what > it's missing? > Initialization is missing. kasan_mem_notifier() doesn't initialize shadow because it expects kasan_free_pages()/kasan_alloc_pages() will do that when page allocated/freed. So adding memset(shadow_start, 0, shadow_size); will make this work. But we shouldn't use kasan_mem_notifier here, as that would mean wasting a lot of memory only to store zeroes. A better solution would be mapping kasan_zero_page in shadow. The draft patch bellow demonstrates the idea (build tested only). --- include/linux/kasan.h | 14 ++++++++++++++ kernel/memremap.c | 10 ++++++++++ mm/kasan/kasan_init.c | 46 ++++++++++++++++++++++++++++++++++++---------- 3 files changed, 60 insertions(+), 10 deletions(-) diff --git a/include/linux/kasan.h b/include/linux/kasan.h index de784fd11d12..b5f5d2d9e46f 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -71,6 +71,10 @@ struct kasan_cache { int kasan_module_alloc(void *addr, size_t size); void kasan_free_shadow(const struct vm_struct *vm); +int kasan_add_zero_shadow(unsigned long start, unsigned long size); +void kasan_remove_zero_shadow(unsigned long start, unsigned long size); + + size_t ksize(const void *); static inline void kasan_unpoison_slab(const void *ptr) { ksize(ptr); } size_t kasan_metadata_size(struct kmem_cache *cache); @@ -124,6 +128,16 @@ static inline bool kasan_slab_free(struct kmem_cache *s, void *object, static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } static inline void kasan_free_shadow(const struct vm_struct *vm) {} +static inline int kasan_add_zero_shadow(unsigned long start, unsigned long size) +{ + return 0; +} +static inline int kasan_remove_zero_shadow(unsigned long start, + unsigned long size) +{ + return 0; +} + static inline void kasan_unpoison_slab(const void *ptr) { } static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } diff --git a/kernel/memremap.c b/kernel/memremap.c index 895e6b76b25e..1524dda52667 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -15,6 +15,7 @@ #include <linux/types.h> #include <linux/pfn_t.h> #include <linux/io.h> +#include <linux/kasan.h> #include <linux/mm.h> #include <linux/memory_hotplug.h> #include <linux/swap.h> @@ -309,6 +310,7 @@ static void devm_memremap_pages_release(void *data) mem_hotplug_begin(); arch_remove_memory(align_start, align_size, pgmap->altmap_valid ? &pgmap->altmap : NULL); + kasan_remove_zero_shadow((unsigned long)__va(align_start), align_size); mem_hotplug_done(); untrack_pfn(NULL, PHYS_PFN(align_start), align_size); @@ -395,6 +397,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) goto err_pfn_remap; mem_hotplug_begin(); + error = kasan_add_zero_shadow((unsigned long)__va(align_start), align_size); + if (error) { + mem_hotplug_done(); + goto err_kasan; + } + error = arch_add_memory(nid, align_start, align_size, altmap, false); if (!error) move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], @@ -423,6 +431,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) return __va(res->start); err_add_memory: + kasan_remove_zero_shadow((unsigned long)__va(align_start), align_size); + err_kasan: untrack_pfn(NULL, PHYS_PFN(align_start), align_size); err_pfn_remap: err_radix: diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c index f436246ccc79..160d35d28e62 100644 --- a/mm/kasan/kasan_init.c +++ b/mm/kasan/kasan_init.c @@ -21,6 +21,8 @@ #include <asm/page.h> #include <asm/pgalloc.h> +#include "kasan.h" + /* * This page serves two purposes: * - It used as early shadow memory. The entire shadow region populated @@ -41,13 +43,16 @@ pmd_t kasan_zero_pmd[PTRS_PER_PMD] __page_aligned_bss; #endif pte_t kasan_zero_pte[PTRS_PER_PTE] __page_aligned_bss; -static __init void *early_alloc(size_t size, int node) +static void *kasan_alloc(size_t size, int node) { + if (slab_is_available()) + return (void *)get_zeroed_page(GFP_KERNEL | __GFP_NOFAIL); + return memblock_virt_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS), BOOTMEM_ALLOC_ACCESSIBLE, node); } -static void __init zero_pte_populate(pmd_t *pmd, unsigned long addr, +static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr, unsigned long end) { pte_t *pte = pte_offset_kernel(pmd, addr); @@ -63,7 +68,7 @@ static void __init zero_pte_populate(pmd_t *pmd, unsigned long addr, } } -static void __init zero_pmd_populate(pud_t *pud, unsigned long addr, +static void __ref zero_pmd_populate(pud_t *pud, unsigned long addr, unsigned long end) { pmd_t *pmd = pmd_offset(pud, addr); @@ -79,13 +84,13 @@ static void __init zero_pmd_populate(pud_t *pud, unsigned long addr, if (pmd_none(*pmd)) { pmd_populate_kernel(&init_mm, pmd, - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); + kasan_alloc(PAGE_SIZE, NUMA_NO_NODE)); } zero_pte_populate(pmd, addr, next); } while (pmd++, addr = next, addr != end); } -static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr, +static void __ref zero_pud_populate(p4d_t *p4d, unsigned long addr, unsigned long end) { pud_t *pud = pud_offset(p4d, addr); @@ -104,13 +109,13 @@ static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr, if (pud_none(*pud)) { pud_populate(&init_mm, pud, - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); + kasan_alloc(PAGE_SIZE, NUMA_NO_NODE)); } zero_pmd_populate(pud, addr, next); } while (pud++, addr = next, addr != end); } -static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr, +static void __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr, unsigned long end) { p4d_t *p4d = p4d_offset(pgd, addr); @@ -133,7 +138,7 @@ static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr, if (p4d_none(*p4d)) { p4d_populate(&init_mm, p4d, - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); + kasan_alloc(PAGE_SIZE, NUMA_NO_NODE)); } zero_pud_populate(p4d, addr, next); } while (p4d++, addr = next, addr != end); @@ -145,7 +150,7 @@ static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr, * @shadow_start - start of the memory range to populate * @shadow_end - end of the memory range to populate */ -void __init kasan_populate_zero_shadow(const void *shadow_start, +void __ref kasan_populate_zero_shadow(const void *shadow_start, const void *shadow_end) { unsigned long addr = (unsigned long)shadow_start; @@ -192,8 +197,29 @@ void __init kasan_populate_zero_shadow(const void *shadow_start, if (pgd_none(*pgd)) { pgd_populate(&init_mm, pgd, - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); + kasan_alloc(PAGE_SIZE, NUMA_NO_NODE)); } zero_p4d_populate(pgd, addr, next); } while (pgd++, addr = next, addr != end); } + +int kasan_add_zero_shadow(unsigned long start, unsigned long size) +{ + unsigned long shadow_start, shadow_end; + + shadow_start = (unsigned long)kasan_mem_to_shadow((void *)start); + shadow_end = shadow_start + (size >> KASAN_SHADOW_SCALE_SHIFT); + + if (WARN_ON(start % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE)) || + WARN_ON(size % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE))) + return -EINVAL; + + kasan_populate_zero_shadow((void *)shadow_start, + (void *)shadow_end); + return 0; +} + +void kasan_remove_zero_shadow(unsigned long start, unsigned long size) +{ + /* TODO */ +} --