On Mon, Feb 22, 2021 at 12:40:36PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 22, 2021 at 05:39:29PM +0100, David Hildenbrand wrote:
> > On 22.02.21 17:13, David Hildenbrand wrote:
> > > On 22.02.21 16:13, George Kennedy wrote:
> > > >
> > > > On 2/22/2021 4:52 AM, David Hildenbrand wrote:
> > > > > On 20.02.21 00:04, George Kennedy wrote:
> > > > > >
> > > > > > On 2/19/2021 11:45 AM, George Kennedy wrote:
> > > > > > >
> > > > > > > On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
> > > > > > > > On Fri, Feb 19, 2021 at 1:06 AM George Kennedy
> > > > > > > > <george.kennedy@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On 2/18/2021 3:55 AM, David Hildenbrand wrote:
> > > > > > > > > > On 17.02.21 21:56, Andrey Konovalov wrote:
> > > > > > > > > > > During boot, all non-reserved memblock memory is exposed to the
> > > > > > > > > > > buddy
> > > > > > > > > > > allocator. Poisoning all that memory with KASAN lengthens boot
> > > > > > > > > > > time,
> > > > > > > > > > > especially on systems with large amount of RAM. This patch makes
> > > > > > > > > > > page_alloc to not call kasan_free_pages() on all new memory.
> > > > > > > > > > >
> > > > > > > > > > > __free_pages_core() is used when exposing fresh memory during
> > > > > > > > > > > system
> > > > > > > > > > > boot and when onlining memory during hotplug. This patch adds a new
> > > > > > > > > > > FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok()
> > > > > > > > > > > through
> > > > > > > > > > > free_pages_prepare() from __free_pages_core().
> > > > > > > > > > >
> > > > > > > > > > > This has little impact on KASAN memory tracking.
> > > > > > > > > > >
> > > > > > > > > > > Assuming that there are no references to newly exposed pages
> > > > > > > > > > > before they
> > > > > > > > > > > are ever allocated, there won't be any intended (but buggy)
> > > > > > > > > > > accesses to
> > > > > > > > > > > that memory that KASAN would normally detect.
> > > > > > > > > > >
> > > > > > > > > > > However, with this patch, KASAN stops detecting wild and large
> > > > > > > > > > > out-of-bounds accesses that happen to land on a fresh memory page
> > > > > > > > > > > that
> > > > > > > > > > > was never allocated. This is taken as an acceptable trade-off.
> > > > > > > > > > >
> > > > > > > > > > > All memory allocated normally when the boot is over keeps getting
> > > > > > > > > > > poisoned as usual.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Andrey Konovalov <andreyknvl@xxxxxxxxxx>
> > > > > > > > > > > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
> > > > > > > > > > Not sure this is the right thing to do, see
> > > > > > > > > >
> > > > > > > > > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@xxxxxxxxxx
> > > > > > > > > >
> > > > > > > > > > Reversing the order in which memory gets allocated + used during
> > > > > > > > > > boot
> > > > > > > > > > (in a patch by me) might have revealed an invalid memory access
> > > > > > > > > > during
> > > > > > > > > > boot.
> > > > > > > > > >
> > > > > > > > > > I suspect that that issue would no longer get detected with your
> > > > > > > > > > patch, as the invalid memory access would simply not get detected.
> > > > > > > > > > Now, I cannot prove that :)
> > > > > > > > > Since David's patch we're having trouble with the iBFT ACPI table,
> > > > > > > > > which
> > > > > > > > > is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c".
> > > > > > > > > KASAN
> > > > > > > > > detects that it is being used after free when ibft_init() accesses
> > > > > > > > > the
> > > > > > > > > iBFT table, but as of yet we can't find where it gets freed (we've
> > > > > > > > > instrumented calls to kunmap()).
> > > > > > > > Maybe it doesn't get freed, but what you see is a wild or a large
> > > > > > > > out-of-bounds access. Since KASAN marks all memory as freed during the
> > > > > > > > memblock->page_alloc transition, such bugs can manifest as
> > > > > > > > use-after-frees.
> > > > > > >
> > > > > > > It gets freed and re-used. By the time the iBFT table is accessed by
> > > > > > > ibft_init() the page has been over-written.
> > > > > > >
> > > > > > > Setting page flags like the following before the call to kmap()
> > > > > > > prevents the iBFT table page from being freed:
> > > > > >
> > > > > > Cleaned up version:
> > > > > >
> > > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > > > > index 0418feb..8f0a8e7 100644
> > > > > > --- a/drivers/acpi/osl.c
> > > > > > +++ b/drivers/acpi/osl.c
> > > > > > @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address
> > > > > > pg_off, unsigned long pg_sz)
> > > > > >
> > > > > >  	pfn = pg_off >> PAGE_SHIFT;
> > > > > >  	if (should_use_kmap(pfn)) {
> > > > > > +		struct page *page = pfn_to_page(pfn);
> > > > > > +
> > > > > >  		if (pg_sz > PAGE_SIZE)
> > > > > >  			return NULL;
> > > > > > -		return (void __iomem __force *)kmap(pfn_to_page(pfn));
> > > > > > +		SetPageReserved(page);
> > > > > > +		return (void __iomem __force *)kmap(page);
> > > > > >  	} else
> > > > > >  		return acpi_os_ioremap(pg_off, pg_sz);
> > > > > >  }
> > > > > > @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address
> > > > > > pg_off, void __iomem *vaddr)
> > > > > >  	unsigned long pfn;
> > > > > >
> > > > > >  	pfn = pg_off >> PAGE_SHIFT;
> > > > > > -	if (should_use_kmap(pfn))
> > > > > > -		kunmap(pfn_to_page(pfn));
> > > > > > -	else
> > > > > > +	if (should_use_kmap(pfn)) {
> > > > > > +		struct page *page = pfn_to_page(pfn);
> > > > > > +
> > > > > > +		ClearPageReserved(page);
> > > > > > +		kunmap(page);
> > > > > > +	} else
> > > > > >  		iounmap(vaddr);
> > > > > >  }
> > > > > >
> > > > > > David, the above works, but wondering why it is now necessary. kunmap()
> > > > > > is not hit. What other ways could a page mapped via kmap() be unmapped?
> > > > > >
> > > > >
> > > > > Let me look into the code ... I have little experience with ACPI
> > > > > details, so bear with me.
> > > > >
> > > > > I assume that acpi_map()/acpi_unmap() map some firmware blob that is
> > > > > provided via firmware/bios/... to us.
> > > > >
> > > > > should_use_kmap() tells us whether
> > > > > a) we have a "struct page" and should kmap() that one
> > > > > b) we don't have a "struct page" and should ioremap.
> > > > >
> > > > > As it is a blob, the firmware should always reserve that memory region
> > > > > via memblock (e.g., memblock_reserve()), such that we either
> > > > > 1) don't create a memmap ("struct page") at all (-> case b) )
> > > > > 2) if we have to create a memmap, we mark the page PG_reserved and
> > > > > *never* expose it to the buddy (-> case a) )
> > > > >
> > > > >
> > > > > Are you telling me that in this case we might have a memmap for the HW
> > > > > blob that is *not* PG_reserved? In that case it most probably got
> > > > > exposed to the buddy where it can happily get allocated/freed.
> > > > >
> > > > > The latent BUG would be that that blob gets exposed to the system like
> > > > > ordinary RAM, and not reserved via memblock early during boot.
> > > > > Assuming that blob has a low physical address, with my patch it will
> > > > > get allocated/used a lot earlier - which would mean we trigger this
> > > > > latent BUG now more easily.
> > > > >
> > > > > There have been similar latent BUGs on ARM boards that my patch
> > > > > discovered where special RAM regions did not get marked as reserved
> > > > > via the device tree properly.
> > > > >
> > > > > Now, this is just a wild guess :) Can you dump the page when mapping
> > > > > (before PageReserved()) and when unmapping, to see what the state of
> > > > > that memmap is?
> > > >
> > > > Thank you David for the explanation and your help on this,
> > > >
> > > > dump_page() before PageReserved and before kmap() in the above patch:
> > > >
> > > > [    1.116480] ACPI: Core revision 20201113
> > > > [    1.117628] XXX acpi_map: about to call kmap()...
> > > > [    1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0
> > > > mapping:0000000000000000 index:0x0 pfn:0xbe453
> > > > [    1.120381] flags: 0xfffffc0000000()
> > > > [    1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8
> > > > 0000000000000000
> > > > [    1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff
> > > > 0000000000000000
> > > > [    1.124146] page dumped because: acpi_map pre SetPageReserved
> > > >
> > > > I also added dump_page() before unmapping, but it is not hit. The
> > > > following for the same pfn now shows up I believe as a result of setting
> > > > PageReserved:
> > > >
> > > > [   28.098208] BUG: Bad page state in process modprobe  pfn:be453
> > > > [   28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0
> > > > mapping:0000000000000000 index:0x1 pfn:0xbe453
> > > > [   28.098394] flags: 0xfffffc0001000(reserved)
> > > > [   28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122
> > > > 0000000000000000
> > > > [   28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff
> > > > 0000000000000000
> > > > [   28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
> > > > [   28.098394] page_owner info is not present (never set?)
> > > > [   28.098394] Modules linked in:
> > > > [   28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66
> > > > [   28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > BIOS 0.0.0 02/06/2015
> > > > [   28.098394] Call Trace:
> > > > [   28.098394]  dump_stack+0xdb/0x120
> > > > [   28.098394]  bad_page.cold.108+0xc6/0xcb
> > > > [   28.098394]  check_new_page_bad+0x47/0xa0
> > > > [   28.098394]  get_page_from_freelist+0x30cd/0x5730
> > > > [   28.098394]  ? __isolate_free_page+0x4f0/0x4f0
> > > > [   28.098394]  ? init_object+0x7e/0x90
> > > > [   28.098394]  __alloc_pages_nodemask+0x2d8/0x650
> > > > [   28.098394]  ? write_comp_data+0x2f/0x90
> > > > [   28.098394]  ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110
> > > > [   28.098394]  ? __sanitizer_cov_trace_pc+0x21/0x50
> > > > [   28.098394]  alloc_pages_vma+0xe2/0x560
> > > > [   28.098394]  do_fault+0x194/0x12c0
> > > > [   28.098394]  ? write_comp_data+0x2f/0x90
> > > > [   28.098394]  __handle_mm_fault+0x1650/0x26c0
> > > > [   28.098394]  ? copy_page_range+0x1350/0x1350
> > > > [   28.098394]  ? write_comp_data+0x2f/0x90
> > > > [   28.098394]  ? write_comp_data+0x2f/0x90
> > > > [   28.098394]  handle_mm_fault+0x1f9/0x810
> > > > [   28.098394]  ? write_comp_data+0x2f/0x90
> > > > [   28.098394]  do_user_addr_fault+0x6f7/0xca0
> > > > [   28.098394]  exc_page_fault+0xaf/0x1a0
> > > > [   28.098394]  asm_exc_page_fault+0x1e/0x30
> > > > [   28.098394] RIP: 0010:__clear_user+0x30/0x60
> > >
> > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that
> > > someone is trying to allocate that page with the PG_reserved bit set.
> > > This means that the page actually was exposed to the buddy.
> > >
> > > However, when you SetPageReserved(), I don't think that PG_buddy is set
> > > and the refcount is 0. That could indicate that the page is on the buddy
> > > PCP list. Could be that it is getting reused a couple of times.
> > >
> > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables
> > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong
> > > table? Just a guess.
>
> Nah, ACPI MADT enumerates the table and that is the proper location of it.
> >
> > ... but I assume ibft_check_device() would bail out on an invalid checksum.
> > So the question is, why is this page not properly marked as reserved
> > already.
>
> The ibft_check_device ends up being called as module way way after the
> kernel has cleaned the memory.
>
> The funny thing about iBFT is that (it is also mentioned in the spec)
> that the table can resize in memory .. or in the ACPI regions (which
                     ^ reside I presume?
> have no E820_RAM and are considered "MMIO" regions).
>
> Either place is fine, so it can be in either RAM or MMIO :-(

I'd say that the tables in this case are in E820_RAM, because with MMIO we
wouldn't get to kmap() in the first place. This can easily be confirmed by
comparing the problematic address with /proc/iomem.

Can't say I have a clue about what's going on there, but the theory that
the iBFT table somehow does not get PG_reserved during boot makes sense.

Do you see "iBFT found at 0x<addr>" early in the kernel log?

I don't know if ACPI relocates the tables, but I could not find anywhere
that it reserves the original ones. The memblock_reserve() in
acpi_table_upgrade() is merely a part of an open-coded memblock allocation.

-- 
Sincerely yours,
Mike.
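
P.S. For the /proc/iomem comparison, here is a minimal userspace sketch in C.
It assumes 4 KiB pages (PAGE_SHIFT of 12) and the usual "start-end : name"
layout of /proc/iomem. The helper names pfn_to_phys() and iomem_lookup() are
made up for illustration and are not kernel APIs. With those assumptions,
PFN 0xbe453 from the dumps corresponds to physical address 0xbe453000, just
under 3 GiB.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages, as on x86-64 */

/* Hypothetical helper: physical address of the first byte of a page frame. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}

/*
 * Hypothetical helper: scan /proc/iomem-style text and copy the name of the
 * innermost range containing addr into name. Nested entries are listed after
 * their parent, so the last matching line is the innermost one.
 * Returns 1 if any range matched, 0 otherwise.
 */
static int iomem_lookup(FILE *f, uint64_t addr, char *name, size_t len)
{
	char line[256];
	int found = 0;

	while (fgets(line, sizeof(line), f)) {
		uint64_t start, end;
		char label[128];

		/* Lines look like "  00100000-bffdffff : System RAM". */
		if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %127[^\n]",
			   &start, &end, label) != 3)
			continue;

		if (addr >= start && addr <= end) {
			snprintf(name, len, "%s", label);
			found = 1;
		}
	}
	return found;
}
```

To use it, open /proc/iomem as root (unprivileged reads show all addresses as
zero) and call iomem_lookup(f, pfn_to_phys(0xbe453), name, sizeof(name)). A
result of "System RAM" would mean the table sits in E820_RAM rather than in a
reserved or ACPI region.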