On 2/22/2021 11:13 AM, David Hildenbrand wrote:
On 22.02.21 16:13, George Kennedy wrote:
On 2/22/2021 4:52 AM, David Hildenbrand wrote:
On 20.02.21 00:04, George Kennedy wrote:
On 2/19/2021 11:45 AM, George Kennedy wrote:
On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
On Fri, Feb 19, 2021 at 1:06 AM George Kennedy <george.kennedy@xxxxxxxxxx> wrote:
On 2/18/2021 3:55 AM, David Hildenbrand wrote:
On 17.02.21 21:56, Andrey Konovalov wrote:
During boot, all non-reserved memblock memory is exposed to the buddy allocator. Poisoning all that memory with KASAN lengthens boot time, especially on systems with a large amount of RAM. This patch makes page_alloc not call kasan_free_pages() on all new memory.

__free_pages_core() is used when exposing fresh memory during system boot and when onlining memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through free_pages_prepare() from __free_pages_core().

This has little impact on KASAN memory tracking. Assuming that there are no references to newly exposed pages before they are ever allocated, there won't be any intended (but buggy) accesses to that memory that KASAN would normally detect. However, with this patch, KASAN stops detecting wild and large out-of-bounds accesses that happen to land on a fresh memory page that was never allocated. This is taken as an acceptable trade-off.

All memory allocated normally once boot is over keeps getting poisoned as usual.
Signed-off-by: Andrey Konovalov <andreyknvl@xxxxxxxxxx>
Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
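(As a rough sketch of the flow the commit message describes - simplified and illustrative, not the actual patch - assuming the existing fpi_t flag convention in mm/page_alloc.c:)

/* Sketch: a new FPI flag, following the existing FPI_* convention. */
#define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))

static __always_inline bool free_pages_prepare(struct page *page,
			unsigned int order, bool check_free, fpi_t fpi_flags)
{
	/* ... usual checks and teardown elided ... */

	/* Only poison freed pages when the caller did not ask to skip it. */
	if (!(fpi_flags & FPI_SKIP_KASAN_POISON))
		kasan_free_pages(page, order);

	return true;
}

void __free_pages_core(struct page *page, unsigned int order)
{
	/* Fresh memory from memblock/hotplug: skip the KASAN poisoning. */
	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
}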
Not sure this is the right thing to do, see
https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@xxxxxxxxxx
Reversing the order in which memory gets allocated + used during boot (in a patch by me) might have revealed an invalid memory access during boot.

I suspect that that issue would no longer be reported with your patch, as the invalid memory access would simply not get detected. Now, I cannot prove that :)
Since David's patch we're having trouble with the iBFT ACPI table, which is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN detects that it is being used after free when ibft_init() accesses the iBFT table, but as of yet we can't find where it gets freed (we've instrumented calls to kunmap()).
Maybe it doesn't get freed, but what you see is a wild or a large out-of-bounds access. Since KASAN marks all memory as freed during the memblock->page_alloc transition, such bugs can manifest as use-after-frees.
It gets freed and re-used. By the time the iBFT table is accessed by ibft_init(), the page has been over-written.

Setting page flags like the following before the call to kmap() prevents the iBFT table page from being freed:
Cleaned up version:
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 0418feb..8f0a8e7 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz)
 	pfn = pg_off >> PAGE_SHIFT;
 	if (should_use_kmap(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+
 		if (pg_sz > PAGE_SIZE)
 			return NULL;
-		return (void __iomem __force *)kmap(pfn_to_page(pfn));
+		SetPageReserved(page);
+		return (void __iomem __force *)kmap(page);
 	} else
 		return acpi_os_ioremap(pg_off, pg_sz);
 }
@@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address pg_off, void __iomem *vaddr)
 	unsigned long pfn;
 
 	pfn = pg_off >> PAGE_SHIFT;
-	if (should_use_kmap(pfn))
-		kunmap(pfn_to_page(pfn));
-	else
+	if (should_use_kmap(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+
+		ClearPageReserved(page);
+		kunmap(page);
+	} else
 		iounmap(vaddr);
 }
David, the above works, but I'm wondering why it is now necessary. kunmap() is not hit. What other ways could a page mapped via kmap() be unmapped?
Let me look into the code ... I have little experience with ACPI details, so bear with me.

I assume that acpi_map()/acpi_unmap() map some firmware blob that is provided via firmware/bios/... to us. should_use_kmap() tells us whether
a) we have a "struct page" and should kmap() that one
b) we don't have a "struct page" and should ioremap.
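(For reference, should_use_kmap() in drivers/acpi/osl.c boils down to a page_is_ram() check; paraphrased here from memory, so treat it as a sketch rather than the exact upstream code:)

#ifdef CONFIG_IA64
/* ia64 cannot kmap() ACPI tables this way */
#define should_use_kmap(pfn)	0
#else
/* kmap() only if the PFN is RAM and therefore covered by the memmap */
#define should_use_kmap(pfn)	page_is_ram(pfn)
#endif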
As it is a blob, the firmware should always reserve that memory region via memblock (e.g., memblock_reserve()), such that we either
1) don't create a memmap ("struct page") at all (-> case b) ), or
2) if we have to create a memmap, mark the page PG_reserved and *never* expose it to the buddy (-> case a) ).

Are you telling me that in this case we might have a memmap for the HW blob that is *not* PG_reserved? In that case it most probably got exposed to the buddy, where it can happily get allocated/freed.
The latent BUG would be that that blob gets exposed to the system like ordinary RAM, and not reserved via memblock early during boot. Assuming the blob has a low physical address, with my patch it will get allocated/used a lot earlier - which would mean we trigger this latent BUG more easily now.

There have been similar latent BUGs on ARM boards, discovered by my patch, where special RAM regions did not get properly marked as reserved via the device tree.
Now, this is just a wild guess :) Can you dump the page when mapping (before PageReserved()) and when unmapping, to see what the state of that memmap is?
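(A minimal sketch of such instrumentation, assuming it is added to the workaround above in acpi_map(); dump_page() takes the page and a free-form reason string:)

	if (should_use_kmap(pfn)) {
		struct page *page = pfn_to_page(pfn);

		/* dump the memmap state before we touch the page */
		dump_page(page, "acpi_map pre SetPageReserved");
		if (pg_sz > PAGE_SIZE)
			return NULL;
		SetPageReserved(page);
		return (void __iomem __force *)kmap(page);
	}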
Thank you, David, for the explanation and your help on this.

dump_page() output before SetPageReserved() and before kmap() in the above patch:
[ 1.116480] ACPI: Core revision 20201113
[ 1.117628] XXX acpi_map: about to call kmap()...
[ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbe453
[ 1.120381] flags: 0xfffffc0000000()
[ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 0000000000000000
[ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1.124146] page dumped because: acpi_map pre SetPageReserved
I also added dump_page() before unmapping, but it is not hit. The following now shows up for the same pfn, I believe as a result of setting PageReserved:
[ 28.098208] BUG: Bad page state in process modprobe  pfn:be453
[ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453
[ 28.098394] flags: 0xfffffc0001000(reserved)
[ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 0000000000000000
[ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
[ 28.098394] page_owner info is not present (never set?)
[ 28.098394] Modules linked in:
[ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66
[ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 28.098394] Call Trace:
[ 28.098394] dump_stack+0xdb/0x120
[ 28.098394] bad_page.cold.108+0xc6/0xcb
[ 28.098394] check_new_page_bad+0x47/0xa0
[ 28.098394] get_page_from_freelist+0x30cd/0x5730
[ 28.098394] ? __isolate_free_page+0x4f0/0x4f0
[ 28.098394] ? init_object+0x7e/0x90
[ 28.098394] __alloc_pages_nodemask+0x2d8/0x650
[ 28.098394] ? write_comp_data+0x2f/0x90
[ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110
[ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50
[ 28.098394] alloc_pages_vma+0xe2/0x560
[ 28.098394] do_fault+0x194/0x12c0
[ 28.098394] ? write_comp_data+0x2f/0x90
[ 28.098394] __handle_mm_fault+0x1650/0x26c0
[ 28.098394] ? copy_page_range+0x1350/0x1350
[ 28.098394] ? write_comp_data+0x2f/0x90
[ 28.098394] ? write_comp_data+0x2f/0x90
[ 28.098394] handle_mm_fault+0x1f9/0x810
[ 28.098394] ? write_comp_data+0x2f/0x90
[ 28.098394] do_user_addr_fault+0x6f7/0xca0
[ 28.098394] exc_page_fault+0xaf/0x1a0
[ 28.098394] asm_exc_page_fault+0x1e/0x30
[ 28.098394] RIP: 0010:__clear_user+0x30/0x60
I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that someone is trying to allocate that page with the PG_reserved bit set. This means that the page actually was exposed to the buddy.

However, when you SetPageReserved(), I don't think that PG_buddy is set, and the refcount is 0. That could indicate that the page is on the buddy PCP list. Could be that it is getting reused a couple of times.
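(Conceptually, and simplified from mm/page_alloc.c, the check that fires here is along these lines; PG_reserved is part of the PAGE_FLAGS_CHECK_AT_PREP mask, so a page that comes back out of the allocator with it set is flagged as bad. This is a paraphrase, not the exact upstream code:)

	/* run when a page is handed out by the buddy allocator */
	static void check_new_page_sketch(struct page *page)
	{
		if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
			bad_page(page, "PAGE_FLAGS_CHECK_AT_PREP flag(s) set");
	}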
The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables close to 3 GiB? No idea. Could it be that you are trying to map the wrong table? Just a guess.
What would be the correct way to reserve the page so that the above would not be hit?
I would have assumed that if this is a binary blob, someone (which I think would be ACPI code) reserved it via memblock_reserve() early during boot.

E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve().
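(Paraphrasing from memory the relevant part of acpi_table_upgrade() in drivers/acpi/tables.c: once upgrade tables are found in the initrd, the blob is copied into memory that was found and reserved via memblock, so it never reaches the buddy. The exact helper names may differ between kernel versions:)

	acpi_tables_addr = memblock_find_in_range(0, ACPI_TABLE_UPGRADE_MAX_PHYS,
						  all_tables_size, PAGE_SIZE);
	if (!acpi_tables_addr)
		return;
	memblock_reserve(acpi_tables_addr, all_tables_size);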
acpi_table_upgrade() gets called, but bails out before memblock_reserve() is called. Thus, it appears no pages are getting reserved.
503 void __init acpi_table_upgrade(void)
504 {
505         void *data;
506         size_t size;
507         int sig, no, table_nr = 0, total_offset = 0;
508         long offset = 0;
509         struct acpi_table_header *table;
510         char cpio_path[32] = "kernel/firmware/acpi/";
511         struct cpio_data file;
512
513         if (IS_ENABLED(CONFIG_ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD)) {
514                 data = __initramfs_start;
515                 size = __initramfs_size;
516         } else {
517                 data = (void *)initrd_start;
518                 size = initrd_end - initrd_start;
519         }
520
521         if (data == NULL || size == 0)
522                 return;
523
524         for (no = 0; no < NR_ACPI_INITRD_TABLES; no++) {
525                 file = find_cpio_data(cpio_path, data, size, &offset);
526                 if (!file.data)
527                         break;
...
563                 all_tables_size += table->length;
564                 acpi_initrd_files[table_nr].data = file.data;
565                 acpi_initrd_files[table_nr].size = file.size;
566                 table_nr++;
567         }
568         if (table_nr == 0)
569                 return;                <-- bails out here

"drivers/acpi/tables.c"
George