Christian Brauner <brauner@xxxxxxxxxx> writes:

> On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
>> Andreas Dilger <adilger@xxxxxxxxx> writes:
>>
>> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>> >>
>> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>> >>
>> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
>> >>> reserving this page address from allocation to avoid similar issues in
>> >>> other parts of the code, as is done with the NULL/0 page address.
>> >>
>> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit
>> >> box. Recall what ERR_PTR() is...
>> >>
>> >> On any architecture the virtual addresses in range (unsigned long)-512..
>> >> (unsigned long)-1 must never resolve to valid kernel objects.
>> >> In other words, any kind of wraparound here is asking for an oops on
>> >> attempts to access the elements of buffer - kernel dereference of
>> >> (char *)0xfffff000 on a 32bit box is already a bug.
>> >>
>> >> It might be getting an invalid pointer, but arithmetical overflows
>> >> are irrelevant.
>> >
>> > The original bug report stated that search_buf = 0xfffff000 on entry,
>> > and I'd quoted that at the start of my email:
>> >
>> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@xxxxxxxxxx> wrote:
>> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> >> is a sane address.
>> >
>> > Now that you mention ERR_PTR() it definitely makes sense that this last
>> > page HAS to be excluded.
>> >
>> > So some other bug is passing the bad pointer to this code before this
>> > error, or the arch is not correctly excluding this page from allocation.
>>
>> Yeah, something is off for sure.
>>
>> (FWIW, I manage to hit this for Linus' master as well.)
>>
>> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
>> this BT:
>>
>> [<c01e8b34>] __filemap_add_folio+0x322/0x508
>> [<c01e8d6e>] filemap_add_folio+0x54/0xce
>> [<c01ea076>] __filemap_get_folio+0x156/0x2aa
>> [<c02df346>] __getblk_slow+0xcc/0x302
>> [<c02df5f2>] bdev_getblk+0x76/0x7a
>> [<c03519da>] ext4_getblk+0xbc/0x2c4
>> [<c0351cc2>] ext4_bread_batch+0x56/0x186
>> [<c036bcaa>] __ext4_find_entry+0x156/0x578
>> [<c036c152>] ext4_lookup+0x86/0x1f4
>> [<c02a3252>] __lookup_slow+0x8e/0x142
>> [<c02a6d70>] walk_component+0x104/0x174
>> [<c02a793c>] path_lookupat+0x78/0x182
>> [<c02a8c7c>] filename_lookup+0x96/0x158
>> [<c02a8d76>] kern_path+0x38/0x56
>> [<c0c1cb7a>] init_mount+0x5c/0xac
>> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
>> [<c0c01cce>] prepare_namespace+0x226/0x27c
>> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
>> [<c0b97ab8>] kernel_init+0x2a/0x156
>> [<c0ba22ca>] ret_from_fork+0xe/0x20
>>
>> I get a folio where folio_address(folio) == 0xfffff000 (which is
>> broken).
>>
>> Need to go into the weeds here...
>
> I don't see anything obvious that could explain this right away. Did you
> manage to reproduce this on any other architecture and/or filesystem?
>
> Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> through the mm/ layer between v6.7 and v6.8 that might also be
> interesting. But really I'm poking in the dark currently.

Thanks for getting back! Spent some more time on it today.

It seems that the buddy allocator *can* return a page with a VA that can
wrap (0xfffff000 -- pointed out by Nam and myself).

Further, it seems like riscv32 indeed inserts a page like that to the
buddy allocator, when the memblock is free'd:

  | [<c024961c>] __free_one_page+0x2a4/0x3ea
  | [<c024a448>] __free_pages_ok+0x158/0x3cc
  | [<c024b1a4>] __free_pages_core+0xe8/0x12c
  | [<c0c1435a>] memblock_free_pages+0x1a/0x22
  | [<c0c17676>] memblock_free_all+0x1ee/0x278
  | [<c0c050b0>] mem_init+0x10/0xa4
  | [<c0c1447c>] mm_core_init+0x11a/0x2da
  | [<c0c00bb6>] start_kernel+0x3c4/0x6de

Here, a page with VA 0xfffff000 is added to the freelist. We were just
lucky (unlucky?) that the page was used for the page cache.

A nasty patch like:
--8<--
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..a6a6abbe71b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 							unsigned int order)
 {
+	if ((long)page_address(page) == 0xfffff000L) {
+		return; // leak it
+	}
 
 	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
 		int nid = early_pfn_to_nid(pfn);
--8<--

...and it's gone.

I need to think more about what a proper fix is. Regardless; Christian,
Al, and Ted can all relax. ;-)


Björn
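
For reference, a small userspace sketch of why a page at VA 0xfffff000
is trouble on a 32-bit box. It models a 32-bit unsigned long with
uint32_t and mirrors the kernel's IS_ERR_VALUE() comparison, assuming
MAX_ERRNO is 4095 as in include/linux/err.h. The helper names
is_err_value32() and buf_end32() are made up for illustration and are
not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the kernel's MAX_ERRNO in include/linux/err.h. */
#define MAX_ERRNO 4095u

/*
 * Hypothetical helper: the IS_ERR_VALUE() comparison as it would
 * evaluate with a 32-bit unsigned long, modelled here with uint32_t.
 */
static inline bool is_err_value32(uint32_t addr)
{
	return addr >= (uint32_t)-MAX_ERRNO;	/* i.e. addr >= 0xfffff001 */
}

/*
 * Hypothetical helper: one-past-the-end address of a buffer, computed
 * in 32-bit arithmetic the way "base + size" would be on riscv32.
 */
static inline uint32_t buf_end32(uint32_t base, uint32_t size)
{
	return base + size;	/* unsigned overflow wraps modulo 2^32 */
}
```

With these, is_err_value32() is false for 0xfffff000 itself but true for
0xfffff001 and everything above it, so any object placed in that page
past offset 0 is indistinguishable from an ERR_PTR(). Likewise,
buf_end32(0xfffff000, 4096) is 0, which matches the wrap to zero reported
for ext4_search_dir().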