Re: [RFC PATCH 00/14] Rearrange batched folio freeing

Ryan Roberts <ryan.roberts@xxxxxxx> · Wed, 6 Sep 2023 11:23:03 +0100

On 06/09/2023 04:48, Matthew Wilcox wrote:
> On Tue, Sep 05, 2023 at 03:00:51PM +0100, Matthew Wilcox wrote:
>> On Tue, Sep 05, 2023 at 02:26:54PM +0100, Ryan Roberts wrote:
>>> On 05/09/2023 14:15, Matthew Wilcox wrote:
>>>> On Mon, Sep 04, 2023 at 02:25:41PM +0100, Ryan Roberts wrote:
>>>>> I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so thought I would share in case you have any ideas?
>>>>
>>>> I didn't hit that with my testing.  Admittedly I was using xfs rather
>>>> than ext4, but ...
>>>
>>> I've only seen it once.
>>>
>>> I have a bit of a hybrid setup - my rootfs is xfs (and using large folios), but
>>> the linux tree (which is being built during the benchmark) is on an ext4
>>> partition. Large anon folios is enabled in this config, so there will be plenty
>>> of large folios in the system.
>>>
>>> I'm not sure if the fact that this fired from the ext4 path is too relevant -
>>> the page with the dodgy index is already on the PCP list so may or may not be large.
>>
>> Indeed.  I have a suspicion that this may be more common, but if pages
>> are commonly freed to and allocated from the PCP list without ever being
>> transferred to the free list, we'll never see it.  Perhaps adding a
>> check when pages are added to the PCP list that page->index is less
>> than 8 would catch the miscreant relatively quickly?
> 
> Somehow my qemu setup started working again.  This stuff is black magic.
> 
> Anyway, I did this:
> 
> +++ b/mm/page_alloc.c
> @@ -2405,6 +2405,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> 
>         __count_vm_events(PGFREE, 1 << order);
>         pindex = order_to_pindex(migratetype, order);
> +       VM_BUG_ON_PAGE(page->index > 7, page);
>         list_add(&page->pcp_list, &pcp->lists[pindex]);
>         pcp->count += 1 << order;
> 
> 
> but I haven't caught a wascally wabbit yet after an hour of running
> xfstests.  I think that's the only place we add a page to the
> pcp->lists.

I added a smattering of VM_BUG_ON_PAGE(page->index > 5, page) to the places where the page is added to and removed from the pcp lists. One triggered on removing the page from the list (the same place I saw the UBSAN oops previously). But there is no page info dumped! I've enabled CONFIG_DEBUG_VM (and friends). I can't see how it's possible to get the BUG report but not the dump_page() bit - what am I doing wrong?

Anyway, the fact that it did not trigger on insertion into the list suggests this is a corruption issue? I'll keep trying...
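
To spell out the reasoning, here is a minimal userspace sketch of the same check-on-add / check-on-remove pattern. It is not kernel code: toy_page, toy_list, TOY_INDEX_MAX and toy_scenario() are hypothetical names made up for illustration, and the "corruption" is just a direct write to the field while the entry sits on the list. The only point is that a value which passes the check on insertion but fails it on removal must have been modified in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the field under test. */
struct toy_page {
	unsigned long index;
	struct toy_page *next;
};

/* Hypothetical stand-in for one pcp list (singly linked, LIFO). */
struct toy_list {
	struct toy_page *head;
};

/* Assumed upper bound, mirroring the "> 5" used in the checks above. */
#define TOY_INDEX_MAX 5UL

/* Add to the list; returns false if the check fails on insertion. */
bool toy_list_add(struct toy_list *list, struct toy_page *page)
{
	if (page->index > TOY_INDEX_MAX)
		return false;
	page->next = list->head;
	list->head = page;
	return true;
}

/* Remove the head; *ok reports whether the check passed on removal. */
struct toy_page *toy_list_pop(struct toy_list *list, bool *ok)
{
	struct toy_page *page = list->head;

	*ok = true;
	if (!page)
		return NULL;
	list->head = page->next;
	*ok = page->index <= TOY_INDEX_MAX;
	return page;
}

/*
 * Returns 0 if no check fires, 1 if the insertion check fires,
 * 2 if only the removal check fires.
 */
int toy_scenario(unsigned long initial_index, bool corrupt_while_listed)
{
	struct toy_list list = { .head = NULL };
	struct toy_page page = { .index = initial_index, .next = NULL };
	struct toy_page *popped;
	bool ok;

	if (!toy_list_add(&list, &page))
		return 1;

	/* Simulate something scribbling on the entry while it is listed. */
	if (corrupt_while_listed)
		page.index = 0xe7;

	popped = toy_list_pop(&list, &ok);
	assert(popped == &page);
	return ok ? 0 : 2;
}
```

In this model, toy_scenario(0, false) trips nothing, toy_scenario(9, false) trips the insertion check, and toy_scenario(0, true) trips only the removal check, which is the pattern I'm seeing.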

[  334.307831] kernel BUG at mm/page_alloc.c:1217!
[  334.312351] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[  334.318433] Modules linked in: nfs lockd grace sunrpc fscache netfs nls_iso8859_1 scsi_dh_rdac scsi_dh_emc scsi_dh_alua drm xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mlx5_core mlxfw pci_hyperv_intf crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce tls nvme psample nvme_core aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  334.359704] CPU: 26 PID: 260858 Comm: git Not tainted 6.5.0-rc4-ryarob01-all-debug #1
[  334.367521] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[  334.380285] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  334.387233] pc : free_pcppages_bulk+0x1b0/0x2d0
[  334.391753] lr : free_pcppages_bulk+0x1b0/0x2d0
[  334.396270] sp : ffff80008fe9b810
[  334.399571] x29: ffff80008fe9b810 x28: 0000000000000001 x27: 00000000000000e7
[  334.406694] x26: fffffc2007e07780 x25: ffff08181ed8f840 x24: ffff08181ed8f868
[  334.413817] x23: 0000000000000000 x22: ffff0818836caf80 x21: ffff800081bbf008
[  334.420939] x20: 0000000000000001 x19: ffff08181ed8f850 x18: 0000000000000000
[  334.428061] x17: 6666373066666666 x16: 2066666666666666 x15: 6632303030303030
[  334.435184] x14: 0000000000000000 x13: 2935203e20746d28 x12: 454741505f4e4f5f
[  334.442306] x11: 4755425f4d56203a x10: 6573756163656220 x9 : ffff80008014ef40
[  334.449429] x8 : 5f4d56203a657375 x7 : 6163656220646570 x6 : ffff08181dee0000
[  334.456551] x5 : ffff08181ed78d88 x4 : 0000000000000000 x3 : ffff80008fe9b538
[  334.463674] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000002b
[  334.470796] Call trace:
[  334.473230]  free_pcppages_bulk+0x1b0/0x2d0
[  334.477401]  free_unref_page_commit+0x124/0x2a8
[  334.481918]  free_unref_folios+0x3b4/0x4e8
[  334.486003]  release_unref_folios+0xac/0xf8
[  334.490175]  folios_put+0x100/0x228
[  334.493651]  __folio_batch_release+0x34/0x88
[  334.497908]  truncate_inode_pages_range+0x168/0x690
[  334.502773]  truncate_inode_pages_final+0x58/0x90
[  334.507464]  ext4_evict_inode+0x164/0x900
[  334.511463]  evict+0xac/0x160
[  334.514419]  iput+0x170/0x228
[  334.517375]  do_unlinkat+0x1d0/0x290
[  334.520938]  __arm64_sys_unlinkat+0x48/0x98
[  334.525108]  invoke_syscall+0x74/0xf8
[  334.528758]  el0_svc_common.constprop.0+0x58/0x130
[  334.533536]  do_el0_svc+0x40/0xa8
[  334.536837]  el0_svc+0x2c/0xb8
[  334.539881]  el0t_64_sync_handler+0xc0/0xc8
[  334.544052]  el0t_64_sync+0x1a8/0x1b0
[  334.547703] Code: aa1a03e0 90009dc1 91072021 97ff1097 (d4210000) 
[  334.553783] ---[ end trace 0000000000000000 ]---