On 06.05.24 11:39, Baolin Wang wrote:
Ccing David.
On 2024/5/3 00:02, Markus Gothe wrote:
Hi,
on rare occasions I run into the following crash:
[ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
[ 41.427193] lr : compaction_alloc+0x220/0x804
[ 41.431544] sp : ffffffc01104bb10
[ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
[ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
[ 41.445477] x25: 0000000000000066 x24: 0000000000090800
[ 41.450789] x23: 0000000000000200 x22: 0000000000084000
[ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
[ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
[ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
[ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
[ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
[ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
[ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
[ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
[ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
[ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
[ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
[ 41.514443] Call trace:
[ 41.516887] set_pfnblock_flags_mask+0x50/0x94
[ 41.521330] migrate_pages+0x90/0x7f0
[ 41.524992] compact_zone+0x854/0x9f0
[ 41.528647] kcompactd_do_work+0x168/0x230
[ 41.532734] kcompactd+0x58/0x140
[ 41.536043] kthread+0x120/0x124
[ 41.539263] ret_from_fork+0x10/0x24
[ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
[ 41.548925] ---[ end trace 731400a587304db3 ]---
I've pinpointed it to the pageblock_flags pointer being initialized to NULL under certain conditions. I don't know why this happens.
Maybe it is some obscure race condition which only shows up on my system.
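As a cross-check of the NULL-pointer theory, the faulting instruction word in the Code: line (the one in parentheses) can be decoded by hand. The sketch below is illustrative only: decode_ldr_reg and the regs dictionary are hypothetical names, the bit layout is the A64 LDR (register) encoding as I understand it, and the register values are copied from the dump above, assuming they are unchanged at the point of the fault.

```python
def decode_ldr_reg(word):
    """Decode a 64-bit A64 'LDR Xt, [Xn, Xm{, LSL #3}]' instruction word."""
    # Fixed opcode bits of the 64-bit LDR (register) encoding; anything
    # else is rejected rather than guessed at.
    if word & 0xFFE00C00 != 0xF8600800:
        raise ValueError("not a 64-bit LDR (register) instruction")
    return {
        "rt": word & 0x1F,             # destination register
        "rn": (word >> 5) & 0x1F,      # base register
        "rm": (word >> 16) & 0x1F,     # index register
        "option": (word >> 13) & 0x7,  # 0b011 means LSL
        "shift": 3 if (word >> 12) & 1 else 0,  # S bit: scale index by 8
    }

# Register values copied from the dump (only the two the load uses).
regs = {7: 0x0000000000000000, 3: 0x0000000000000021}

fields = decode_ldr_reg(0xF86378E0)
# Effective address = base + (index << shift), valid for the LSL option.
addr = regs[fields["rn"]] + (regs[fields["rm"]] << fields["shift"])
print(fields, hex(addr))
```

Under those assumptions the word decodes to ldr x0, [x7, x3, lsl #3], and with x7 = 0 and x3 = 0x21 the computed address is 0x108, which matches x5 in the dump. That is consistent with a NULL base pointer for the bitmap, though it does not explain why the pointer is NULL.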
Is there memory hotplug in your test? It seems to be caused by the race
between memory hotplug and PFN walkers (such as compaction), which is
already a known issue.
I think I've never seen races with access to pageblocks but only with
access to the memmap.
Further, I'd not expect races during migrate_pages()? We're holding a
reference to all folios when calling migrate_pages(). So memory
offlining+removal would not be able to succeed until we drop these
references.
But could it be that we are failing during compaction_alloc() [lr :
compaction_alloc+0x220/0x804] and have an issue during
set_pfnblock_flags_mask() on a page that sits on the isolated freelist?
Similarly, memory hotunplug should not be able to mess up here.
[again, racing with memory hotunplug is unlikely]
On which kernel did we start seeing this issue?
--
Cheers,
David / dhildenb