On Saturday, 22 February 2025 12:29:54 Central European Standard Time Oleksandr Natalenko wrote:
> Hello.
>
> On Friday, 21 February 2025 1:52:59 Central European Standard Time Rik van Riel wrote:
> > Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> >
> > This allows the kernel to invalidate TLB entries on remote CPUs without
> > needing to send IPIs, without having to wait for remote CPUs to handle
> > those interrupts, and with less interruption to what was running on
> > those CPUs.
> >
> > Because x86 PCID space is limited, and there are some very large
> > systems out there, broadcast TLB invalidation is only used for
> > processes that are active on 3 or more CPUs, with the threshold
> > being gradually increased the more the PCID space gets exhausted.
> >
> > Combined with the removal of unnecessary lru_add_drain calls
> > (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> > nice performance boost for the will-it-scale tlb_flush2_threads
> > test on an AMD Milan system with 36 cores:
> >
> > - vanilla kernel: 527k loops/second
> > - lru_add_drain removal: 731k loops/second
> > - only INVLPGB: 527k loops/second
> > - lru_add_drain + INVLPGB: 1157k loops/second
> >
> > Profiling with only the INVLPGB changes showed while
> > TLB invalidation went down from 40% of the total CPU
> > time to only around 4% of CPU time, the contention
> > simply moved to the LRU lock.
> >
> > Fixing both at the same time about doubles the
> > number of iterations per second from this case.
> >
> > Some numbers closer to real world performance
> > can be found at Phoronix, thanks to Michael:
> >
> > https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
> >
> > My current plan is to implement support for Intel's RAR
> > (Remote Action Request) TLB flushing in a follow-up series,
> > after this thing has been merged into -tip. Making things
> > any larger would just be unwieldy for reviewers.
> >
> > v12:
> >  - make sure "nopcid" command line option turns off invlpgb (Brendan)
> >  - add "noinvlpgb" kernel command line option
> >  - split out kernel TLB flushing differently (Dave & Yosry)
> >  - split up the patch that does invlpgb flushing for user processes (Dave)
> >  - clean up get_flush_tlb_info (Boris)
> >  - move invlpgb_count_max initialization to get_cpu_cap (Boris)
> >  - bunch more comments as requested
>
> Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

Can also reproduce this by simply offlining/onlining a CPU via `/sys/devices/system/cpu/cpuX/online`.

>
> ```
> [ 24.373391] ACPI: PM: Low-level resume complete
> [ 24.373929] ACPI: PM: Restoring platform NVS memory
> [ 24.375024] Enabling non-boot CPUs ...
> [ 24.375777] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [ 24.376463] BUG: unable to handle page fault for address: ffffffffa3ba4d60
> [ 24.377383] #PF: supervisor write access in kernel mode
> [ 24.377912] #PF: error_code(0x0003) - permissions violation
> [ 24.378413] PGD 25427067 P4D 25427067 PUD 25428063 PMD 8000000024c001a1
> [ 24.379020] Oops: Oops: 0003 [#1] PREEMPT SMP NOPTI
> [ 24.379503] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 6.14.0-pf0 #1 161e4891fb5044b2d7438cd1852eeaac0cdffab5
> [ 24.380650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
> [ 24.381810] Code: 08 c7 44 24 08 00 00 00 00 48 8d 4c 24 0c e8 3c 00 04 00 90 8b 44 24 04 89 43 64 0f b7 44 24 0c 83 c0 01 81 7b 24 09 00 00 80 <66> 89 05 0e ab 8b 01 0f 86 18 fd ff ff c7 44 24 14 00 00 00 00 4c
> [ 24.383629] RSP: 0000:ffffafbec00efe70 EFLAGS: 00010012
> [ 24.384155] RAX: 0000000000000001 RBX: ffff8b3fbcb19020 RCX: 0000000000001001
> [ 24.384862] RDX: 0000000000000000 RSI: ffffafbec00efe74 RDI: ffffafbec00efe78
> [ 24.385603] RBP: ffffafbec00efe88 R08: ffffafbec00efe70 R09: ffffafbec00efe7c
> [ 24.386318] R10: 0000000000002430 R11: ffff8b3fa5428000 R12: ffffafbec00efe8c
> [ 24.387014] R13: ffffafbec00efe84 R14: ffffafbec00efe80 R15: ffffafbec00efe70
> [ 24.387713] FS: 0000000000000000(0000) GS:ffff8b3fbcb00000(0000) knlGS:0000000000000000
> [ 24.388502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 24.389074] CR2: ffffffffa3ba4d60 CR3: 0000000025422000 CR4: 0000000000350ef0
> [ 24.389769] Call Trace:
> [ 24.390020] <TASK>
> [ 24.392234] identify_cpu+0xd4/0x890
> [ 24.392593] identify_secondary_cpu+0x12/0x40
> [ 24.393032] smp_store_cpu_info+0x49/0x60
> [ 24.393430] start_secondary+0x7f/0x140
> [ 24.393810] common_startup_64+0x13e/0x141
> [ 24.394218] </TASK>
>
> $ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
> get_cpu_cap+0x39b/0x500:
> get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063
>
> 1060     if (c->extended_cpuid_level >= 0x80000008) {
> 1061         cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> 1062         c->x86_capability[CPUID_8000_0008_EBX] = ebx;
> 1063         invlpgb_count_max = (edx & 0xffff) + 1;
> 1064     }
> ```
>
> Any idea what I'm looking at?
>
> Thank you.
>
> > v11:
> >  - resolve conflict with CONFIG_PT_RECLAIM code
> >  - a few more cleanups (Peter, Brendan, Nadav)
> > v10:
> >  - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
> >  - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
> >  - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
> >  - various cleanups (Brendan)
> > v9:
> >  - print warning when start or end address was rounded (Peter)
> >  - in the reclaim code, tlbsync at context switch time (Peter)
> >  - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
> > v8:
> >  - round start & end to handle non-page-aligned callers (Steven & Jan)
> >  - fix up changelog & add tested-by tags (Manali)
> > v7:
> >  - a few small code cleanups (Nadav)
> >  - fix spurious VM_WARN_ON_ONCE in mm_global_asid
> >  - code simplifications & better barriers (Peter & Dave)
> > v6:
> >  - fix info->end check in flush_tlb_kernel_range (Michael)
> >  - disable broadcast TLB flushing on 32 bit x86
> > v5:
> >  - use byte assembly for compatibility with older toolchains (Borislav, Michael)
> >  - ensure a panic on an invalid number of extra pages (Dave, Tom)
> >  - add cant_migrate() assertion to tlbsync (Jann)
> >  - a bunch more cleanups (Nadav)
> >  - key TCE enabling off X86_FEATURE_TCE (Andrew)
> >  - fix a race between reclaim and ASID transition (Jann)
> > v4:
> >  - Use only bitmaps to track free global ASIDs (Nadav)
> >  - Improved AMD initialization (Borislav & Tom)
> >  - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
> >  - Fixes for subtle race conditions (Jann)
> > v3:
> >  - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> >  - More suggested cleanups and changelog fixes by Peter and Nadav
> > v2:
> >  - Apply suggestions by Peter and Borislav (thank you!)
> >  - Fix bug in arch_tlbbatch_flush, where we need to do both
> >    the TLBSYNC, and flush the CPUs that are in the cpumask.
> >  - Some updates to comments and changelogs based on questions.
> >
>

-- 
Oleksandr Natalenko, MSE