On 06/02/25 14:40, K Prateek Nayak wrote: > What topology_span_sane() does is, it iterates over all the CPUs at a > given topology level and makes sure that the cpumask for a CPU at > that domain is same as the cpumask of every other CPU set on that mask > for that topology level. > > If two CPUs are set on a mask, they should have the same mask. If CPUs > are not set on each other's mask, the masks should be disjoint. > > On x86, the way set_cpu_sibling_map() works, CPUs are set on each other's > shared masks iff match_*() returns true: > > o For SMT, this means: > > - If X86_FEATURE_TOPOEXT is set: > - pkg_id must match. > - die_id must match. > - amd_node_id must match. > - llc_id must match. > - Either core_id or cu_id must match. (*) > - NUMA nodes must match. > > - If !X86_FEATURE_TOPOEXT: > - pkg_id must match. > - die_id must match. > - core_id must match. > - NUMA nodes must match. > > o For CLUSTER this means: > > - If l2c_id is not populated (== BAD_APICID) > - Same conditions as SMT. > > - If l2c_id is populated (!= BAD_APICID) > - l2c_id must match. > - NUMA nodes must match. > > o For MC it means: > > - llc_id must be populated (!= BAD_APICID) and must match. > - If INTEL_SNC: pkg_id must match. > - If !INTEL_SNC: NUMA nodes must match. > > o For PKG domain: > > - Inserted only if !x86_has_numa_in_package. > - CPUs should be in same NUMA node. > > All in all, other that the one (*) decision point, everything else has > to strictly match for CPUs to be set in each other's CPU mask. And if > they match with one CPU, they should match will all other CPUs in mask > and it they mismatch with one, they should mismatch with all leading > to link_mask() never being called. > Nice summary, thanks for that - I'm not that familiar with the x86 topology faff. > This is why I think that the topology_span_sane() check is redundant > when the x86 bits have already ensured masks cannot overlap in all > cases except for potentially in the (*) case. > > So circling back to my original question around "SDTL_ARCH_VERIFIED", > would folks be okay to an early bailout from topology_span_sane() on: > > if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED)) > return; > > and more importantly, do folks care enough about topology_span_sane() > to have it run on other architectures and not just have it guarded > behind just "sched_debug()" which starts off as false by default? > If/when possible I prefer to have sanity checks run unconditionally, as long as they don't noticeably impact runtime. Unfortunately this does show up in the boot time, though Steve had a promising improvement for that. Anyway, if someone gets one of those hangs on a do { } while (group != sd->groups) they'll quickly turn on sched_verbose (or be told to) and the sanity check will holler at them, so I'm not entirely against it. > (Sorry for the long answer explaining my thought process.) > >> >> That I can't remember, sorry :/ > > -- > Thanks and Regards, > Prateek