Re: [PATCH v3] sched/topology: Enable topology_span_sane check only for debug builds

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2/6/2025 8:54 PM, Valentin Schneider wrote:
On 06/02/25 14:40, K Prateek Nayak wrote:
What topology_span_sane() does is, it iterates over all the CPUs at a
given topology level and makes sure that the cpumask for a CPU at
that domain is same as the cpumask of every other CPU set on that mask
for that topology level.

If two CPUs are set on a mask, they should have the same mask. If CPUs
are not set on each other's mask, the masks should be disjoint.

On x86, the way set_cpu_sibling_map() works, CPUs are set on each other's
shared masks iff match_*() returns true:

o For SMT, this means:

    - If X86_FEATURE_TOPOEXT is set:
      - pkg_id must match.
      - die_id must match.
      - amd_node_id must match.
      - llc_id must match.
      - Either core_id or cu_id must match. (*)
      - NUMA nodes must match.

    - If !X86_FEATURE_TOPOEXT:
      - pkg_id must match.
      - die_id must match.
      - core_id must match.
      - NUMA nodes must match.

o For CLUSTER this means:

    - If l2c_id is not populated (== BAD_APICID)
      - Same conditions as SMT.

    - If l2c_id is populated (!= BAD_APICID)
      - l2c_id must match.
      - NUMA nodes must match.

o For MC it means:

    - llc_id must be populated (!= BAD_APICID) and must match.
    - If INTEL_SNC: pkg_id must match.
    - If !INTEL_SNC: NUMA nodes must match.

o For PKG domain:

    - Inserted only if !x86_has_numa_in_package.
    - CPUs should be in same NUMA node.

All in all, other that the one (*) decision point, everything else has
to strictly match for CPUs to be set in each other's CPU mask. And if
they match with one CPU, they should match will all other CPUs in mask
and it they mismatch with one, they should mismatch with all leading
to link_mask() never being called.


Nice summary, thanks for that - I'm not that familiar with the x86 topology
faff.


This is why I think that the topology_span_sane() check is redundant
when the x86 bits have already ensured masks cannot overlap in all
cases except for potentially in the (*) case.

So circling back to my original question around "SDTL_ARCH_VERIFIED",
would folks be okay to an early bailout from topology_span_sane() on:

      if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
       return;

and more importantly, do folks care enough about topology_span_sane()
to have it run on other architectures and not just have it guarded
behind just "sched_debug()" which starts off as false by default?


If/when possible I prefer to have sanity checks run unconditionally, as
long as they don't noticeably impact runtime. Unfortunately this does show
up in the boot time, though Steve had a promising improvement for that.

Anyway, if someone gets one of those hangs on a

   do { } while (group != sd->groups)

they'll quickly turn on sched_verbose (or be told to) and the sanity check
will holler at them, so I'm not entirely against it.


Thanks for the feedback :)

Regards,
Naman




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux