On 9/13/2018 5:53 AM, Brice Goglin wrote:
On 13/09/2018 at 11:35, Sudeep Holla wrote:
On Thu, Sep 13, 2018 at 10:39:10AM +0100, James Morse wrote:
Hi Brice,
On 13/09/18 06:51, Brice Goglin wrote:
On 12/09/2018 at 11:49, Sudeep Holla wrote:
Yes. Without this change, we hit the lscpu error in the commit message,
and get zero output about the system. We don't even get information
about the caches which are architecturally specified or how many cpus
are present. With this change, we get what we expect out of lscpu (and
also lstopo) including the cache(s) which are not architecturally
specified.
lscpu and lstopo are so broken. They just assume everything on CPU0.
If you hotplug them out, you start seeing issues. So they try to read a file
that doesn't exist and then bail out on other essential info even though it
is present, hmmm ...
Can you elaborate?
I am not sure cpu0 is supposed to be offlineable on Linux. There's no
"online" file in /sys/devices/system/cpu/cpu0. That's why former lstopo
doesn't like CPU0 being hotplugged out. We are actually making that case
work for another non-standard corner case. But if offlining "cpu0" is
considered "normal", somebody must add that missing "online" sysfs
attribute for "cpu0" (change
https://elixir.bootlin.com/linux/latest/source/drivers/base/cpu.c#L375).
On x86 you can't normally offline CPU0; it's something to do with certain
interrupts always being routed to CPU0 (oh, and hibernate).
You should be able to enable this behaviour with 'cpu0_hotplug' on the kernel
command line.
(Kconfig's CONFIG_BOOTPARAM_HOTPLUG_CPU0 and CONFIG_DEBUG_HOTPLUG_CPU0 are also
worth a look)
On arm64 at least, cpu0 is just like the others, and can be offlined.
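Whether a given CPU exposes the "online" attribute discussed above can be
checked from userspace. The following is only a minimal sketch, not code from
lscpu or hwloc: the helper names cpu_online_path() and cpu_has_online_attr()
are made up for illustration, and the sysfs root is passed in as a parameter
so that a dumped tree can be checked too.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Build "<root>/cpu<N>/online" into buf.
 * Returns 0 on success, -1 if the path did not fit.
 * (Hypothetical helper, not part of any existing tool.)
 */
int cpu_online_path(char *buf, size_t len, const char *root, int cpu)
{
    int n = snprintf(buf, len, "%s/cpu%d/online", root, cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Returns 1 if the "online" attribute exists for this CPU, 0 otherwise.
 * A missing attribute usually means the CPU cannot be offlined from
 * userspace, so a tool should treat it as always online rather than fail.
 */
int cpu_has_online_attr(const char *root, int cpu)
{
    char path[256];

    if (cpu_online_path(path, sizeof(path), root, cpu) != 0)
        return 0;
    return access(path, F_OK) == 0;
}
```

Calling cpu_has_online_attr("/sys/devices/system/cpu", 0) on a machine would
then tell a tool whether it has to consider cpu0 going away at all.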
Thanks James, for providing all the details.
To add to the issues I spotted with lscpu/lstopo around topology, they ignore
the updates to topology sibling masks when CPUs are hotplugged in and out.
We have the following in lscpu:
add_summary_n(tb, _("Core(s) per socket:"),
cores_per_socket ?: desc->ncores / desc->nsockets);
Now when cores_per_socket = 0 (i.e. when we don't have the procfs entry),
if ncores = (ncores_max - few_cpus_hotplugged_out), core(s) per socket
will get computed as less than the actual number.
IMO lscpu should be used only when all CPUs are online, and it should print
a warning when not all cores are online.
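To make the arithmetic concrete, here is a small sketch of that fallback. It
is not the lscpu source: the function name cores_per_socket_shown() is made
up, and the figures in the usage note (2 sockets, 8 cores, 2 cores
hotplugged out) are purely hypothetical.

```c
/*
 * Mirrors the "a ?: b" fallback in the quoted snippet: use the explicit
 * cores_per_socket value if it is non-zero, otherwise derive it from the
 * number of cores currently counted and the number of sockets.
 * (Hypothetical helper for illustration only.)
 */
int cores_per_socket_shown(int cores_per_socket, int ncores, int nsockets)
{
    if (cores_per_socket)
        return cores_per_socket;
    /* Guard against division by zero; the quoted snippet has no such guard. */
    return nsockets > 0 ? ncores / nsockets : 0;
}
```

With all 8 cores counted, cores_per_socket_shown(0, 8, 2) gives 4. With two
cores hotplugged out, cores_per_socket_shown(0, 6, 2) gives 3, although each
socket still physically has 4 cores.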
By the way, did anybody actually see an error with lstopo when there's
no "type" attribute for L3? I can't reproduce any issue, we just skip
that specific cache entirely, but everything else appears. If you guys
want to make that "no_cache" cache appear, I'll make it a Unified cache
unless you tell me what to show :)
IIUC, Jeffrey Hugo did see an error as per his initial message:
"
This fixes the following lscpu issue where only the cache type sysfs file
is missing which results in no output providing a poor user experience in
the above system configuration.
lscpu: cannot open /sys/devices/system/cpu/cpu0/cache/index3/type: No such
file or directory
"
I don't know about lscpu (it's a different project), but lstopo
shouldn't have any such problem.
If you see an issue with lstopo, I'd be interested in getting the
tarball generated by hwloc-gather-topology (it dumps useful files from
procfs and sysfs so that we may debug offline).
No error was reported with lstopo, but we don't see the cache as
expected. Fixing the type results in the expected lstopo output. This
seems consistent with your expectations.
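For reference, the difference between bailing out and skipping a single cache
comes down to how a failed open of the "type" file is handled. The sketch
below is an assumption about how a tool could do it, not the actual lscpu or
hwloc code; read_sysfs_line() and cache_type_or_default() are hypothetical
names, and the "Unknown" label is an arbitrary choice.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a one-line sysfs attribute (e.g. .../cache/index3/type) into buf.
 * Returns 0 on success, -1 if the file is missing or unreadable, so the
 * caller can skip that one cache instead of aborting everything.
 */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';    /* strip the trailing newline */
    return 0;
}

/* Hypothetical policy: label a cache whose type cannot be read as "Unknown". */
const char *cache_type_or_default(const char *path, char *buf, size_t len)
{
    return read_sysfs_line(path, buf, len) == 0 ? buf : "Unknown";
}
```

A caller that gets -1 from read_sysfs_line() can still report the CPU count
and the remaining caches, which matches the lstopo behaviour described above.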
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.