> -----Original Message----- > From: Dave Hansen [mailto:dave.hansen@xxxxxxxxx] > Sent: Friday, April 30, 2021 9:39 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; tiantao (H) > <tiantao6@xxxxxxxxxxxxx>; corbet@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx > Cc: linux-doc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rafael J. > Wysocki <rafael@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Valentin > Schneider <valentin.schneider@xxxxxxx>; Dave Hansen > <dave.hansen@xxxxxxxxxxxxxxx>; Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue > of sysfs pagebuf > > On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote: > >> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or > >> something? > > This kind of cpu list ABIs have been there for many years but have > > never been documented well. > > > > We have two ABIs: > > xxx_cpus - in format like 3333333333 > > xxx_cpus_list - in format like 0,3,5,7,9,11,13.... > > > > xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't > > include any more useful information than xxx_cpus. > > > > xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS > > in kconfig nowadays. > > > > if people all agree the trimmed list is a break of ABI, I think we may > > totally remove this list. For these days, this list probably has never > > overflowed but literally this could happen. > > > > thoughts? > > From what Greg said, it sounds like removing the BUILD_BUG_ON(), making > it a binary sysfs file, and making it support arbitrary lengths is the > way to go. I am actually more concerned on xxx_cpus_list than xxx_cpus. xxx_cpus has never overflowed. Though there is a BUILD_BUG_ON(), but the maximum NR_CPUS is 8096, it is still far from overflow. 8096 /32 * 9 = 2277 as 2277 < 4096 While NR_CPUS gets to ~14500, for a 4KB page, xxx_cpus will overflow. But I don't know when hardware will reach there. If we reach there, the existing code to describe topology and schedule tasks might also need rework. On the other hand, lscpu, hwloc, numactl etc are using the existing hex bitmap ABI: $ strace lscpu 2>&1 | grep topology faccessat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/topology/thread_siblings", F_OK) = 0 openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/topology/thread_siblings", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/topology/core_siblings", O_RDONLY|O_CLOEXEC) = 3 faccessat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/topology/book_siblings", F_OK) = -1 ENOENT (No such file or directory) faccessat(AT_FDCWD, "/sys/devices/system/cpu/cpu1/topology/thread_siblings", F_OK) = 0 openat(AT_FDCWD, ... $ strace numactl --hardware 2>&1 | grep cpu openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3 openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3 openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3 openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3 If we move to binary, it means we have to change those applications. For this moment, I'd argue we keep cpu bitmap ABI as is and defer this issue to when we really get so many cores. Thanks Barry