On Fri, Sep 01, 2023 at 09:52:54AM +0200, Andreas Herrmann wrote:
> On Fri, Sep 01, 2023 at 08:50:31AM +0200, Andreas Herrmann wrote:
> > On Fri, Aug 04, 2023 at 06:24:18PM -0700, Ricardo Neri wrote:
> > > Hi,
> >
> > Hi Ricardo,
> >
> > > This is v3 of a patchset to set the number of cache leaves
> > > independently for each CPU. v1 and v2 can be found here [1] and
> > > here [2].
> >
> > I am on CC of your patch set and glanced through it. Long ago I
> > touched related code, but now I am not up to date enough to do a
> > qualified review in this area; first I would have to look into the
> > documentation to refresh my memory, etc.
> >
> > I have not seen (or it escaped me) any information that this was
> > tested on a variety of machines that might be affected by this
> > change, and there are no Tested-by tags.
> >
> > Even if changes look simple and reasonable, they can cause issues.
> >
> > Thus, from my POV it would be good to have some information on what
> > tests were done. I am not asking you to test on all possible
> > systems; just knowing which system(s) were used for functional
> > testing is of value.
>
> Doing a good review -- trying to catch every flaw -- is really hard,
> especially when you are not actively doing development work in an
> area.
>
> For example, see
>
> commit e33b93650fc5 ("blk-iocost: Pass gendisk to ioc_refresh_params")
> [Breno Leitao <leitao@xxxxxxxxxx>, Tue Feb 28 03:16:54 2023 -0800]
>
> This fixes commit
>
> ce57b558604e ("blk-rq-qos: make rq_qos_add and rq_qos_del more
> useful") [Christoph Hellwig <hch@xxxxxx>, Fri Feb 3 16:03:54 2023
> +0100]
>
> I had reviewed the latter (see
> https://marc.info/?i=Y8plg6OAa4lrnyZZ@suselix) and the entire patch
> series. I compared the original code with the patch, walked through
> every single hunk of the diff, and tried to think it through. The
> changes made sense to me. Then came the bug report(s), and I felt
> that I had failed tremendously. To err is human.
>
> What this shows (and it is already known): with every patch, new
> errors are potentially introduced into the kernel. Functional and
> higher-level testing can help to spot them before a kernel is
> deployed in the field.
>
> At a higher level, this proves another thing: Linux kernel
> development is a transparent example of a peer-review process.
>
> In our scientific age it is often postulated that peer review is the
> way to go [1] and that it more or less guarantees that what is
> published, or rolled out, is reasonable and "works".
>
> The above example shows that this process will not catch all flaws
> and that proper, transparent, and reliable tests are required before
> anything is deployed in the field.
>
> This is true for every branch of science.
>
> If it is purposely not done, something is fishy.
>
> [1] Some even state that it is "the only way to go" and that
> everything figured out without a peer-review process is false and
> cannot be trusted. Of course, this is a false statement.

Hi Andreas,

Agreed. Testing is important. For the specific case of these patches,
I booted CONFIG_PREEMPT_RT and !CONFIG_PREEMPT_RT kernels. Then I

  a) Ensured that the splat reported in commit 5944ce092b97
     ("arch_topology: Build cacheinfo from primary CPU") was not
     observed.
  b) Ensured that /sys/devices/system/cpu/cpuX/cache is present.
  c) Ensured that the contents of /sys/devices/system/cpu/cpuX/cache
     are the same before and after my patches (a sketch of such a
     comparison follows below).
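For reference, a minimal sketch of one way to do the before/after
comparison in (c). It assumes the standard sysfs cacheinfo layout and
is only illustrative, not the exact tooling used for the tests above:

  #!/usr/bin/env python3
  # Dump every attribute under /sys/devices/system/cpu/cpu*/cache/index*/
  # so the output can be saved to a file and diffed across boots
  # (unpatched vs. patched kernel).
  import glob
  import os

  for index_dir in sorted(glob.glob(
          "/sys/devices/system/cpu/cpu*/cache/index*")):
      for attr in sorted(os.listdir(index_dir)):
          path = os.path.join(index_dir, attr)
          if not os.path.isfile(path):
              continue  # skip subdirectories such as power/
          try:
              with open(path) as f:
                  value = f.read().strip()
          except OSError:
              value = "<unreadable>"  # some attributes may not be readable
          print(path, value)

Running this once per kernel and diffing the two outputs makes any
change in attributes such as level, type, size, or shared_cpu_map
immediately visible.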
I tested on the following systems:

  * Intel Alder Lake
  * Intel Meteor Lake
  * 2-socket Intel Ice Lake server
  * 2-socket Intel Cascade Lake server
  * 2-socket Intel Skylake server
  * 4-socket Intel Broadwell server
  * 2-socket Intel Haswell server
  * 2-socket AMD Rome server
  * 2-socket AMD Milan server

Thanks and BR,
Ricardo