On Thu, 11 Aug 2022, John Stultz wrote:

> On Thu, Aug 11, 2022 at 11:57 AM John Stultz <jstultz@xxxxxxxxxx> wrote:
> >
> > On Thu, Aug 11, 2022 at 10:57 AM John Kacur <jkacur@xxxxxxxxxx> wrote:
> > > On Thu, 28 Jul 2022, John Stultz wrote:
> > >
> > > > Using cyclictest without specifying affinity via -a, I was
> > > > noticing a strange issue where the rt threads were not
> > > > migrating when being blocked.
> > > >
> > > > After lots of debugging in the kernel, I found it's actually an
> > > > issue with cyclictest.
> > > >
> > > > When using -t, there is no behavioral difference between
> > > > specifying -a and not specifying -a.
> > > >
> > > > This can be confirmed by adding printf messages around the
> > > > pthread_setaffinity_np() call in the threadtest function.
> > > >
> > > > Currently:
> > > >
> > > > root@localhost:~/rt-tests# ./cyclictest -t -a -q -D1
> > > > Affining thread 0 to cpu: 0
> > > > Affining thread 1 to cpu: 1
> > > > Affining thread 2 to cpu: 2
> > > > Affining thread 3 to cpu: 3
> > > > Affining thread 4 to cpu: 4
> > > > Affining thread 5 to cpu: 5
> > > > Affining thread 7 to cpu: 7
> > > > Affining thread 6 to cpu: 6
> > > > T: 0 (15034) P: 0 I:1000 C: 1000 Min: 82 Act: 184 Avg: 180 Max: 705
> > > > ...
> > > >
> > > > root@localhost:~/rt-tests# ./cyclictest -t -q -D1
> > > > Affining thread 0 to cpu: 0
> > > > Affining thread 1 to cpu: 1
> > > > Affining thread 2 to cpu: 2
> > > > Affining thread 3 to cpu: 3
> > > > Affining thread 4 to cpu: 4
> > > > Affining thread 5 to cpu: 5
> > > > Affining thread 6 to cpu: 6
> > > > Affining thread 7 to cpu: 7
> > > > T: 0 (15044) P: 0 I:1000 C: 1000 Min: 74 Act: 144 Avg: 162 Max: 860
> > > > ...
> > > >
> > > > This issue seems to come from the logic in process_options():
> > > >
> > > > 	/* if smp wasn't requested, test for numa automatically */
> > > > 	if (!smp) {
> > > > 		numa = numa_initialize();
> > > > 		if (setaffinity == AFFINITY_UNSPECIFIED)
> > > > 			setaffinity = AFFINITY_USEALL;
> > > > 	}
> > > >
> > > > Here, by setting setaffinity = AFFINITY_USEALL, we effectively
> > > > pin each thread to its respective cpu, the same as with the
> > > > "-a" option.
> > > >
> > > > This was most recently introduced in commit bdb8350f1b0b
> > > > ("Revert "cyclictest: Use affinity_mask for steering
> > > > thread placement"").
> > > >
> > > > This seems erroneous to me, so I wanted to share this patch,
> > > > which removes the overriding of AFFINITY_UNSPECIFIED with
> > > > AFFINITY_USEALL by default, along with some additional tweaks
> > > > to preserve the existing numa allocation affinity.
> > > >
> > > > With this patch, we no longer call pthread_setaffinity_np() in the
> > > > "./cyclictest -t -q -D1" case.
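A minimal sketch of the printf instrumentation described above, for anyone
wanting to reproduce the observation. The helper name affine_thread and its
arguments are hypothetical; cyclictest's actual threadtest code is
structured differently:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Hypothetical helper: log the placement, then pin the calling
     * thread to a single cpu, as cyclictest does per thread. */
    static void affine_thread(int tnum, int cpu)
    {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            printf("Affining thread %d to cpu: %d\n", tnum, cpu);
            if (pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask))
                    fprintf(stderr, "setaffinity failed for thread %d\n", tnum);
    }

If the message prints even without -a, the implicit AFFINITY_USEALL
override described above is in effect.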
> > > >
> > > > Cc: John Kacur <jkacur@xxxxxxxxxx>
> > > > Cc: Connor O'Brien <connoro@xxxxxxxxxx>
> > > > Cc: Qais Yousef <qais.yousef@xxxxxxx>
> > > > Signed-off-by: John Stultz <jstultz@xxxxxxxxxx>
> > > > ---
> > > > v2: Fixes error passing cpu = -1 to rt_numa_numa_node_of_cpu()
> > > > ---
> > > >  src/cyclictest/cyclictest.c | 10 ++++++----
> > > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/src/cyclictest/cyclictest.c b/src/cyclictest/cyclictest.c
> > > > index decea78..82759d1 100644
> > > > --- a/src/cyclictest/cyclictest.c
> > > > +++ b/src/cyclictest/cyclictest.c
> > > > @@ -1270,8 +1270,6 @@ static void process_options(int argc, char *argv[], int max_cpus)
> > > >  	/* if smp wasn't requested, test for numa automatically */
> > > >  	if (!smp) {
> > > >  		numa = numa_initialize();
> > > > -		if (setaffinity == AFFINITY_UNSPECIFIED)
> > > > -			setaffinity = AFFINITY_USEALL;
> > > >  	}
> > > >
> > > >  	if (option_affinity) {
> > > > @@ -2043,9 +2041,13 @@ int main(int argc, char **argv)
> > > >  			void *stack;
> > > >  			void *currstk;
> > > >  			size_t stksize;
> > > > +			int node_cpu = cpu;
> > > > +
> > > > +			if (node_cpu == -1)
> > > > +				node_cpu = cpu_for_thread_ua(i, max_cpus);
> > > >
> > > >  			/* find the memory node associated with the cpu i */
> > > > -			node = rt_numa_numa_node_of_cpu(cpu);
> > > > +			node = rt_numa_numa_node_of_cpu(node_cpu);
> > > >
> > > >  			/* get the stack size set for this thread */
> > > >  			if (pthread_attr_getstack(&attr, &currstk, &stksize))
> > > > @@ -2056,7 +2058,7 @@ int main(int argc, char **argv)
> > > >  				stksize = PTHREAD_STACK_MIN * 2;
> > > >
> > > >  			/* allocate memory for a stack on appropriate node */
> > > > -			stack = rt_numa_numa_alloc_onnode(stksize, node, cpu);
> > > > +			stack = rt_numa_numa_alloc_onnode(stksize, node, node_cpu);
> > > >
> > > >  			/* touch the stack pages to pre-fault them in */
> > > >  			memset(stack, 0, stksize);
> > > > --
> > > > 2.37.1.455.g008518b4e5-goog
> > > >
> > >
> > > Still thinking about this, I guess if we go with the above change it will
> > > mean that to get meaningful numa behaviour you'll need to specify -a.
> >
> > So, I guess what exactly do you mean when you say "meaningful numa behavior"?
> >
> > My instinct is that you want the memory for a thread allocated on
> > the numa node it's running on.
> > Ideally the kernel should take care of this automatically (doing local
> > node allocations and preferring the task's local node in the
> > scheduler), but if you are trying to set up the initial thread
> > distribution across all the nodes, that's a bit hard to do w/o setting
> > the affinity. So I sort of expect that if you want to align
> > allocations to numa nodes, you probably want to either explicitly set
> > the affinity, or not give any guidance and let the kernel make its
> > best choice.
> >
> > I guess there's a fairly recent SCHED_SOFT_AFFINITY, which might
> > provide a soft affinity for numa?
> >
> > > This is what happened: in order to reduce the alphabet soup of cyclictest
> > > options, we decided that if you don't specify --smp, we would test whether
> > > numa was available and go with that. Both implied -a, so we are missing
> > > a third option.
> > >
> > > Another possibility I guess would be to put back --numa, since it's
> > > starting to appear that the cure was worse than the disease. At least
> > > that's more explicit than using -a to get numa behaviour.
> >
> > This seems more intuitive to me. Maybe --numa w/o -a would only
> > affine the cpumask to nodes, rather than cpus?
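On the allocation side, the intent of the patch above is simply to resolve
a thread's cpu to its memory node and place the stack there. A rough sketch
of that behavior using the underlying libnuma calls directly (the rt-tests
wrapper names differ, and alloc_stack_on_cpu_node is a hypothetical name):

    #include <numa.h>      /* link with -lnuma; assumes numa_available() >= 0 */
    #include <string.h>

    /* Sketch: allocate a thread stack on the memory node that backs
     * the given cpu, then touch the pages to pre-fault them in.
     * Free later with numa_free(stack, stksize). */
    static void *alloc_stack_on_cpu_node(size_t stksize, int cpu)
    {
            int node = numa_node_of_cpu(cpu);
            void *stack;

            if (node < 0)
                    return NULL;
            stack = numa_alloc_onnode(stksize, node);
            if (stack)
                    memset(stack, 0, stksize);
            return stack;
    }

This is also why the patch needs a valid node_cpu even when no affinity was
requested: the allocation still has to know which node the thread will
initially run on.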
>
> Thinking about this for just a second more: this shouldn't be too
> hard. We just have to get a node->cpumask mapping; rather than affining
> thread x -> cpu x, you would affine to node_mask(node(thread_id)).
>
> thanks
> -john
>

I had time to have a closer look at this, and it works fine and isn't
causing any problems for rteval, for example, which explicitly sets -a
anyway when it calls cyclictest.

Thanks for your patch!

Signed-off-by: John Kacur <jkacur@xxxxxxxxxx>
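For illustration, a sketch of what the node_mask(node(thread_id)) idea
could look like with libnuma; this is an assumption about one possible
implementation, not code from the patch:

    #define _GNU_SOURCE
    #include <numa.h>       /* link with -lnuma; assumes numa_available() >= 0 */
    #include <pthread.h>
    #include <sched.h>

    /* Sketch: pin the calling thread to every cpu of the numa node
     * owning "cpu", rather than to the single cpu itself. */
    static int affine_to_node_of(int cpu)
    {
            struct bitmask *bm = numa_allocate_cpumask();
            cpu_set_t set;
            unsigned long i;
            int node = numa_node_of_cpu(cpu);
            int ret = -1;

            if (node >= 0 && numa_node_to_cpus(node, bm) == 0) {
                    CPU_ZERO(&set);
                    for (i = 0; i < bm->size; i++)
                            if (numa_bitmask_isbitset(bm, i))
                                    CPU_SET(i, &set);
                    ret = pthread_setaffinity_np(pthread_self(),
                                                 sizeof(set), &set);
            }
            numa_free_cpumask(bm);
            return ret;
    }

Something like this would let a revived --numa (without -a) keep
allocations node-local while still allowing the scheduler to migrate
threads between the cpus of a node.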