On Fri, 2024-08-02 at 16:57 -0700, Paul E. McKenney wrote:
> > I signal a pthread condition variable, which calls:
> >
> > - gettid()
> > - futex()
> >
> > and madvise(MADV_DONTNEED) (I believe this comes from tcmalloc)
> >
> > you are opening up new horizons for me and I think you are right.
> > If the nohz_full thread does not enter the kernel, it cannot
> > interfere with your nohz_full setup.
>
> Exactly!
>
> One approach is to have real-time threads use lockless shared-memory
> mechanisms to communicate asynchronously with non-real-time threads.
> The Linux-kernel llist is one example of such a mechanism.
>
> Many projects do all their allocation during initialization, thus
> avoiding run-time issues from the memory allocator.  Some use mlock()
> or mlockall() at initialization time to avoid page faults.

I want to bring this discussion to closure because I feel like I am
spamming your list with off-topic stuff, but my NOHZ_FULL issue has had
a happy ending, so I want to share the outcome since you have helped me
a lot!

mix_interrupt_randomness() was only a symptom. The only thing that can
make the random char device mix interrupt randomness is a CPU actually
receiving interrupts. It turns out that some file I/O was being
initiated, without my awareness, from cpu0 or cpu1.

I got a quick crash course from Jens Axboe on nvme device driver
anatomy. The driver creates 2 I/O request queues and splits the cpus
into 2 balanced groups. I am not sure exactly why irqs were still being
sent to different cpus, since I provide irqaffinity=3 in my boot params
line, but the solution has been to hack the nvme driver code in
drivers/nvme/host/pci.c:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..70b7ca84ee21 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2299,7 +2299,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return 1 + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

The end result is that my nvme device now has only 1 I/O queue managing
requests coming from all 4 cpus, and its irqs are all sent to cpu3!

About eliminating syscalls: it will not be necessary, but I am keeping
this new trick in my bag of tricks... I might need it someday...

That being said, turning my attention to this has made me realize that
I was using tcmalloc untweaked, and I can squeeze much more performance
out of it with minimal effort:

1. It uses Transparent Huge Pages and recommends some system settings
   for optimal performance that I have not set yet (sketched at the
   end of this message).
2. It offers a hook to offload memory housekeeping to a background
   thread (including the madvise(MADV_DONTNEED) syscall, I suppose).

This is clearly something I will look into that I would not have
thought of looking at without your suggestion!

All in all, it is an amazing experience to be a Linux kernel user! I
now have practically 2 CPUs out of 4 that run at 100% with almost no
interrupts... this is a very cool and satisfying achievement!

If I stare long enough at 'watch -n1 cat /proc/interrupts', I see
occasional CAL and TLB entries (TLB shootdowns are a totally new
concept to me):

CAL:       1726        803        784     300814   Function call interrupts
TLB:         44          0         43          0   TLB shootdowns

Something else to understand and master another day...
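Coming back to the lockless shared-memory suggestion at the top of this
message, below is a minimal userspace sketch of that pattern, assuming
C11 atomics. All the names here (struct msg, rt_publish(),
housekeeping_drain()) are hypothetical; the push/drain pair mimics the
kernel's llist_add()/llist_del_all(), and the mlockall() call reflects
the init-time page-locking idea:

#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>

struct msg {
	struct msg *next;
	int payload;
};

/* Shared push point, like the head of a kernel llist. */
static _Atomic(struct msg *) head;

/* Real-time side: lock-free push that never enters the kernel. */
static void rt_publish(struct msg *m)
{
	struct msg *old = atomic_load_explicit(&head, memory_order_relaxed);

	do {
		m->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&head, &old, m,
			memory_order_release, memory_order_relaxed));
}

/* Non-real-time side: take the whole list in one shot, like
 * llist_del_all().  The drained list comes back in LIFO order. */
static struct msg *housekeeping_drain(void)
{
	return atomic_exchange_explicit(&head, NULL, memory_order_acquire);
}

int main(void)
{
	/* Lock current and future pages at init time so the real-time
	 * threads never take a page fault later on. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return 1;

	/* ... preallocate every struct msg here, then start threads ... */
	return 0;
}

The non-real-time thread still pays for whatever syscalls it makes, but
the real-time side stays out of the kernel entirely.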
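And on the tcmalloc Transparent Huge Pages point, if I recall its
tuning guide correctly, the recommended settings are along these lines
(worth double-checking against the docs of the tcmalloc version in
use):

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none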
> > I need to take a break from this project to take care of other
> > stuff that I have neglected while being absorbed by this
> > never-ending challenge... but I'll definitely return to it with
> > this new angle of attack...
> >
> > with so few syscalls, it seems feasible to avoid them in one way
> > or another.
> >
> > at some point, it might be much easier to avoid the kernel than to
> > fight with it to make it do what you want.
>
> These things go on a syscall-by-syscall basis, and much depends on
> your deadlines.  For example, if you had one-millisecond deadlines,
> those 27-microsecond interrupts might not be a concern.  ;-)

This is for a small crypto arbitrage trading client project of mine;
the trader that reacts the fastest wins... 27 usec is a big deal. It is
the difference between winning and losing a trade:
https://x.com/lano1106/status/1771345949320737183