On Fri, 2024-08-02 at 16:57 -0700, Paul E. McKenney wrote:
> > I signal a pthread condition variable, which calls:
> >
> > - gettid()
> > - futex()
> >
> > and madvise(MADV_DONTNEED) (I believe this comes from tcmalloc)
> >
> > you are opening up new horizons for me and I think you are right.
> > If the nohz_full thread does not enter the kernel, it cannot
> > interfere with your nohz_full setup.
>
> Exactly!
>
> One approach is to have real-time threads use lockless shared-memory
> mechanisms to communicate asynchronously with non-real-time threads.
> The Linux-kernel llist is one example of such a mechanism.
>
> Many projects do all their allocation during initialization, thus
> avoiding run-time issues from the memory allocator.  Some use mlock()
> or mlockall() at initialization time to avoid page faults.

I want to bring this discussion to closure because I feel like I am
spamming your list with off-topic stuff, but my NOHZ_FULL issue has had
a happy ending, so I want to share the outcome since you have helped me
a lot!

mix_interrupt_randomness() was only a symptom. The only thing that can
make the random char device mix interrupt randomness is a CPU actually
receiving interrupts. It turns out that some file I/O was being
initiated, without my awareness, from cpu0 or cpu1.

I got a quick crash course from Jens Axboe on nvme device driver
anatomy. The driver creates 2 I/O request queues and splits the cpus
into 2 balanced groups. I am not sure exactly why irqs were still being
sent to different cpus, since I provide irqaffinity=3 in my boot params
line, but the solution has been to hack the nvme driver code in
drivers/nvme/host/pci.c:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..70b7ca84ee21 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2299,7 +2299,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return 1 + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

The end result is that my nvme device now has only 1 I/O queue managing
requests coming from all 4 cpus, and its irqs are all sent to cpu3!

About eliminating syscalls: it will not be necessary, but I am keeping
this new trick in my bag of tricks... I might need it someday...

That being said, turning my attention to this has made me realize that
I was using tcmalloc untweaked, and I can squeeze much more performance
out of it with minimal effort:

1. It uses Transparent Huge Pages and recommends some system settings
   for optimal performance that I have not set yet (sketched at the
   end of this message).
2. It offers a hook to offload memory housekeeping to a background
   thread (including the madvise(MADV_DONTNEED) syscall, I suppose).

This is clearly something I will look into that I would not have
thought of looking at without your suggestion!

All in all, it is an amazing experience to be a Linux kernel user! I
now have practically 2 CPUs out of 4 that run at 100% with almost no
interrupts... this is a very cool and satisfying achievement!

If I stare long enough at 'watch -n1 cat /proc/interrupts', I see
occasional CAL and TLB entries (TLB shootdowns are a totally new
concept to me):

CAL:       1726        803        784     300814   Function call interrupts
TLB:         44          0         43          0   TLB shootdowns

Something else to understand and master another day...
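Coming back to the lockless shared-memory suggestion at the top of this
message, below is a minimal userspace sketch of that pattern, assuming
C11 atomics. All the names here (struct msg, rt_publish(),
housekeeping_drain()) are hypothetical; the push/drain pair mimics the
kernel's llist_add()/llist_del_all(), and the mlockall() call reflects
the init-time page-locking idea:

#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>

struct msg {
	struct msg *next;
	int payload;
};

/* Shared push point, like the head of a kernel llist. */
static _Atomic(struct msg *) head;

/* Real-time side: lock-free push that never enters the kernel. */
static void rt_publish(struct msg *m)
{
	struct msg *old = atomic_load_explicit(&head, memory_order_relaxed);

	do {
		m->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&head, &old, m,
			memory_order_release, memory_order_relaxed));
}

/* Non-real-time side: take the whole list in one shot, like
 * llist_del_all().  The drained list comes back in LIFO order. */
static struct msg *housekeeping_drain(void)
{
	return atomic_exchange_explicit(&head, NULL, memory_order_acquire);
}

int main(void)
{
	/* Lock current and future pages at init time so the real-time
	 * threads never take a page fault later on. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return 1;

	/* ... preallocate every struct msg here, then start threads ... */
	return 0;
}

The non-real-time thread still pays for whatever syscalls it makes, but
the real-time side stays out of the kernel entirely.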
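And on the tcmalloc Transparent Huge Pages point, if I recall its
tuning guide correctly, the recommended settings are along these lines
(worth double-checking against the docs of the tcmalloc version in
use):

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none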
> > I need to take a break from this project to take care of other
> > stuff that I have neglected while being absorbed by this
> > never-ending challenge... but I'll definitely return to it with
> > this new angle of attack...
> >
> > with so few syscalls, it seems feasible to avoid them in one way
> > or another.
> >
> > at some point, it might be much easier to avoid the kernel than to
> > fight with it to make it do what you want.
>
> These things go on a syscall-by-syscall basis, and much depends on
> your deadlines.  For example, if you had one-millisecond deadlines,
> those 27-microsecond interrupts might not be a concern.  ;-)

This is for a small crypto arbitrage trading client project of mine;
the trader that reacts the fastest wins... 27 usec is a big deal. It is
the difference between winning and losing a trade:
https://x.com/lano1106/status/1771345949320737183