Hello,

We managed to trace one of the failing cycles. The trace is here:

https://pastebin.com/YJBrSQpJ

It seems our application is relinquishing the CPU (line 411) due to a
sys_futex call (line 350). We still don't understand why, though. We
are not very familiar with all the kernel functions.

Kind regards,

Jordan.

On 23 May 2018 at 18:19, Jordan Palacios <jordan.palacios@xxxxxxxxxxxxxxxx> wrote:
> On 23 May 2018 at 18:07, Julia Cartwright <julia@xxxxxx> wrote:
>> On Wed, May 23, 2018 at 05:43:57PM +0200, Jordan Palacios wrote:
>>> Hello,
>>>
>>> Thanks for the answers.
>>>
>>> We don't have an NVIDIA card installed in the system.
>>>
>>> We'll try isolcpus in conjunction with our cpuset setup, and we'll
>>> look into the IRQ smp_affinity settings.
>>
>> Given the spike magnitudes you are seeing, I doubt they are task
>> migration related, meaning I don't think that isolcpus will make a
>> difference.
>>
>>> These are some of the specs of the system. Let me know if you need
>>> anything else that might be relevant.
>>>
>>> Active module: Congatec conga-TS77/i7-3612QE
>>> Carrier: Connect Tech CCG008
>>> DDR3L-SODIMM-1600 (8GB)
>>> Crucial MX200 250GB mSATA SSD
>>>
>>> I have uploaded a graph with an example of our issue here:
>>>
>>> https://i.imgur.com/8KoxzNV.png
>>>
>>> In blue is the time between cycles, and in green the execution time
>>> of each loop; X is in seconds and Y in microseconds. As you can see,
>>> the execution time is quite constant until we run some I/O-intensive
>>> tasks. In this case the spikes are caused by an hdparm -tT /dev/sda.
>>> In this particular instance the spike is not an issue, since it is
>>> less than our task period.
>>
>> Interesting. Does that 2-second higher-latency window directly coincide
>> with the starting/stopping of the hdparm load?
>
> Yes. To be more precise, it coincides with the part that tests cached reads.
>
>>> The problem arises when particularly nasty spikes push us over the
>>> 1 ms limit, resulting in an overrun. Here is an example:
>>>
>>> https://i.imgur.com/77sgj3S.png
>>>
>>> So far we have only used tracing on our example application, and we
>>> haven't been able to draw any conclusions. I'll try to obtain a trace
>>> of our main update cycle when one of these spikes happens.
>>
>> This would be most helpful. The first step will be to confirm the
>> assumption that nothing else is executing on the CPU with this RT task.
>>
>> Also, keep in mind that tracing induces some overhead, so you might
>> need to adjust your threshold accordingly. I've found that most of the
>> latency issues I've debugged could be root-caused via the irq, sched,
>> and timer trace events (and maybe syscalls as well), so that's where I
>> typically start.
>>
>> It may also be worth a test with a later -rt kernel series, such as
>> 4.14-rt or even 4.16-rt, to see if you can reproduce the issue there.
>>
>> Julia
>
> Thanks, Julia. I'll look into it and report back.
>
> Kind regards,
>
> Jordan.
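
A note on the sys_futex call seen in the trace: futex is the kernel
primitive behind glibc's pthread mutexes and condition variables. When
pthread_mutex_lock() finds the lock already held, glibc falls back from
its user-space fast path to futex(FUTEX_WAIT) (FUTEX_LOCK_PI for
priority-inheritance mutexes), and the kernel blocks the caller, which
is exactly the "relinquishing the CPU" visible above. A minimal sketch,
assuming a plain non-PI mutex; this program is illustrative, not the
poster's application:

/*
 * Illustration: a contended pthread mutex is what typically surfaces
 * as sys_futex(FUTEX_WAIT) in a kernel trace.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    sleep(1);                 /* hold the lock so the other thread contends */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, holder, NULL);
    usleep(10000);            /* let the holder grab the lock first */

    /* Contended path: glibc issues the futex syscall here and the
     * scheduler switches us out until the holder unlocks. */
    pthread_mutex_lock(&lock);
    puts("acquired after blocking in sys_futex");
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}

If the blocking is unexpected, the usual suspects are locks shared with
non-RT threads or condition-variable waits hidden inside libraries the
RT loop calls into.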
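
On the isolcpus/cpuset/smp_affinity side: isolcpus= on the kernel
command line keeps the scheduler from placing other tasks on the listed
CPUs, and IRQ routing is controlled by writing a CPU mask to
/proc/irq/<N>/smp_affinity. The RT task itself still has to be pinned
and given an RT priority. A sketch, where the CPU number (3) and the
priority (80) are illustrative assumptions that depend on the actual
cmdline and cpuset layout:

/*
 * Sketch: pin the calling thread to a shielded CPU and raise it to
 * SCHED_FIFO. Requires root (or CAP_SYS_NICE) for the priority change.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);   /* assumed isolated CPU, e.g. booted with isolcpus=3 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = 80 };  /* assumed priority */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... periodic RT work runs here ... */
    return 0;
}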
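
Finally, on correlating the cycle-time graphs with a kernel trace: one
common technique is to have the periodic loop itself write into
ftrace's trace_marker file whenever a cycle overruns its budget, so the
exact spot in the trace is easy to find. A sketch of a 1 ms periodic
loop doing this; the marker path assumes debugfs is mounted at
/sys/kernel/debug (newer kernels also expose /sys/kernel/tracing), and
the real per-cycle work is elided:

/*
 * Sketch: 1 ms periodic loop using an absolute deadline; overruns are
 * flagged in the ftrace buffer via trace_marker.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NSEC_PER_SEC 1000000000LL
#define PERIOD_NS    1000000LL   /* 1 ms task period from the thread above */

static int64_t to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

int main(void)
{
    int mark = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
    struct timespec next, now;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        /* ... one control cycle's work goes here ... */

        /* Advance the absolute deadline by one period. */
        clock_gettime(CLOCK_MONOTONIC, &now);
        next.tv_nsec += PERIOD_NS;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }

        /* Flag overruns in the trace so the spike can be found later. */
        if (to_ns(&now) > to_ns(&next) && mark >= 0)
            dprintf(mark, "overrun: %lld ns past deadline\n",
                    (long long)(to_ns(&now) - to_ns(&next)));

        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}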