On Wed, May 6, 2020 at 5:59 AM SeongJae Park <sjpark@xxxxxxxxxx> wrote:
>
> TL;DR: It was not the kernel's fault, but the benchmark program's.
>
> So, the problem is reproducible only with lebench[1].  I carefully read
> its code again.
>
> Before running the "poll big" sub test, in which the problem occurred,
> lebench executes the "context switch" sub test.  For that test, it sets
> its own CPU affinity[2] and process priority[3] to '0' and '-20',
> respectively.  However, it doesn't restore the original values even after
> the "context switch" sub test has finished.  For that reason, the
> "select big" sub test also runs bound to CPU 0 with the lowest nice value
> (that is, the highest priority).  Therefore, it can disturb the RCU
> callback thread for CPU 0, which processes the deferred deallocations of
> the sockets, and as a result it triggers the OOM.
>
> We confirmed that the problem disappears when the RCU callbacks are
> offloaded from CPU 0 using the rcu_nocbs=0 boot parameter, or simply when
> the affinity and/or priority is restored.
>
> Someone _might_ still argue that this is a kernel problem because it
> didn't occur on the old kernels prior to Al's patches.  However, the
> program could set the affinity and priority only because it had been
> granted the permission to do so.  Therefore, it would be reasonable to
> blame the system administrators rather than the kernel.
>
> So, please ignore this patchset, and apologies for the confusion.  If you
> still have some doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park

No harm done, thanks for running more tests and root-causing the issue!
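
For illustration, the "restore the affinity and/or priority" fix on the
benchmark side could look roughly like the sketch below.  This is not
lebench code (lebench is written in C).  It is a minimal Python sketch,
Linux only, and the helper name pinned_high_priority is hypothetical.  It
saves the CPU affinity and nice value, applies the temporary settings for
one sub test, and always puts the saved values back afterwards.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_high_priority(cpu=0, nice=-20):
    """Pin the calling process to one CPU at a given nice value, then restore.

    Hypothetical helper for illustration only; Linux only.
    """
    # Remember the settings the process had before the sub test.
    saved_affinity = os.sched_getaffinity(0)
    saved_nice = os.getpriority(os.PRIO_PROCESS, 0)
    try:
        os.sched_setaffinity(0, {cpu})
        try:
            os.setpriority(os.PRIO_PROCESS, 0, nice)
        except PermissionError:
            # Lowering the nice value needs privilege (e.g. CAP_SYS_NICE);
            # without it, the sketch keeps the current priority.
            pass
        yield
    finally:
        # The step the report found missing: undo both changes so that
        # later sub tests do not inherit them.
        os.sched_setaffinity(0, saved_affinity)
        os.setpriority(os.PRIO_PROCESS, 0, saved_nice)


if __name__ == "__main__":
    # Use the first CPU the process is allowed on (CPU 0 in the report).
    first_cpu = min(os.sched_getaffinity(0))
    with pinned_high_priority(cpu=first_cpu):
        pass  # the "context switch" style measurement would run here
    print(sorted(os.sched_getaffinity(0)), os.getpriority(os.PRIO_PROCESS, 0))
```

Restoring in a finally block means the original settings come back even if
the measurement raises, so a following sub test cannot end up monopolizing
CPU 0 by accident.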