Re: [PATCH RFC 00/12] x86 NUMA-aware kernel replication

On 1/25/2024 7:30 AM, Garg, Shivank wrote:
> Hi Artem,
>
>> Preliminary performance evaluation results:
>> Processor: Intel(R) Xeon(R) CPU E5-2690
>> 2 nodes with 12 CPU cores each
>>
>> fork/1 - The system call was invoked only once.
>>          The time between entering and exiting the system call was measured.
>>
>> fork/1024 - The system call was invoked in a loop 1024 times.
>>             The time between entering and exiting the loop was measured.
>>
>> mmap/munmap - A set of 1024 pages (of PAGE_SIZE, or 4096 if it is not defined)
>>               was mapped using the mmap syscall and unmapped using munmap.
>>               One page was mapped and unmapped per loop iteration.
>>
>> mmap/lock - The same as above, but with the MAP_LOCKED flag added.
>>
>> open/close - The /dev/null pseudo-file was opened and closed in a loop 1024 times,
>>              once per iteration.
>>
>> mount - The procfs pseudo-filesystem was mounted once on a temporary directory inside /tmp.
>>         The time between entering and exiting the system call was measured.
>>
>> kill - A signal handler for SIGUSR1 was set up. The signal was sent to a child process,
>>        which was created using glibc's fork wrapper. The time between sending and
>>        receiving the SIGUSR1 signal was measured.
>>
>> Hot caches:
>>
>> fork-1          2.3%
>> fork-1024       10.8%
>> mmap/munmap     0.4%
>> mmap/lock       4.2%
>> open/close      3.2%
>> kill            4%
>> mount           8.7%
>>
>> Cold caches:
>>
>> fork-1          42.7%
>> fork-1024       17.1%
>> mmap/munmap     0.4%
>> mmap/lock       1.5%
>> open/close      0.4%
>> kill            26.1%
>> mount           4.1%
>>
> I've conducted some testing on an AMD EPYC 7713 64-core processor (dual socket, 2 NUMA nodes, 64 CPUs per node) to evaluate the performance of this patchset.
> I've implemented the syscall-based test cases as suggested in your previous mail. I'm shielding the 2nd NUMA node using isolcpus and nohz_full, and executing the tests on CPUs belonging to that node.
>
> Performance Evaluation results (% gain over base kernel 6.5.0-rc5):
>
> Hot caches:
> fork-1		1.1%
> fork-1024	-3.8%
> mmap/munmap	-1.5%
> mmap/lock	-4.7%
> open/close	-6.8%
> kill		3.3%
> mount		-13.0%
>
> Cold caches:
> fork-1		1.2%
> fork-1024 	-7.2%
> mmap/munmap 	-1.6%
> mmap/lock 	-1.0%
> open/close 	4.6%
> kill 		-54.2%
> mount 		-8.5%
>
> Thanks,
> Shivank
>
Hi Shivank, thank you for the performance evaluation. Unfortunately, we don't have an AMD EPYC system right now;
I'll try to find a way to perform measurements and clarify why the results differ so much.

We are currently trying to run a performance evaluation using database-related benchmarks
and will return with the results once we have clarified this.
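
For reference, the open/close case described above can be reproduced with a loop roughly like
the sketch below. The use of clock_gettime(CLOCK_MONOTONIC) and the nanosecond reporting are
my assumptions; the original description only fixes /dev/null and the 1024 iterations.

/*
 * Minimal sketch of the open/close test described above (my interpretation,
 * not the original benchmark code): the whole 1024-iteration loop is timed
 * with clock_gettime(CLOCK_MONOTONIC).
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1024

int main(void)
{
        struct timespec start, end;
        long long ns;
        int fd, i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < ITERATIONS; i++) {
                fd = open("/dev/null", O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
             (end.tv_nsec - start.tv_nsec);
        printf("open/close x%d: %lld ns total, %lld ns/iteration\n",
               ITERATIONS, ns, ns / ITERATIONS);
        return 0;
}

The same skeleton (timestamp, loop over the syscall under test, timestamp) should carry over
to the fork, mmap/munmap, mmap/lock and mount cases.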
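
The kill case needs the send and receive timestamps visible in both processes. A minimal sketch,
assuming a MAP_SHARED anonymous mapping for exchanging the timestamps and a crude sleep() to let
the child install its handler (neither detail is specified in the description above):

/*
 * Sketch of the kill/SIGUSR1 latency test: the parent records the send time,
 * the child's signal handler records the receive time, and both timestamps
 * live in a shared anonymous mapping.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static struct timespec *stamps; /* shared: [0] = send time, [1] = receive time */

static void on_sigusr1(int sig)
{
        (void)sig;
        clock_gettime(CLOCK_MONOTONIC, &stamps[1]); /* async-signal-safe on Linux */
}

int main(void)
{
        struct sigaction sa;
        long long ns;
        pid_t pid;

        /* Shared anonymous mapping so parent and child see the same timestamps. */
        stamps = mmap(NULL, 2 * sizeof(*stamps), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (stamps == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        pid = fork();
        if (pid == 0) {
                /* Child: install the SIGUSR1 handler and wait for the signal. */
                memset(&sa, 0, sizeof(sa));
                sa.sa_handler = on_sigusr1;
                sigaction(SIGUSR1, &sa, NULL);
                pause();
                return 0;
        }

        sleep(1); /* crude synchronization: give the child time to install the handler */
        clock_gettime(CLOCK_MONOTONIC, &stamps[0]);
        kill(pid, SIGUSR1);
        waitpid(pid, NULL, 0);

        ns = (stamps[1].tv_sec - stamps[0].tv_sec) * 1000000000LL +
             (stamps[1].tv_nsec - stamps[0].tv_nsec);
        printf("kill -> SIGUSR1 delivery: %lld ns\n", ns);
        return 0;
}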

BR
