Hello Cristian,

Sorry for the delay in response. I'll leave some analysis from my side below.

On 1/29/2025 4:39 AM, Cristian Prundeanu wrote:
> Peter,
>
> Thank you for the recent scheduler rework which went into kernel 6.13.
> Here are the latest test results using mysql+hammerdb, using a
> standalone reproducer (details and instructions below).
>
> Kernel  | Runtime      | Throughput | P50 latency
> aarch64 | parameters   | (NOPM)     | (larger is worse)
> --------+--------------+------------+------------------
> 6.5     | default      | baseline   | baseline
> --------+--------------+------------+------------------
> 6.8     | default      | -6.9%      | +7.9%
>         | NO_PL NO_RTP | -1%        | +1%
>         | SCHED_BATCH  | -9%        | +10.7%
> --------+--------------+------------+------------------
> 6.12    | default      | -5.5%      | +6.2%
>         | NO_PL NO_RTP | -0.4%      | +0.1%
>         | SCHED_BATCH  | -4.1%      | +4.9%
> --------+--------------+------------+------------------
> 6.13    | default      | -4.8%      | +5.4%
>         | NO_PL NO_RTP | -0.3%      | +0.01%
>         | SCHED_BATCH  | -4.8%      | +5.4%
> --------+--------------+------------+------------------
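One note for anyone following along: my understanding is that the
NO_PL NO_RTP rows above correspond to running with the PLACE_LAG and
RUN_TO_PARITY scheduler features disabled. On a kernel with the debugfs
features file available, that can be toggled at runtime roughly as
follows (a sketch; it assumes debugfs is mounted at /sys/kernel/debug):

    # Disable PLACE_LAG and RUN_TO_PARITY for the current boot
    echo NO_PLACE_LAG > /sys/kernel/debug/sched/features
    echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features

The toggles do not survive a reboot.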
Thank you for the reproducer. I haven't tried it yet (in part due to
the slightly scary "Assumptions" section) but I managed to find a
HammerDB test bench internally that I modified to match the
configuration from the repro you shared.

The testing methodology is slightly different: the script pins mysqld
to the CPUs on the first socket and the HammerDB clients to the second,
and measures throughput (it only reports throughput out of the box;
I'll see if I can get it to report latency numbers as well).

With that out of the way, these were the preliminary results:

                                            %diff
    v6.14-rc1                               baseline
    v6.5.0 (pre-EEVDF)                      -0.95%
    v6.14-rc1 + NO_PL + NO_RTP              +6.06%

So I had myself a reproducer.

Looking at the data from "perf sched stats" [1] (modified to support
reporting with the new schedstats v17), I could see the following
difference on the mainline kernel (v6.14-rc1), default vs NO_PL + NO_RTP:

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                 :       109316,       109338
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC                                                            COUNT1        COUNT2   PCT_CHANGE   PCT_CHANGE1   PCT_CHANGE2
----------------------------------------------------------------------------------------------------
sched_yield() count                                       :        27349,         5785 |  -78.85% |
Legacy counter can be ignored                             :            0,            0 |    0.00% |
schedule() called                                         :       289265,       210475 |  -27.24% |
schedule() left the processor idle                        :        73316,        73993 |    0.92% |  ( 25.35%, 35.16% )
try_to_wake_up() was called                               :       154198,       125239 |  -18.78% |
try_to_wake_up() was called to wake up the local cpu      :        32858,        13927 |  -57.61% |  ( 21.31%, 11.12% )
total runtime by tasks on this processor (in jiffies)     :  27003017867,  27700849334 |    2.58% |
total waittime by tasks on this processor (in jiffies)    :  64285802345,  80525026945 |   25.26% |  ( 238.07%, 290.70% )
total timeslices run on this cpu                          :       190952,       132092 |  -30.82% |
----------------------------------------------------------------------------------------------------

[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@xxxxxxx/

The trend is as follows:

- Lower number of schedule() calls [-27.24%]
- Longer wait times [+25.26%]
- Slightly higher runtime across all CPUs

This is very similar to the situation with other database workloads we
had highlighted earlier, which prompted Peter to recommend SCHED_BATCH.

Using dump_python.py from [2], modified to only return pids for tasks
with "comm=mysqld", and running:

    python3 dump_python.py | while read i; do chrt -v -b --pid 0 $i; done

before starting the workload, I was able to match the performance of
SCHED_BATCH with the NO_PL + NO_RTP variant.

[2] https://lore.kernel.org/all/d3306655-c4e7-20ab-9656-b1b01417983c@xxxxxxx/

So it was back to the drawing board on why the setting in your
reproducer might not be working.
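A quick way to see which policy the server tasks actually end up with
(assuming pgrep and util-linux chrt are available, and matching on the
full command line since mysqld renames most of its worker threads) is
something along these lines:

    # Print the current scheduling policy of every thread belonging to
    # any process whose command line mentions "mysqld"
    for pid in $(pgrep -f mysqld); do
        for tid in /proc/"$pid"/task/*; do
            chrt -p "${tid##*/}"
        done
    done

If those threads report SCHED_OTHER rather than SCHED_BATCH, that would
explain why the setting has no effect.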
> A performance improvement is noticeable in kernel 6.13 over 6.12, both
> in latency and throughput. At the same time, SCHED_BATCH no longer has
> the same positive effect it had in 6.12. Disabling PLACE_LAG and
> RUN_TO_PARITY is still as effective as before. For this reason, I'd
> like to ask once again that this patch set be considered for merging
> and for backporting to kernels 6.6+.
>
> This patch set disables the scheduler features PLACE_LAG and
> RUN_TO_PARITY and moves them to sysctl.
>
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using
> EEVDF, across multiple Linux distributions, hardware architectures
> (x86_64, aarch64, amd64), and CPU generations.
>
> When weighing the relevance of various testing approaches, please keep
> in mind that mysql is a real-life workload, while the test which
> prompted the introduction of PLACE_LAG is much closer to a synthetic
> benchmark.
>
> Instructions for reproducing the above tests:
>
> 1. Code: The repro scenario that was used for this round of testing
>    can be found here: https://github.com/aws/repro-collection
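As a side note on the sysctl interface: once such knobs are in place,
the runtime toggle would presumably look something like the following
(the sysctl names here are purely illustrative; the real names are
whatever the patch set defines):

    # Hypothetical knob names, shown only to illustrate the intended usage
    sysctl -w kernel.sched_place_lag_enabled=0
    sysctl -w kernel.sched_run_to_parity_enabled=0

as opposed to the debugfs feature toggle shown earlier, with the usual
option of making it persistent via /etc/sysctl.d.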
Digging through the scripts, I found that the SCHED_BATCH setting is
done via systemd in [3], through the "CPUSchedulingPolicy" parameter.

[3] https://github.com/aws/repro-collection/blob/main/workloads/mysql/files/mysqld.service.tmpl

Going back to my setup, the script does not daemonize mysqld for
portability reasons. It runs the following:

    <root>/bin/mysqld ...
    numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ... &
    export BENCHMARK_PID=$!
    ...

$server_numactl_param holds the CPU and memory affinity settings for
mysqld_safe.

Now interestingly, if I do (version 1):

    /bin/chrt -v -b 0 <root>/bin/mysqld ...
    numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ... &
    export BENCHMARK_PID=$!
    ...

I more or less get the same results as the baseline v6.14-rc1 (weird!).
But then if I do (version 2):

    <root>/bin/mysqld ...
    numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ... &
    export BENCHMARK_PID=$!
    /bin/chrt -v -b --pid 0 $BENCHMARK_PID;
    ...

I see the performance reach the same level as with NO_PL + NO_RTP.
Following are the improvements:

                                            %diff
    v6.14-rc1                               baseline
    v6.5.0 (pre-EEVDF)                      -0.95%
    v6.14-rc1 + NO_PL + NO_RTP              +6.06%
    v6.14-rc1 + (SCHED_BATCH version 1)     +1.42%
    v6.14-rc1 + (SCHED_BATCH version 2)     +6.96%

I'm no database guy, but it looks like running mysqld_safe under
SCHED_BATCH, which later does a bunch of setup and forks, leads to
better performance. I see there is a mysqld_safe reference in your
mysql config [4], but I'm not sure how it behaves when running with
daemonize. Could you log in to your SUT and check whether a mysqld_safe
is running, and, just as a precautionary measure, put all "mysqld*"
tasks / threads under SCHED_BATCH before starting the load gen (a
minimal sketch follows below)? Thank you.

[4] https://github.com/aws/repro-collection/blob/main/workloads/mysql/files/my.cnf.tmpl

I'll keep digging to see if I find anything interesting, but in my
case, on a dual-socket 3rd Generation EPYC system (2 x 64C/128T), with
mysqld* pinned to one CCX (16 CPUs) on one socket and running HammerDB
with 64 virtual users, I see the above trends.

If you need any other information, or the preliminary changes to perf
sched stats for the new schedstats version, please do let me know. The
series will be refreshed soon with the added support and some more
features.
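To be explicit about that precautionary step, here is a minimal sketch,
under the same assumptions as the policy check earlier (pgrep and
util-linux chrt available, matching on the command line since mysqld
renames most of its worker threads):

    # Force SCHED_BATCH (priority 0) on every existing thread of every
    # process whose command line mentions "mysqld" (mysqld, mysqld_safe, ...)
    for pid in $(pgrep -f mysqld); do
        for tid in /proc/"$pid"/task/*; do
            /bin/chrt -v -b --pid 0 "${tid##*/}"
        done
    done

Threads created afterwards inherit the policy, so running this once
after the server has finished starting up, and before kicking off the
load gen, should be enough.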
> 2. Setup: I used a 16 vCPU / 32G RAM / 1TB RAID0 SSD instance as SUT,
>    running Ubuntu 22.04 with the latest updates. All kernels were
>    compiled from source, preserving the same config (as much as
>    possible) to minimize noise - in particular, CONFIG_HZ=250 was used
>    everywhere.
>
> 3. Running: To run the repro, set up a SUT machine and a LDG (loadgen)
>    machine on the same network, clone the git repo on both, and run:
>
>    (on the SUT) ./repro.sh repro-mysql-EEVDF-regression SUT --ldg=<loadgen_IP>
>    (on the LDG) ./repro.sh repro-mysql-EEVDF-regression LDG --sut=<SUT_IP>
>
>    The repro will build and test multiple combinations of kernel
>    versions and scheduler settings, and will prompt you when to reboot
>    the SUT and rerun the same command to continue the process. More
>    instructions can be found both in the repo's README and by running
>    'repro.sh --help'.
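One small suggestion on the setup side: when comparing kernels built
from different base versions, it is worth double-checking that the tick
rate really ended up the same on every boot. Something like this should
do (assuming a distro-style /boot/config-* file, or CONFIG_IKCONFIG_PROC
for the /proc/config.gz fallback):

    # Confirm the tick rate the running kernel was actually built with
    grep '^CONFIG_HZ=' /boot/config-"$(uname -r)" \
        || zcat /proc/config.gz | grep '^CONFIG_HZ='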
--
Thanks and Regards,
Prateek