Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs

Song Liu <song@xxxxxxxxxx> · Tue, 22 Nov 2022 22:06:06 -0700

On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>
> On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> > >
[...]
> > fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> > text. The bug stayed in the kernel for almost a year. None of the available
> > open source benchmarks had caught it before this specific benchmark.
>
> That doesn't mean enterprise-level testing would not have caught it, and
> enterprise distributions run on ancient kernels, so they would not catch up
> that fast. RHEL uses even more ancient kernels than SUSE, so let's consider
> where SUSE was during this regression. The fix you mentioned, commit
> 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
> The bug was introduced by commit 585948f4f695 ("x86/mm/cpa: Avoid
> the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
> around September 2018. Around September 2018, when the regression was
> committed, the most bleeding-edge Enterprise Linux kernel in the industry was
> the one in SLE15, based on v4.12, so there is no way in hell the performance
> team at SUSE, for instance, would have even come close to evaluating code with
> that regression. In fact, they wouldn't have come across it in testing until
> SLE15-SP2 on the v5.3 kernel, but by then the regression would have been fixed.

Can you refer me to one enterprise performance report with an open source
benchmark that shows a ~1% performance regression? If one is available, I am
more than happy to try it out. Note that we need some BPF programs to show
the benefit of this set. On most production hosts, network-related BPF programs
are the busiest. Therefore, single-host benchmarks will not show the benefit.

Thanks,
Song

PS: Data in [1] is full of noise:

"""
2. For each benchmark/system combination, the 1G mapping had the highest
performance for 45% of the tests, 2M for ~30%, and 4k for~20%.

3. From the average delta, among 1G/2M/4K, 4K gets the lowest
performance in all the 4 test machines, while 1G gets the best
performance on 2 test machines and 2M gets the best performance on the
other 2 machines.
"""

There is no way we can get a consistent result for a 1% performance
improvement from experiments like those.
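
To put a rough number on that, here is a back-of-the-envelope sketch. It
assumes run-to-run noise is roughly normal and uses the standard two-sample
sample-size approximation. The noise levels in the loop are made-up
illustrations, not measurements from [1], and runs_needed() is a hypothetical
helper, not part of any benchmark suite.

```python
from math import ceil
from statistics import NormalDist


def runs_needed(noise_cv, delta, alpha=0.05, power=0.8):
    """Estimate runs per configuration needed to detect a relative mean
    shift `delta` when run-to-run noise has coefficient of variation
    `noise_cv` (two-sided two-sample z-test, normal approximation)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return ceil(2 * (z * noise_cv / delta) ** 2)


if __name__ == "__main__":
    # Assumed noise levels, for illustration only.
    for cv in (0.005, 0.01, 0.03, 0.05):
        n = runs_needed(cv, delta=0.01)
        print(f"noise {cv:.1%}: about {n} runs per configuration")
```

The required number of runs grows with the square of the noise-to-effect
ratio, so once the noise is several times larger than the 1% we are looking
for, a handful of runs per configuration cannot separate the two.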

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/