Re: [PATCH v2 0/2] MIPS: convert to generic entry

On Tue, 19 Oct 2021 at 16:33, Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
>
> On Tue, 19 Oct 2021, Feiyang Chen wrote:
>
> > > > Score Without Patches  Score With Patches  Performance Change  SoC Model
> > > >                 105.9               102.1               -3.6%  JZ4775
> > > >                 132.4               124.1               -6.3%  JZ4780(SMP off)
> > > >                 170.2               155.7               -8.5%  JZ4780(SMP on)
> > > >                 101.3                91.5               -9.7%  X1000E
> > > >                 187.1               179.4               -4.1%  X1830
> > > >                 324.9               314.3               -3.3%  X2000(SMT off)
> > > >                 394.6               373.9               -5.2%  X2000(SMT on)
> > > >
> > > >
> > > > Compared with v1 there are some improvements, but the performance loss
> > > > is still noticeable.
> > >
> > >  The MIPS port of Linux has always had the pride of having a particularly
> > > low syscall overhead and I'd rather we didn't lose this quality.
> >
> > Hi, Maciej,
> >
> > 1. The current trend is to use generic code, so I think this work is
> > worth it, even if there is some performance loss.
>
>  Well, a trend is not a proper justification on its own for existing code,
> and a mature one for that matter, that works.  Surely it might be for an
> entirely new port, but the MIPS port is not exactly one.
>
> > 2. We tested the performance on 5.15-rc1~rc5 and the performance
> > loss on JZ4780 (SMP off) is not so obvious (about -3%).
>
>  I've seen teams work hard to improve performance by less than 3%, so
> depending on how you look at it the loss is not necessarily small, even if
> not abysmal.  And I find the figure of almost 10% cited for another system
> even more worrisome.  Also you've written the figures are from UnixBench,
> which I suppose measures some kind of an average across various workloads.
> Can you elaborate on the methodology used by that benchmark?

Hi, Maciej,

UnixBench runs a set of tests that measure various aspects of the system's
performance:

- Dhrystone test measures the speed and efficiency of non-floating-point
  operations.
- Whetstone test measures the speed and efficiency of floating-point
  operations.
- execl Throughput test measures the number of execl() calls that can be
  performed per second.
- File Copy test measures the rate at which data can be transferred from one
  file to another, using various buffer sizes.
- Pipe Throughput test measures the number of times (per second) a process
  can write 512 bytes to a pipe and read them back (see the sketch after
  this list).
- Pipe-based Context Switching test measures the number of times two
  processes can exchange an increasing integer through a pipe.
- Process Creation test measures the number of times a process can fork and
  reap a child that immediately exits.
- Shell Scripts test measures the number of times per minute a process can
  start and reap a set of one, two, four and eight concurrent copies of a
  shell script where the shell script applies a series of transformations
  to a data file.
- System Call Overhead test measures the cost of entering and leaving the
  operating system kernel.
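
To give a feel for what these microbenchmarks look like, the Pipe Throughput
loop is conceptually something like the following (a rough sketch for
illustration only, not the actual UnixBench source; UnixBench runs the loop
against a timer rather than for a fixed iteration count):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[512] = { 0 };
    unsigned long round_trips = 0;

    if (pipe(fds) < 0)
        return 1;

    /* Write 512 bytes into the pipe and read them back, repeatedly. */
    for (int i = 0; i < 1000000; i++) {
        if (write(fds[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return 1;
        if (read(fds[0], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return 1;
        round_trips++;
    }

    printf("%lu round trips\n", round_trips);
    return 0;
}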

Of the tests above, the most affected is the System Call Overhead test, so
I'll go into more detail about how it is measured.

The System Call Overhead test counts how many sets of system calls complete
within a specified time (usually 10 seconds). By default, a set consists of
close(), getpid(), getuid(), and umask(). The resulting score is called the
"index". Specifically, the index for this test is calculated as follows:

product = log(count) - log(time / timebase), summed over all iterations
result = exp(product / iterations)
index = result / baseline * 10
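
Here "count" is the number of sets completed in one timed run. Conceptually,
the measured loop looks something like this (a rough sketch for illustration,
not the actual UnixBench source):

#include <signal.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile sig_atomic_t expired;

static void on_alarm(int sig)
{
    (void) sig;
    expired = 1;
}

int main(void)
{
    unsigned long count = 0;

    signal(SIGALRM, on_alarm);
    alarm(10);                  /* one timed run of 10 seconds */
    while (!expired) {
        close(dup(0));          /* dup() just provides a disposable fd */
        getpid();
        getuid();
        umask(022);
        count++;                /* one "set" of system calls completed */
    }
    printf("%lu sets completed\n", count);
    return 0;
}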

"timebase" and "baseline" are fixed values that are different for each test.
Scores for other tests are calculated in a similar way. The final total
score is calculated as follows (The total number of tests is "N"):

index = exp((log(result1) + log(result2) + ... + log(resultN)) / N) * 10
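
To make the arithmetic concrete, here is a small stand-alone example with
made-up numbers; the real "timebase" and "baseline" constants ship with
UnixBench and differ per test (link with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical results: three iterations of one test. */
    double counts[] = { 1.80e6, 1.90e6, 1.85e6 }; /* sets completed    */
    double times[]  = { 10.0, 10.1, 10.0 };       /* seconds per run   */
    double timebase = 10.0;                       /* per-test constant */
    double baseline = 15000.0;                    /* per-test constant */
    int iterations = sizeof(counts) / sizeof(counts[0]);
    double product = 0.0;

    /* Sum of logs over all iterations ... */
    for (int i = 0; i < iterations; i++)
        product += log(counts[i]) - log(times[i] / timebase);

    /* ... gives a geometric mean of the per-iteration rates. */
    double result = exp(product / iterations);
    double index  = result / baseline * 10.0;

    printf("per-test index = %.1f\n", index);

    /* The final total score is the same kind of geometric mean taken
     * over the per-test results of all N tests, again scaled by 10. */
    return 0;
}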

>
>  Can you tell me what the performance loss is for a cheap syscall such as
> `getuid'?  That would indicate how much is actually lost in the invocation
> overhead.

We used perf to measure the sys time of the following program on a Loongson
3A4000:

#include <unistd.h>

int main(void)
{
    /* Issue ten million cheap system calls back to back. */
    for (int i = 0; i < 10000000; i++)
        getuid();
    return 0;
}

The program takes about 1.2 seconds of sys time before the kernel is patched
and about 1.3 seconds after, i.e. roughly an 8% increase.

>
>  With that amount known, would you be able to indicate where exactly the
> performance is getting lost in generic code?  Can it be improved?

Sorry, we tried to use perf to analyze where the extra time goes, but we have
no clear answer yet, since most of the affected code is located in
__noinstr_text_start.

Thanks,
Feiyang

>
>   Maciej


