Re: [PATCH v3 2/2] arm64: add micro test

Shih-Wei Li <shihwei@xxxxxxxxxxxxxxx> · Tue, 23 Jan 2018 13:48:43 -0500

On Mon, Jan 22, 2018 at 3:48 AM, Andrew Jones <drjones@xxxxxxxxxx> wrote:
> On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
>> Thanks for the feedback about the mistakes in math and some issues in
>> naming, print msg, and coding style. I'll be careful and try to avoid
>> the same problems the next patch set. Sorry for all of the confusion.
>>
>> So we now skip the test when "sample == 0" happens over 1000 times.
>> This is only due to the case that "cost is < 1/cntfrq" since it's not
>
> Since we know the reason it happens, then why don't we just print that
> message so the user can know it too?

Yes, we can print out some info when the counter overflows.

>
>> possible for the tick to overflow for that many times. Did I miss
>> something here? I do agree that we should output better msgs to tell
>> users that the cost of a certain test is constantly smaller than a
>> tick.
>
> Yes
>
>>
>> Also, because ticks overflow should rarely happen, I wonder if we can
>> simply ignore the data set if it happens? If not, any thoughts on how
>> to efficiently distinguish one case from the other?
>
> If we get tons of zeros, then the test is faster than 1/cntfrq, print it.
> If we just get one, possibly two zeros, then it was overflow, skip it.

Yes.

>
>>
>> Thinking of the time granularity used in the output, given there's
>> likely difference in hardware implementation as previously mentioned,
>> should we output the cost in ticks (along with frequency) like I did
>> in v1 & v2 instead, allowing the users to do the translation to the exact
>> time based on the output?
>
> Why? If it's possible for the user to do the math, then it's possible for
> the test to do the math. It's easy to do, so why put the burden on the
> user? People don't think in terms of timer ticks, they think in terms of
> time.
>

I was unsure about the right granularity (microseconds vs picoseconds)
to use. I can fix the math and output in microseconds in v4 if it
sounds good to you guys?

Thanks,
Shih-Wei

> Thanks,
> drew
>
>>
>> Thanks again,
>> Shih-Wei
>>
>> On Thu, Jan 18, 2018 at 6:39 AM, Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> wrote:
>> > On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
>> >> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
>> >> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
>> >> > > Here we provide the support for measuring various micro level
>> >> > > operations on arm64. We iterate each of the tests and output
>> >> > > their average, minimum and maximum cost in microseconds.
>> >> > > Instruction barriers were used before taking timestamps to
>> >> > > avoid out-of-order execution or pipelining from skewing our
>> >> > > measurements.
>> >> > >
>> >> > > The operations we currently supported and measured are mostly
>> >> > > self-explanatory by their function name. For IPI, we measured the
>> >> > > cost of sending IPI from a source VCPU to a target VCPU, until the
>> >> > > target VCPU receives the IPI.
>> >> > >
>> >> > > Signed-off-by: Shih-Wei Li <shihwei@xxxxxxxxxxxxxxx>
>> >> > > Signed-off-by: Christoffer Dall <cdall@xxxxxxxxxxxxxxx>
>> >> > > ---
>> >> > >  arm/Makefile.common |   1 +
>> >> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> > >  arm/unittests.cfg   |   7 ++
>> >> > >  3 files changed, 260 insertions(+)
>> >> > >  create mode 100644 arm/micro-test.c
>> >> > >
>> >> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
>> >> > > index 0a039cf..c7d5c27 100644
>> >> > > --- a/arm/Makefile.common
>> >> > > +++ b/arm/Makefile.common
>> >> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>> >> > >  tests-common += $(TEST_DIR)/gic.flat
>> >> > >  tests-common += $(TEST_DIR)/psci.flat
>> >> > >  tests-common += $(TEST_DIR)/sieve.flat
>> >> > > +tests-common += $(TEST_DIR)/micro-test.flat
>> >> > >
>> >> > >  tests-all = $(tests-common) $(tests)
>> >> > >  all: directories $(tests-all)
>> >> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
>> >> > > new file mode 100644
>> >> > > index 0000000..407ce8b
>> >> > > --- /dev/null
>> >> > > +++ b/arm/micro-test.c
>> >> > > @@ -0,0 +1,252 @@
>> >> > > +/*
>> >> > > + * Measure the cost of micro level operations.
>> >> > > + *
>> >> > > + * Copyright Columbia University
>> >> > > + * Author: Shih-Wei Li <shihwei@xxxxxxxxxxxxxxx>
>> >> > > + * Author: Christoffer Dall <cdall@xxxxxxxxxxxxxxx>
>> >> > > + *
>> >> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
>> >> > > + */
>> >> > > +#include <asm/gic.h>
>> >> > > +#include "libcflat.h"
>> >> > > +#include <util.h>
>> >> > > +#include <limits.h>
>> >> > > +
>> >> > > +static volatile bool ipi_received;
>> >> > > +static volatile bool ipi_ready;
>> >> > > +static volatile unsigned int cntfrq;
>> >> > > +static volatile void *vgic_dist_addr;
>> >> >
>> >> > Volatiles considered harmful for kernel code. I think it is also correct
>> >> > for your case:
>> >> > https://lwn.net/Articles/234017/
>> >> >
>> >> > I would prefer use readl/writel accessors for interprocessor
>> >> > communications which guaranties cache synchronization and proper
>> >> > ordering.
>> >> >
>> >> > Also, ipi_received and ipi_ready are looking like synchronization
>> >> > primitives. Did you consider using spinlocks instead?
>> >> >
>> >> > Also-also, cntfrq is written only once at init. What for you make it
>> >> > volatile?
>> >>
>> >> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
>> >> volatile globals for synchronization, trying to keep things as simple as
>> >> possible. I'm ok with the ipi_* synchronization variables being volatile,
>> >> but agree the other two don't need to be.
>> >>
>> >> >
>> >> > > +void (*write_eoir)(u32 irqstat);
>> >> > > +
>> >> > > +#define IPI_IRQ          1
>> >> > > +
>> >> > > +#define TRIES    (1U << 28)
>> >> > > +
>> >> > > +#define MAX_FAILURES     1000
>> >> > > +
>> >> > > +/*
>> >> > > + * The counter may not always start with zero, which means it could
>> >> > > + * overflow after some period of time.
>> >> > > + */
>> >> > > +#define COUNT(c1, c2) \
>> >> > > + ((c1) > (c2) ? 0 : (c2) - (c1))
>> >> > > +
>> >> > > +static uint64_t read_cc(void)
>> >> > > +{
>> >> > > + isb();
>> >> > > + return read_sysreg(cntpct_el0);
>> >> > > +}
>> >> > > +
>> >> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> >> > > +{
>> >> > > + u32 ack;
>> >> > > + ipi_ready = false;
>> >> > > + ipi_received = true;
>> >> > > + ack = gic_read_iar();
>> >> > > + gic_write_eoir(ack);
>> >> > > + ipi_ready = true;
>> >> > > +}
>> >> > > +
>> >> > > +static void ipi_test_secondary_entry(void *data __unused)
>> >> > > +{
>> >> > > + enum vector v = EL1H_IRQ;
>> >> > > + install_irq_handler(v, ipi_irq_handler);
>> >> > > +
>> >> > > + gic_enable_defaults();
>> >> > > +
>> >> > > + local_irq_enable(); /* Enter small wait-loop */
>> >> > > + ipi_ready = true;
>> >> > > + while (true);
>> >> > > +}
>> >> > > +
>> >> > > +static int test_init(void)
>> >> > > +{
>> >> > > + int v;
>> >> > > +
>> >> > > + v = gic_init();
>> >> > > + if (v == 2) {
>> >> > > +         vgic_dist_addr = gicv2_dist_base();
>> >> > > +         write_eoir = gicv2_write_eoir;
>> >> > > + } else if (v == 3) {
>> >> > > +         vgic_dist_addr = gicv3_dist_base();
>> >> > > +         write_eoir = gicv3_write_eoir;
>> >> > > + } else {
>> >> > > +         printf("No supported gic present, skipping tests...\n");
>> >> > > +         return 0;
>> >> > > + }
>> >> > > +
>> >> > > + ipi_ready = false;
>> >> > > +
>> >> > > + gic_enable_defaults();
>> >> > > + on_cpu_async(1, ipi_test_secondary_entry, 0);
>> >> > > +
>> >> > > + cntfrq = get_cntfrq();
>> >> > > + printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
>> >> > > +
>> >> > > + return 1;
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long ipi_test(void)
>> >> > > +{
>> >> > > + unsigned int tries = TRIES;
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + while (!ipi_ready && tries--);
>> >> > > + assert(ipi_ready);
>> >> > > +
>> >> > > + ipi_received = false;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > +
>> >> > > + gic_ipi_send_single(IPI_IRQ, 1);
>> >> > > +
>> >> > > + tries = TRIES;
>> >> > > + while (!ipi_received && tries--);
>> >> > > + assert(ipi_received);
>> >> > > +
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long hvc_test(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > + asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long mmio_read_user(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > + /*
>> >> > > +  * FIXME: Read device-id in virtio mmio here. This address
>> >> > > +  * needs to be updated in the future if any relevent
>> >> > > +  * changes in QEMU test-dev are made.
>> >> > > +  */
>> >> > > + void *mmio_read_user_addr = (void*) 0x0a000008;
>> >> >
>> >> > Again, could you rename it? device_id_ptr sounds much better.
>> >> >
>> >> > > + c1 = read_cc();
>> >> > > + readl(mmio_read_user_addr);
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long mmio_read_vgic(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > + readl(vgic_dist_addr + GICD_IIDR);
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long eoi_test(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> >> > > +
>> >> > > + /* Avoid measuring assert(..) in gic_write_eoir */
>> >> > > + c1 = read_cc();
>> >> > > + write_eoir(val);
>> >> > > + c2 = read_cc();
>> >> > > +
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +struct exit_test {
>> >> > > + const char *name;
>> >> > > + unsigned long (*test_fn)(void);
>> >> > > + bool run;
>> >> > > +};
>> >> > > +
>> >> > > +static struct exit_test tests[] = {
>> >> > > + {"hvc",                hvc_test,           true},
>> >> > > + {"mmio_read_user",     mmio_read_user,     true},
>> >> > > + {"mmio_read_vgic",     mmio_read_vgic,     true},
>> >> > > + {"eoi",                eoi_test,           true},
>> >> > > + {"ipi",                ipi_test,           true},
>> >> > > +};
>> >> > > +
>> >> > > +static void get_us_output(const char *name,
>> >> > > +                   unsigned long cycles)
>> >> > > +{
>> >> > > + unsigned int ns_per_cycle = 10^9U / cntfrq;
>> >> > > + unsigned int ns, us, us_frac;
>> >> >
>> >> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
>> >> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
>> >>
>> >> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
>> >> if necessary.
>> >>
>> >> >
>> >> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
>> >> > For CPUs working at 1GHz+ it is 0. What's point in it?
>> >>
>> >> We're not looking at the CPU cycles, we're looking at counter ticks.
>> >> Indeed we could rename cycles to ticks everywhere to make that more
>> >> explicit.
>> >
>> > Ah, I don't say that counter ticks at CPU frequency on TX2, or
>> > something like that. I was probably confused by name. But switching to
>> > picoseconds, or even femtoseconds sounds reasonable anyway.
>> >
>> > Yury
>>