Re: [RFC QEMU 0/2] arm/virt: Account for guest pause time

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Bijan,

On Tue, Nov 06, 2018 at 04:32:27PM -0800, Bijan Mottahedeh wrote:
> This patch series address two Qemu issues:

This series should primarily go to qemu-devel (as it is a QEMU patch).

Could you please re-send the series to qemu-devel.  Keeping the kvmarm
list on cc is nice, but only a limited set of people following KVM/Arm
development is actively reviewing QEMU patches.


Thanks,

    Christoffer


> 
>   - improper system clock frequency initialization
>   - lack of pause (virtsh suspend) time accounting
> 
> A simple test to reproduce the problem executes one or more instances
> of the following command in the guest:
> 
> dd if=/dev/zero of=/dev/null &
> 
> and then pauses and resumes the guest after a certain delay:
> 
> virsh suspend <guest>        # pauses the guest
> sleep 120
> virsh resume <guest>
> 
> After the guest is resumed, there are soft lockup warning messages
> displayed on the console.
> 
> A comparison with x86 shows that hwclock and date values diverge after
> the above pause and resume sequence for x86 but remain the same for Arm.
> 
> Patch 1 intializes the system clock frequency in Qemu similar to the
> kernel.
> 
> Patch 2 accumulates the total guest pause time in QEMU and adjusts the
> virtual offset counter accordingly before the guest is resumed.
> 
> The patches have been tested on an Ampere system.  With the patches the
> time behavior is the same as x86 and the soft lockup messages go away.
> 
> 
> Clock Frequency Initialization
> ==============================
> 
> Arm v8 provides the virtual counter (cntvct), virtual counter offset
> (cntvoff), and counter frequency (cntfrq) registers for guest time
> management.
> 
> Linux Arm platform code initializes the system clock frequency from
> cntrfq_el0 register and sets the value into a statically created device
> tree (DT) node.  It is not clear why the timer device node is created
> with TIMER_OF_DECLARE().  The DT passed from Qemu to the kernel does not
> contain a timer node.
> 
> drivers/clocksource/arm_arch_timer.c:
> 
> static inline u32 arch_timer_get_cntfrq(void)
> {
>         return read_sysreg(cntfrq_el0);
> }
> 
> rate = arch_timer_get_cntfrq();
> arch_timer_of_configure_rate(rate, np);
> 
> /*
>  * For historical reasons, when probing with DT we use whichever (non-zero)
>  * rate was probed first, and don't verify that others match. If the first node
>  * probed has a clock-frequency property, this overrides the HW register.
>  */
> static void arch_timer_of_configure_rate(u32 rate, struct device_node *np)
> {
> ...
>    if (of_property_read_u32(np, "clock-frequency", &arch_timer_rate)) {
>           arch_timer_rate = rate;
> ...
> }
> 
> TIMER_OF_DECLARE(armv7_arch_timer, "arm,armv7-timer", arch_timer_of_init);
> TIMER_OF_DECLARE(armv8_arch_timer, "arm,armv8-timer", arch_timer_of_init);
> 
> 
> Linux then initializes the clock frequency to 50MHZ.
> 
> Qemu however hard codes the clock frequency to 62.5MHZ.
> 
> target/arm/cpu.h:
> 
> /* Scale factor for generic timers, ie number of ns per tick.
>  * This gives a 62.5MHz timer.
>  */
> #define GTIMER_SCALE 16
> 
> The suggested fix is to follow the kernel's arch_timer_get_cntfrq()
> approach in order to set system_clock_scale to match the kernel's idea
> of clock-frequency, rather than using a hard-coded value.
> 
> Ultimately, it seems that Qemu should construct the timer DT node and
> pass the actual clock frequency value to the kernel that way but that
> brings up an interface and backward compatibility considerations.
> Furthermore, the implications for ACPI method of probing is not clear.
> 
> 
> Pause Time Accounting
> =====================
> 
> Linux registers two clock sources, a platform-independent jiffies
> clocksource and a Arm-specific arch_sys_counter; the read interface
> for the latter reads the virtual counter register:
> 
> static struct clocksource clocksource_jiffies = {
>         .name           = "jiffies",
>         .rating         = 1, /* lowest valid rating*/
>         .read           = jiffies_read,
>         .mask           = CLOCKSOURCE_MASK(32),
>         .mult           = TICK_NSEC << JIFFIES_SHIFT, /* details above */
>         .shift          = JIFFIES_SHIFT,
>         .max_cycles     = 10,
> };
> 
> static struct clocksource clocksource_counter = {
>         .name   = "arch_sys_counter",
>         .rating = 400,
>         .read   = arch_counter_read,
>         .mask   = CLOCKSOURCE_MASK(56),
>         .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> };
> 
> arch_counter_read()
> -> arch_timer_read_counter()
>    -> arch_counter_get_cntvct()
>       -> arch_timer_reg_read_stable(cntvct_el0)
> 
> The virtual counter offset register is set from:
> 
> kvm_timer_vcpu_load()
> -> set_cntvoff()
> 
> The counter is zeroed from:
> 
> kvm_timer_vcpu_put()
> -> set_cntvoff()
> 
>         /*
>          * The kernel may decide to run userspace after calling vcpu_put, so
>          * we reset cntvoff to 0 to ensure a consistent read between user
>          * accesses to the virtual counter and kernel access to the physical
>          * counter of non-VHE case. For VHE, the virtual counter uses a fixed
>          * virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
>          */
>         if (!has_vhe())
>                 set_cntvoff(0);
> 
> The virtual counter offset is not modified anywhere however to account
> for pause time.  The suggested fix is to add pause time accounting to
> Qemu.
> 
> One potential issue is whether modifying the virtual counter offset
> breaks any assumptions, e.g., see the kvm_timer_vcpu_put() comment above.
> 
> 
> hwclock vs. date
> ================
> 
> The hwclock on the ends up in drivers/rtc/rtc-pl031.c
> 
> Real Time Clock interface for ARM AMBA PrimeCell 031 RTC
> 
> ioctl("/dev/rtc", RTC_RD_TIME, ...)
> -> rtc_dev_ioctl()
>    -> rtc_read_time()
>       -> __rtc_read_time()
>          -> pl031_read_time()
> 
> 
> The date command reads time from from a vdso page updated as follows:
> 
> irq_enter()
> -> tick_irq_enter()
>    -> tick_nohz_irq_enter()
>       -> tick_nohz_update_jiffies()
>          -> tick_do_update_jiffies64()
>             -> tick_do_update_jiffies64()
>                -> update_wall_time()
>                   -> timekeeping_advance()
>                      -> timekeeping_update()
>                         -> update_vsyscall(struct timekeepr *tk)
> 
> The struct timekeeper uses the Arm platform clocksource_counter noted above:
> 
> (gdb) p tk->tkr_mono
> $4 = {clock = 0xffff0000097ddad0 <clocksource_counter>,
>   mask = 72057594037927935, cycle_last = 271809294296, mult = 335544320,
>   shift = 24, xtime_nsec = 3456106496000000, base = 5435795172160,
>   base_real = 1539126455000000000}
> 
> This would explain why without any pause time accounting, the both
> hwclock and date command show the same time after the guest is resume
> from a pause, e.g. with the following sequence:
> 
> virsh suspend <guest>
> sleep <seconds>
> virsh resume <guest>
> 
> Bijan Mottahedeh (2):
>   arm/virt: Initialize generic timer scale factor dynamically
>   arm/virt: Account for guest pause time
> 
>  hw/arm/virt.c           | 15 +++++++++++++++
>  hw/intc/arm_gicv3_kvm.c | 39 +++++++++++++++++++++++++++++++++++++++
>  target/arm/cpu.h        |  3 +++
>  target/arm/helper.c     | 19 ++++++++++++++++---
>  target/arm/internals.h  |  8 ++++++--
>  target/arm/kvm64.c      |  1 +
>  6 files changed, 80 insertions(+), 5 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@xxxxxxxxxxxxxxxxxxxxx
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm



[Index of Archives]     [Linux KVM]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux