Hi Bijan, On Tue, Nov 06, 2018 at 04:32:27PM -0800, Bijan Mottahedeh wrote: > This patch series address two Qemu issues: This series should primarily go to qemu-devel (as it is a QEMU patch). Could you please re-send the series to qemu-devel. Keeping the kvmarm list on cc is nice, but only a limited set of people following KVM/Arm development is actively reviewing QEMU patches. Thanks, Christoffer > > - improper system clock frequency initialization > - lack of pause (virtsh suspend) time accounting > > A simple test to reproduce the problem executes one or more instances > of the following command in the guest: > > dd if=/dev/zero of=/dev/null & > > and then pauses and resumes the guest after a certain delay: > > virsh suspend <guest> # pauses the guest > sleep 120 > virsh resume <guest> > > After the guest is resumed, there are soft lockup warning messages > displayed on the console. > > A comparison with x86 shows that hwclock and date values diverge after > the above pause and resume sequence for x86 but remain the same for Arm. > > Patch 1 intializes the system clock frequency in Qemu similar to the > kernel. > > Patch 2 accumulates the total guest pause time in QEMU and adjusts the > virtual offset counter accordingly before the guest is resumed. > > The patches have been tested on an Ampere system. With the patches the > time behavior is the same as x86 and the soft lockup messages go away. > > > Clock Frequency Initialization > ============================== > > Arm v8 provides the virtual counter (cntvct), virtual counter offset > (cntvoff), and counter frequency (cntfrq) registers for guest time > management. > > Linux Arm platform code initializes the system clock frequency from > cntrfq_el0 register and sets the value into a statically created device > tree (DT) node. It is not clear why the timer device node is created > with TIMER_OF_DECLARE(). The DT passed from Qemu to the kernel does not > contain a timer node. > > drivers/clocksource/arm_arch_timer.c: > > static inline u32 arch_timer_get_cntfrq(void) > { > return read_sysreg(cntfrq_el0); > } > > rate = arch_timer_get_cntfrq(); > arch_timer_of_configure_rate(rate, np); > > /* > * For historical reasons, when probing with DT we use whichever (non-zero) > * rate was probed first, and don't verify that others match. If the first node > * probed has a clock-frequency property, this overrides the HW register. > */ > static void arch_timer_of_configure_rate(u32 rate, struct device_node *np) > { > ... > if (of_property_read_u32(np, "clock-frequency", &arch_timer_rate)) { > arch_timer_rate = rate; > ... > } > > TIMER_OF_DECLARE(armv7_arch_timer, "arm,armv7-timer", arch_timer_of_init); > TIMER_OF_DECLARE(armv8_arch_timer, "arm,armv8-timer", arch_timer_of_init); > > > Linux then initializes the clock frequency to 50MHZ. > > Qemu however hard codes the clock frequency to 62.5MHZ. > > target/arm/cpu.h: > > /* Scale factor for generic timers, ie number of ns per tick. > * This gives a 62.5MHz timer. > */ > #define GTIMER_SCALE 16 > > The suggested fix is to follow the kernel's arch_timer_get_cntfrq() > approach in order to set system_clock_scale to match the kernel's idea > of clock-frequency, rather than using a hard-coded value. > > Ultimately, it seems that Qemu should construct the timer DT node and > pass the actual clock frequency value to the kernel that way but that > brings up an interface and backward compatibility considerations. > Furthermore, the implications for ACPI method of probing is not clear. > > > Pause Time Accounting > ===================== > > Linux registers two clock sources, a platform-independent jiffies > clocksource and a Arm-specific arch_sys_counter; the read interface > for the latter reads the virtual counter register: > > static struct clocksource clocksource_jiffies = { > .name = "jiffies", > .rating = 1, /* lowest valid rating*/ > .read = jiffies_read, > .mask = CLOCKSOURCE_MASK(32), > .mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */ > .shift = JIFFIES_SHIFT, > .max_cycles = 10, > }; > > static struct clocksource clocksource_counter = { > .name = "arch_sys_counter", > .rating = 400, > .read = arch_counter_read, > .mask = CLOCKSOURCE_MASK(56), > .flags = CLOCK_SOURCE_IS_CONTINUOUS, > }; > > arch_counter_read() > -> arch_timer_read_counter() > -> arch_counter_get_cntvct() > -> arch_timer_reg_read_stable(cntvct_el0) > > The virtual counter offset register is set from: > > kvm_timer_vcpu_load() > -> set_cntvoff() > > The counter is zeroed from: > > kvm_timer_vcpu_put() > -> set_cntvoff() > > /* > * The kernel may decide to run userspace after calling vcpu_put, so > * we reset cntvoff to 0 to ensure a consistent read between user > * accesses to the virtual counter and kernel access to the physical > * counter of non-VHE case. For VHE, the virtual counter uses a fixed > * virtual offset of zero, so no need to zero CNTVOFF_EL2 register. > */ > if (!has_vhe()) > set_cntvoff(0); > > The virtual counter offset is not modified anywhere however to account > for pause time. The suggested fix is to add pause time accounting to > Qemu. > > One potential issue is whether modifying the virtual counter offset > breaks any assumptions, e.g., see the kvm_timer_vcpu_put() comment above. > > > hwclock vs. date > ================ > > The hwclock on the ends up in drivers/rtc/rtc-pl031.c > > Real Time Clock interface for ARM AMBA PrimeCell 031 RTC > > ioctl("/dev/rtc", RTC_RD_TIME, ...) > -> rtc_dev_ioctl() > -> rtc_read_time() > -> __rtc_read_time() > -> pl031_read_time() > > > The date command reads time from from a vdso page updated as follows: > > irq_enter() > -> tick_irq_enter() > -> tick_nohz_irq_enter() > -> tick_nohz_update_jiffies() > -> tick_do_update_jiffies64() > -> tick_do_update_jiffies64() > -> update_wall_time() > -> timekeeping_advance() > -> timekeeping_update() > -> update_vsyscall(struct timekeepr *tk) > > The struct timekeeper uses the Arm platform clocksource_counter noted above: > > (gdb) p tk->tkr_mono > $4 = {clock = 0xffff0000097ddad0 <clocksource_counter>, > mask = 72057594037927935, cycle_last = 271809294296, mult = 335544320, > shift = 24, xtime_nsec = 3456106496000000, base = 5435795172160, > base_real = 1539126455000000000} > > This would explain why without any pause time accounting, the both > hwclock and date command show the same time after the guest is resume > from a pause, e.g. with the following sequence: > > virsh suspend <guest> > sleep <seconds> > virsh resume <guest> > > Bijan Mottahedeh (2): > arm/virt: Initialize generic timer scale factor dynamically > arm/virt: Account for guest pause time > > hw/arm/virt.c | 15 +++++++++++++++ > hw/intc/arm_gicv3_kvm.c | 39 +++++++++++++++++++++++++++++++++++++++ > target/arm/cpu.h | 3 +++ > target/arm/helper.c | 19 ++++++++++++++++--- > target/arm/internals.h | 8 ++++++-- > target/arm/kvm64.c | 1 + > 6 files changed, 80 insertions(+), 5 deletions(-) > > -- > 1.8.3.1 > > _______________________________________________ > kvmarm mailing list > kvmarm@xxxxxxxxxxxxxxxxxxxxx > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm