On 1/11/19 4:57 PM, Hans van Kranenburg wrote:
> On 1/11/19 3:01 PM, Juergen Gross wrote:
>> On 11/01/2019 14:12, Hans van Kranenburg wrote:
>>> Hi,
>>>
>>> On 1/11/19 1:08 PM, Juergen Gross wrote:
>>>> Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable'
>>>> sched_clock() interface") broke Xen guest time handling across
>>>> migration:
>>>>
>>>> [ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done.
>>>> [ 187.251137] OOM killer disabled.
>>>> [ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>>> [ 187.252299] suspending xenstore...
>>>> [ 187.266987] xen:grant_table: Grant tables using version 1 layout
>>>> [18446743811.706476] OOM killer enabled.
>>>> [18446743811.706478] Restarting tasks ... done.
>>>> [18446743811.720505] Setting capacity to 16777216
>>>>
>>>> Fix that by setting xen_sched_clock_offset at resume time to ensure a
>>>> monotonic clock value.
>>>>
>>>> [...]
>>>
>>> I'm throwing around a PV domU over a bunch of test servers with live
>>> migrate now, and in between the kernel logging, I'm seeing this:
>>>
>>> [Fri Jan 11 13:58:42 2019] Freezing user space processes ... (elapsed
>>> 0.002 seconds) done.
>>> [Fri Jan 11 13:58:42 2019] OOM killer disabled.
>>> [Fri Jan 11 13:58:42 2019] Freezing remaining freezable tasks ...
>>> (elapsed 0.000 seconds) done.
>>> [Fri Jan 11 13:58:42 2019] suspending xenstore...
>>> [Fri Jan 11 13:58:42 2019] ------------[ cut here ]------------
>>> [Fri Jan 11 13:58:42 2019] Current state: 1
>>> [Fri Jan 11 13:58:42 2019] WARNING: CPU: 3 PID: 0 at
>>> kernel/time/clockevents.c:133 clockevents_switch_state+0x48/0xe0
>>> [...]
>>>
>>> This always happens on every *first* live migrate that I do after
>>> starting the domU.
>>
>> Yeah, its a WARN_ONCE().

Ok, false alarm. It's there, but not caused by this patch.
I changed the WARN_ONCE to WARN for funs, and now I get it a lot more already (v2):

https://paste.debian.net/plainh/d535a379

>> And you didn't see it with v1 of the patch?
>
> No.

I was wrong. I tried a bit more, and I can also reproduce without v1 or v2 patch at all, and I can reproduce it with v4.19.9. Just sometimes needs a dozen times more live migrating it before it shows up. I cannot make it happen to show up with the Debian 4.19.9 distro kernel, that's why it was new for me.

So, let's ignore it in this thread now.

>> On the first glance this might be another bug just being exposed by
>> my patch.
>>
>> I'm investigating further, but this might take some time. Could you
>> meanwhile verify the same happens with kernel 5.0-rc1? That was the
>> one I tested with and I didn't spot that WARN.
>
> I have Linux 5.0-rc1 with v2 on top now, which gives me this on live
> migrate:
>
> [...]
> [ 51.871076] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000098
> [ 51.871091] #PF error: [normal kernel read fault]
> [ 51.871100] PGD 0 P4D 0
> [ 51.871109] Oops: 0000 [#1] SMP NOPTI
> [ 51.871117] CPU: 0 PID: 36 Comm: xenwatch Not tainted 5.0.0-rc1 #1
> [ 51.871132] RIP: e030:blk_mq_map_swqueue+0x103/0x270
> [...]

Dunno about all the 5.0-rc1 crashes yet. I can provide more feedback about that if you want, but not in here I presume.

Hans
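
For reference, the timestamp jump in the quoted commit message (from [ 187.266987] to [18446743811.706476]) looks like a negative nanosecond count printed as an unsigned 64-bit value. The sketch below only does that arithmetic on the two logged timestamps; the helper name as_signed_ns is hypothetical, and it assumes the printed value is a 64-bit nanosecond counter shown with microsecond precision. It is not a claim about what the kernel code does internally.

```python
NSEC_PER_SEC = 1_000_000_000
U64_MODULUS = 1 << 64  # a 64-bit unsigned counter wraps at 2**64


def as_signed_ns(printk_seconds: str) -> int:
    """Reinterpret a 'seconds.microseconds' log timestamp as signed 64-bit ns.

    Hypothetical helper for illustration only.
    """
    whole, frac = printk_seconds.split(".")
    # Pad the fractional part to nanoseconds (logs print microseconds).
    ns = int(whole) * NSEC_PER_SEC + int(frac.ljust(9, "0"))
    # Values at or above 2**63 are negative in two's complement.
    return ns - U64_MODULUS if ns >= (1 << 63) else ns


before = as_signed_ns("187.266987")
after = as_signed_ns("18446743811.706476")

print(before)  # 187266987000
print(after)   # -262003075616
```

Read that way, the clock after resume sits at roughly -262 s rather than at some huge positive uptime, which matches the non-monotonic jump the commit message describes.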