Re: [PATCH v4 0/5] x86: fix hang when AP bringup is too slow

On Mon, 14 Apr 2014 17:11:12 +0200
Igor Mammedov <imammedo@xxxxxxxxxx> wrote:

> changes since v3:
>  * put simple bugfixes first
>  * move common part of syncing with master CPU in cpu_init()
>    for x32/64 variants into a helper function
>  * cpu_init(): WARN_ON if cpu_initialized_mask is set
>  * fix panic on CPU unplug, caused by erroneous removal
>    of "pr->dev = dev;" in drivers/acpi/acpi_processor.c
Hi guys,

It seems there won't be any more comments on this series.
Could you review it, please?

> 
> --
> A hang is observed on virtual machines during CPU hotplug,
> especially in big guests with many CPUs. (It happens more
> often if the host is over-committed.)
> 
> The hang happens because the master CPU times out while
> waiting for the AP to boot and 'cancels' the CPU online
> operation, assuming the AP is not functional. The AP may
> nevertheless keep running and later cause various hangs or
> panics in the running kernel, which assumes that the AP is
> offline.
> 
> This is an alternative approach: instead of canceling an
> in-progress AP bringup (https://lkml.org/lkml/2014/3/6/257),
> it removes the timeouts so that AP bringup is not affected by
> poor timing, and syncs the AP with the master CPU at early
> startup, making sure that the AP won't run wild if the master
> CPU doesn't expect it to come online.
> 
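To make the idea of the sync easier to follow, here is a minimal
userspace sketch of such a handshake. It is not the code from the
patches: struct bringup, ap_announce(), ap_may_initialize() and
master_commit() are hypothetical names made up for illustration, and
C11 atomics stand in for whatever the kernel actually uses. The only
property it tries to show is that an AP which arrives after the master
has given up stays parked instead of initializing itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical handshake state for one AP; not a kernel structure. */
struct bringup {
    atomic_bool ap_arrived;    /* set by the AP as its first action */
    atomic_bool master_waits;  /* set by the master once it commits */
};

static void bringup_init(struct bringup *b)
{
    atomic_init(&b->ap_arrived, false);
    atomic_init(&b->master_waits, false);
}

/* AP side, very early startup: announce arrival and do nothing else. */
static void ap_announce(struct bringup *b)
{
    atomic_store(&b->ap_arrived, true);
}

/* AP side: polled in a loop; the AP stays parked while this is false. */
static bool ap_may_initialize(struct bringup *b)
{
    return atomic_load(&b->master_waits);
}

/*
 * Master side: called once, when the master decides whether to go on.
 * If the AP has announced itself, the master commits and from then on
 * is assumed to wait without a timeout. Otherwise it gives up and never
 * sets master_waits, so a late AP remains parked.
 */
static bool master_commit(struct bringup *b)
{
    if (!atomic_load(&b->ap_arrived))
        return false;
    atomic_store(&b->master_waits, true);
    return true;
}

/* Usage: both orderings, run sequentially so the result is deterministic. */
static void bringup_demo(void)
{
    struct bringup b;

    bringup_init(&b);
    ap_announce(&b);                 /* AP shows up in time */
    assert(master_commit(&b));
    assert(ap_may_initialize(&b));

    bringup_init(&b);
    assert(!master_commit(&b));      /* master gave up first */
    ap_announce(&b);                 /* AP shows up late */
    assert(!ap_may_initialize(&b));  /* and stays parked */
}
```

Calling bringup_demo() from a main() exercises both cases. The sketch
runs the two sides one after the other on purpose; a real AP and master
run concurrently, which is why the flags are atomic.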
> The series also fixes 3 bugs found while testing the CPU
> bringup failure case.
> 
> --
> Below is a detailed description of the more frequent hang:
> ---
> The master CPU may time out before cpu_callin_mask is set and
> cancel booting the CPU, but the CPU being onlined still continues
> to boot, sets its bit in cpu_active_mask (CPU_STARTING notifiers)
> and spins in check_tsc_sync_target() waiting for the master CPU
> to arrive. A following attempt to online another CPU hangs in
> stop_machine, initiated from here:
> smp_callin ->
>   smp_store_cpu_info ->
>     identify_secondary_cpu ->
>       mtrr_ap_init -> set_mtrr_from_inactive_cpu
> 
> stop_machine waits for completion of stop_work on all CPUs in
> cpu_active_mask, including the failed CPU that spins in
> check_tsc_sync_target().
> 
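The hang condition described above can be written down as a toy model.
This is only an illustration under simplifying assumptions: a 64-bit
integer stands in for a cpumask, and cpu_bit(), stuck_cpus() and
stop_completes() are hypothetical helpers, not kernel functions. It
shows why one CPU that is marked active but never runs its stop work is
enough to block the whole operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a cpumask: one bit per CPU, CPUs 0..63 only. */
typedef uint64_t cpu_mask_t;

static cpu_mask_t cpu_bit(unsigned int cpu)
{
    return (cpu_mask_t)1 << cpu;
}

/* CPUs that are marked active but will never run the queued stop work. */
static cpu_mask_t stuck_cpus(cpu_mask_t active, cpu_mask_t responsive)
{
    return active & ~responsive;
}

/* The operation completes only if every active CPU responds. */
static bool stop_completes(cpu_mask_t active, cpu_mask_t responsive)
{
    return stuck_cpus(active, responsive) == 0;
}

static void hang_demo(void)
{
    /*
     * CPU 0 is the master. CPU 1 is the abandoned AP: it has set its
     * active bit but is spinning and will not run any stop work.
     */
    cpu_mask_t active = cpu_bit(0) | cpu_bit(1);
    cpu_mask_t responsive = cpu_bit(0);

    assert(!stop_completes(active, responsive));
    assert(stuck_cpus(active, responsive) == cpu_bit(1));

    /* Had the AP never been marked active, the wait would finish. */
    assert(stop_completes(cpu_bit(0), responsive));
}
```

Calling hang_demo() from a main() checks the scenario: the wait can
only finish if the spinning AP is absent from the active mask.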
> 
> Igor Mammedov (5):
>   x86: fix list corruption on CPU hotplug
>   x86: fix memory corruption in acpi_unmap_lsapic()
>   acpi_processor: do not mark present at boot but not onlined CPU as
>     onlined
>   x86: log error on secondary CPU wakeup failure at ERR level
>   x86: initialize secondary CPU only if master CPU will wait for it
> 
>  arch/x86/kernel/cpu/common.c  |   27 ++++++----
>  arch/x86/kernel/smpboot.c     |  103 ++++++++++++----------------------------
>  drivers/acpi/acpi_processor.c |    1 -
>  3 files changed, 47 insertions(+), 84 deletions(-)
> 


-- 
Regards,
  Igor