Hi, Rafael

> From: rjwysocki@xxxxxxxxx [mailto:rjwysocki@xxxxxxxxx] On Behalf Of Rafael J. Wysocki
> Subject: Re: [PATCH v2] ACPI / OSL: Fix rcu synchronization logic
>
> On Wed, Jan 11, 2017 at 10:54 AM, Lv Zheng <lv.zheng@xxxxxxxxx> wrote:
> > The RCU synchronization logic was originally provided to protect
> > apei_read()/apei_write(), as in the APEI drivers there is an NMI event
> > source requiring a non-spinlock-based synchronization mechanism. After
> > that, the ACPI developers decided that the FADT registers may also
> > require the same facility, so they moved the RCU code to the generic
> > ACPI layer. So now interrupt (IRQ, NMI) context ACPI map lookups are
> > protected only by RCU.
> >
> > This triggers a kernel hang when the ACPICA API starts to unmap unused
> > tables (see Link #1 for details). The cause of the hang is
> > drivers/iommu/amd_iommu_init.c calling acpi_put_table() too early and
> > then landing in RCU, which is not initialized yet:
> >   early_amd_iommu_init()
> >     acpi_put_table(ivrs_base)
> >       acpi_os_unmap_memory()
> >         synchronize_rcu_expedited()
> > Now that function goes and sends IPIs, i.e., schedule_work(), but this
> > is too early - workqueue_init() hasn't been invoked. Actually, from
> > looking at the call stack, we do
> > kernel_init_freeable()->native_smp_prepare_cpus(), and
> > workqueue_init() comes next.
> >
> > Actually this facility is only used to protect interrupt-context ACPI
> > map lookups, and such mappings are only introduced by
> > acpi_os_map_generic_address(). So before it is invoked, there is no
> > need to use RCU; a mutex should be used instead.
>
> Why should the mutex be used then?
>
> There's nothing wrong with using RCU in all cases except when that
> happens early on boot when things don't work as expected, but at that
> time there's no need to do the synchronization *at* *all* as the list
> protected by RCU is not even walked.
If the issue can be fixed at a point that is later than
early_amd_iommu_init() and earlier than all acpi_os_read/write_memory()
calls, that's great, as it means the issue can be fixed in the simplest
way. If not (meaning there is always a chance that
acpi_os_read/write_memory() can happen before early_amd_iommu_init() or
other initcalls which invoke acpi_put_table()), then we need this
workaround.

If all acpi_os_read/write_memory() calls are made from task context,
then we can just use mutex_lock() instead of rcu_read_lock() during this
stage, as all ACPI ioremaps are accessed in task context. So with the
mutex mechanism working during this stage, the WARN_ON_ONCE() you've
introduced into acpi_os_read/write_memory() can actually be converted
into WARN_ON_ONCE(!early_enough && in_interrupt()), and the early_enough
flag can be determined by the first acpi_os_map_generic_address() call.

The slightly more complicated workaround is then obtained by wrapping
rcu_read_lock()/mutex_lock() in acpi_lock_ioremap() and moving the
WARN_ON_ONCE() into it. IMO this can handle more cases than the simplest
fix, and can make us immune to similar issues in the future:
1. The ACPI subsystem can then control the ordering itself by ensuring
   that all acpi_os_map_generic_address() callers are late enough; we
   won't be messed up by external initcall ordering.
2. Some ACPI initialization steps (for example, namespace
   initialization) can be changed without the worry of triggering
   similar regressions.

Thanks and best regards
Lv