On Wed, 2022-01-12 at 20:47 +0100, Rafael J. Wysocki wrote:
> On Sat, Jan 8, 2022 at 4:46 AM Ricardo Neri
> <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > When hardware wants to inform the operating system about updates in
> > the HFI table, it issues a package-level thermal event interrupt. For
> > this, hardware has new interrupt and status bits in the
> > IA32_PACKAGE_THERM_INTERRUPT and IA32_PACKAGE_THERM_STATUS registers.
> > The existing thermal throttle driver already handles thermal event
> > interrupts: it initializes the thermal vector of the local APIC as
> > well as per-CPU and package-level interrupt reporting. It also
> > provides routines to service such interrupts. Extend its
> > functionality to also handle HFI interrupts.
> >
> > The frequency of the thermal HFI interrupt is specific to each
> > processor model. On some processors, a single interrupt happens as
> > soon as the HFI is enabled and hardware will never update HFI
> > capabilities afterwards. On other processors, thermal and power
> > constraints may cause thermal HFI interrupts every tens of
> > milliseconds.
> >
> > To not overwhelm consumers of the HFI data, use delayed work to
> > throttle the rate at which HFI updates are processed.
> >
> > [...]
> > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> > +{
> > +	struct hfi_instance *hfi_instance;
> > +	int cpu = smp_processor_id();
> > +	struct hfi_cpu_info *info;
> > +	u64 new_timestamp;
> > +
> > +	if (!pkg_therm_status_msr_val)
> > +		return;
> > +
> > +	info = &per_cpu(hfi_cpu_info, cpu);
> > +	if (!info)
> > +		return;
> > +
> > +	/*
> > +	 * It is possible that we get an HFI thermal interrupt on this CPU
> > +	 * before its HFI instance is initialized.

Although this code can handle this situation, you can avoid it. You can
call intel_hfi_online(cpu) before

"
	l = apic_read(APIC_LVTTHMR);
	apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
"

in thermal_throttle_online(). In the same way, call intel_hfi_offline(cpu)
after

	/* Mask the thermal vector before draining evtl. pending work */
	l = apic_read(APIC_LVTTHMR);
	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);

in thermal_throttle_offline().

> > This is not a problem. The
> > +	 * CPU that enabled the interrupt for this package will also get the
> > +	 * interrupt and is fully initialized.
> > +	 */
> > +	hfi_instance = info->hfi_instance;
> > +	if (!hfi_instance)
> > +		return;
>
> Generally, a CPU whose info has been initialized can be offline, so
> this code may run on an offline CPU.
>
> I'm not actually sure if this is a concern, but just mentioning it in
> case it is.
>
It will not matter, as the handler of the message has to deal with the
case of a CPU going offline after the message anyway, even if the CPU was
online when the message was sent. But I think we can avoid this
situation.
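To make the suggested ordering concrete, here is a minimal sketch (not
the actual therm_throt.c code; everything unrelated to the ordering is
elided):

	static int thermal_throttle_online(unsigned int cpu)
	{
		u32 l;

		/* Set up the HFI instance for this CPU first... */
		intel_hfi_online(cpu);

		/* ...and only then unmask the thermal vector. */
		l = apic_read(APIC_LVTTHMR);
		apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);

		/* ... remaining online work ... */
		return 0;
	}

	static int thermal_throttle_offline(unsigned int cpu)
	{
		u32 l;

		/* Mask the thermal vector before draining evtl. pending work */
		l = apic_read(APIC_LVTTHMR);
		apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);

		/* This CPU cannot take a thermal/HFI interrupt past this point. */
		intel_hfi_offline(cpu);

		/* ... remaining offline work ... */
		return 0;
	}

With that ordering, the per-CPU thermal vector stays masked until the HFI
instance for the CPU has been set up, and is masked again before the
instance is torn down.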
> > +
> > +	/*
> > +	 * On most systems, all CPUs in the package receive a package-level
> > +	 * thermal interrupt when there is an HFI update. It is sufficient to
> > +	 * let a single CPU to acknowledge the update and schedule work to
> > +	 * process it. The remaining CPUs can resume their work.
> > +	 */
> > +	if (!raw_spin_trylock(&hfi_instance->event_lock))
> > +		return;
> > +
> > +	/* Skip duplicated updates. */
> > +	new_timestamp = *(u64 *)hfi_instance->hw_table;
> > +	if (*hfi_instance->timestamp == new_timestamp) {
> > +		raw_spin_unlock(&hfi_instance->event_lock);
> > +		return;
> > +	}
> > +
> > +	raw_spin_lock(&hfi_instance->table_lock);
> > +
> > +	/*
> > +	 * Copy the updated table into our local copy. This includes the new
> > +	 * timestamp.
> > +	 */
> > +	memcpy(hfi_instance->local_table, hfi_instance->hw_table,
> > +	       hfi_features.nr_table_pages << PAGE_SHIFT);
> > +
> > +	raw_spin_unlock(&hfi_instance->table_lock);
> > +	raw_spin_unlock(&hfi_instance->event_lock);
> > +
> > +	/*
> > +	 * Let hardware know that we are done reading the HFI table and it is
> > +	 * free to update it again.
> > +	 */
> > +	pkg_therm_status_msr_val &= THERM_STATUS_CLEAR_PKG_MASK &
> > +				    ~PACKAGE_THERM_STATUS_HFI_UPDATED;
> > +	wrmsrl(MSR_IA32_PACKAGE_THERM_STATUS, pkg_therm_status_msr_val);
> > +
> > +	schedule_delayed_work(&hfi_instance->update_work, HFI_UPDATE_INTERVAL);
>
> AFAICS, if update_work has been scheduled already, but is not pending
> yet, the delay will be set to the current time plus
> HFI_UPDATE_INTERVAL, but shouldn't it actually run earlier in that
> case?
>
For my understanding: queue_delayed_work_on() is called as

"
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_delayed_work(cpu, wq, dwork, delay);
		ret = true;
	}
"

The pending bit is set only once, so the delay is taken from the first
call of queue_delayed_work_on() plus HFI_UPDATE_INTERVAL. On subsequent
calls of schedule_delayed_work() the delay is always with reference to
the first call.

> Also it looks like the processing introduced in the next patch can
> take quite a bit of time if there is a sufficiently large number of
> CPUs in the package, so is it suitable for system_wq in all cases?
>
Good question. What is the threshold for not using system_wq? With the
current maximum number of CPUs per package, I did experiments on busy
systems with a test function on several workqueues and checked whether
their expiry drifted. But I think we can move away from system_wq. Can
this be an add-on patch?
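As a rough sketch of what such an add-on patch could look like (names
such as hfi_updates_wq are illustrative only; this is not part of the
posted series), the driver could allocate its own ordered workqueue and
queue the update work there instead of on system_wq:

	/* Illustrative only; not part of the posted series. */
	static struct workqueue_struct *hfi_updates_wq;

	/* Allocated once, e.g. from intel_hfi_init(): */
	hfi_updates_wq = create_singlethread_workqueue("hfi-updates");
	if (!hfi_updates_wq)
		return;

	/* intel_hfi_process_event() then queues on it instead of system_wq: */
	queue_delayed_work(hfi_updates_wq, &hfi_instance->update_work,
			   HFI_UPDATE_INTERVAL);

That keeps potentially long table processing off system_wq while still
rate-limiting updates with the same HFI_UPDATE_INTERVAL delay.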
Thanks,
Srinivas

> > +}
> > +
> >  static void init_hfi_cpu_index(struct hfi_cpu_info *info)
> >  {
> >  	union cpuid6_edx edx;
> > @@ -259,8 +350,20 @@ void intel_hfi_online(unsigned int cpu)
> >
> >  	init_hfi_instance(hfi_instance);
> >
> > +	INIT_DELAYED_WORK(&hfi_instance->update_work, hfi_update_work_fn);
> > +	raw_spin_lock_init(&hfi_instance->table_lock);
> > +	raw_spin_lock_init(&hfi_instance->event_lock);
> > +
> >  	cpumask_set_cpu(cpu, hfi_instance->cpus);
> >
> > +	/*
> > +	 * Enable the hardware feedback interface and never disable it. See
> > +	 * comment on programming the address of the table.
> > +	 */
> > +	rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
> > +	msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
> > +	wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
> > +
> >  unlock:
> >  	mutex_unlock(&hfi_instance_lock);
> >  	return;
> > diff --git a/drivers/thermal/intel/intel_hfi.h b/drivers/thermal/intel/intel_hfi.h
> > index 56c6b2d75202..325aa78b745c 100644
> > --- a/drivers/thermal/intel/intel_hfi.h
> > +++ b/drivers/thermal/intel/intel_hfi.h
> > @@ -6,10 +6,12 @@
> >  void __init intel_hfi_init(void);
> >  void intel_hfi_online(unsigned int cpu);
> >  void intel_hfi_offline(unsigned int cpu);
> > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val);
> >  #else
> >  static inline void intel_hfi_init(void) { }
> >  static inline void intel_hfi_online(unsigned int cpu) { }
> >  static inline void intel_hfi_offline(unsigned int cpu) { }
> > +static inline void intel_hfi_process_event(__u64 pkg_therm_status_msr_val) { }
> >  #endif /* CONFIG_INTEL_HFI_THERMAL */
> >
> >  #endif /* _INTEL_HFI_H */
> > diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> > index 2a79598a7f7a..930e19eeeac6 100644
> > --- a/drivers/thermal/intel/therm_throt.c
> > +++ b/drivers/thermal/intel/therm_throt.c
> > @@ -619,6 +619,10 @@ void intel_thermal_interrupt(void)
> >  					PACKAGE_THERM_STATUS_POWER_LIMIT,
> >  					POWER_LIMIT_EVENT,
> >  					PACKAGE_LEVEL);
> > +
> > +		if (this_cpu_has(X86_FEATURE_HFI))
> > +			intel_hfi_process_event(msr_val &
> > +						PACKAGE_THERM_STATUS_HFI_UPDATED);
> >  	}
> >  }
> >
> > @@ -728,6 +732,12 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> >  		wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
> >  		      l | (PACKAGE_THERM_INT_LOW_ENABLE
> >  			| PACKAGE_THERM_INT_HIGH_ENABLE), h);
> > +
> > +		if (cpu_has(c, X86_FEATURE_HFI)) {
> > +			rdmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, l, h);
> > +			wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
> > +			      l | PACKAGE_THERM_INT_HFI_ENABLE, h);
> > +		}
> >  	}
> >
> >  	rdmsr(MSR_IA32_MISC_ENABLE, l, h);
> > --
> > 2.17.1
> >