Re: [PATCH v4 5/7] thermal: intel: hfi: Enable notification interrupt

Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx> · Thu, 13 Jan 2022 05:48:58 -0800

On Wed, Jan 12, 2022 at 03:26:41PM -0800, Srinivas Pandruvada wrote:
> On Wed, 2022-01-12 at 20:47 +0100, Rafael J. Wysocki wrote:
> > On Sat, Jan 8, 2022 at 4:46 AM Ricardo Neri
> > <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > > When hardware wants to inform the operating system about updates in
> > > the HFI
> > > table, it issues a package-level thermal event interrupt. For this,
> > > hardware has new interrupt and status bits in the
> > > IA32_PACKAGE_THERM_
> > > INTERRUPT and IA32_PACKAGE_THERM_STATUS registers. The existing
> > > thermal
> > > throttle driver already handles thermal event interrupts: it
> > > initializes
> > > the thermal vector of the local APIC as well as per-CPU and
> > > package-level
> > > interrupt reporting. It also provides routines to service such
> > > interrupts.
> > > Extend its functionality to also handle HFI interrupts.
> > > 
> > > The frequency of the thermal HFI interrupt is specific to each
> > > processor
> > > model. On some processors, a single interrupt happens as soon as
> > > the HFI is
> > > enabled and hardware will never update HFI capabilities afterwards.
> > > On
> > > other processors, thermal and power constraints may cause thermal
> > > HFI
> > > interrupts every tens of milliseconds.
> > > 
> > > To not overwhelm consumers of the HFI data, use delayed work to
> > > throttle
> > > the rate at which HFI updates are processed.
> > > 
> > > 
> 
> [...]
> 
> > > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> > > +{
> > > +       struct hfi_instance *hfi_instance;
> > > +       int cpu = smp_processor_id();
> > > +       struct hfi_cpu_info *info;
> > > +       u64 new_timestamp;
> > > +
> > > +       if (!pkg_therm_status_msr_val)
> > > +               return;
> > > +
> > > +       info = &per_cpu(hfi_cpu_info, cpu);
> > > +       if (!info)
> > > +               return;
> > > +
> > > +       /*
> > > +        * It is possible that we get an HFI thermal interrupt on
> > > this CPU
> > > +        * before its HFI instance is initialized.
> Although this code can handle this situation, you can avoid this.
> 
> You can call intel_hfi_online(cpu) before 
> "
> l = apic_read(APIC_LVTTHMR);
> 	apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
> "

Indeed this could work.

intel_hfi_online() also enables the HFI. It is possible to get an HFI
interrupt as soon as we enable it. I was concerned about getting the
interrupt with APIC_LVTTHMR masked and miss it. This should not be
a problem since we will get it as soon as we unmask it.

> in thermal_throttle_online()
> 
> In the same way call intel_hfi_offline(cpu)
> after
> 
> /* Mask the thermal vector before draining evtl. pending work */
> 	l = apic_read(APIC_LVTTHMR);
> 	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);
> 
> in thermal_throttle_offline()

Sure, I can do this. 

> 
> 
> > >  This is not a problem. The
> > > +        * CPU that enabled the interrupt for this package will
> > > also get the
> > > +        * interrupt and is fully initialized.
> > > +        */
> > > +       hfi_instance = info->hfi_instance;
> > > +       if (!hfi_instance)
> > > +               return;
> > 
> > Generally, a CPU whose info has been initialized can be offline, so
> > this code may run on an offline CPU.
> > 
> > I'm not actually sure if this is a concern, but just mentioning it in
> > case it is.
> > 
> It will not matter as the handler of the message should be handle case
> as CPU can go offline later after the message even if the CPU was
> offline.
> But I think we can avoid this situation.
> 
> > > +
> > > +       /*
> > > +        * On most systems, all CPUs in the package receive a
> > > package-level
> > > +        * thermal interrupt when there is an HFI update. It is
> > > sufficient to
> > > +        * let a single CPU to acknowledge the update and schedule
> > > work to
> > > +        * process it. The remaining CPUs can resume their work.
> > > +        */
> > > +       if (!raw_spin_trylock(&hfi_instance->event_lock))
> > > +               return;
> > > +
> > > +       /* Skip duplicated updates. */
> > > +       new_timestamp = *(u64 *)hfi_instance->hw_table;
> > > +       if (*hfi_instance->timestamp == new_timestamp) {
> > > +               raw_spin_unlock(&hfi_instance->event_lock);
> > > +               return;
> > > +       }
> > > +
> > > +       raw_spin_lock(&hfi_instance->table_lock);
> > > +
> > > +       /*
> > > +        * Copy the updated table into our local copy. This
> > > includes the new
> > > +        * timestamp.
> > > +        */
> > > +       memcpy(hfi_instance->local_table, hfi_instance->hw_table,
> > > +              hfi_features.nr_table_pages << PAGE_SHIFT);
> > > +
> > > +       raw_spin_unlock(&hfi_instance->table_lock);
> > > +       raw_spin_unlock(&hfi_instance->event_lock);
> > > +
> > > +       /*
> > > +        * Let hardware know that we are done reading the HFI table
> > > and it is
> > > +        * free to update it again.
> > > +        */
> > > +       pkg_therm_status_msr_val &= THERM_STATUS_CLEAR_PKG_MASK &
> > > +                                   ~PACKAGE_THERM_STATUS_HFI_UPDAT
> > > ED;
> > > +       wrmsrl(MSR_IA32_PACKAGE_THERM_STATUS,
> > > pkg_therm_status_msr_val);
> > > +
> > > +       schedule_delayed_work(&hfi_instance->update_work,
> > > HFI_UPDATE_INTERVAL);
> > 
> > AFAICS, 
> For my understanding:
> 
> > if update_work has been scheduled already,
> queue_delayed_work_on is called
> 
> >  but is not pending
> > yet, the delay will be set to the current time plus
> > HFI_UPDATE_INTERVAL, but shouldn't it actually run earlier in that
> > case?
> 
> 
> "
> 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
> work_data_bits(work))) {
> 		__queue_delayed_work(cpu, wq, dwork, delay);
> 		ret = true;
> 	}
> "
> Pending bit will be set only one time, so delay will be from 
> the first call of queue_delayed_work_on() + HFI_UPDATE_INTERVAL.
> 
> So on subsequent calls of schedule_delayed_work() the delay is always
> with reference to first call.
> 
> > 
> > Also it looks like the processing introduced in the next patch can
> > take quite a bit of time if there is a sufficiently large number of
> > CPUs in the package, so is it suitable for system_wq in all cases?
> > 
> Good question. What is the threshold of not using system_wq?
> With current set of max cpus/package, I did experiments with busy
> systems with a test functions with several workqueues and check if they
> drift in expiry. But I think we can move away from system_wq. Can this
> be add on patch?

Sure, I can do this.

Thanks and BR,
Ricardo