On Wed, 2022-01-12 at 20:47 +0100, Rafael J. Wysocki wrote:
> On Sat, Jan 8, 2022 at 4:46 AM Ricardo Neri
> <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > When hardware wants to inform the operating system about updates in
> > the HFI table, it issues a package-level thermal event interrupt. For
> > this, hardware has new interrupt and status bits in the
> > IA32_PACKAGE_THERM_INTERRUPT and IA32_PACKAGE_THERM_STATUS registers.
> > The existing thermal throttle driver already handles thermal event
> > interrupts: it initializes the thermal vector of the local APIC as
> > well as per-CPU and package-level interrupt reporting. It also
> > provides routines to service such interrupts. Extend its
> > functionality to also handle HFI interrupts.
> >
> > The frequency of the thermal HFI interrupt is specific to each
> > processor model. On some processors, a single interrupt happens as
> > soon as the HFI is enabled and hardware will never update HFI
> > capabilities afterwards. On other processors, thermal and power
> > constraints may cause thermal HFI interrupts every tens of
> > milliseconds.
> >
> > To not overwhelm consumers of the HFI data, use delayed work to
> > throttle the rate at which HFI updates are processed.
> >
> > [...]
> > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> > +{
> > +	struct hfi_instance *hfi_instance;
> > +	int cpu = smp_processor_id();
> > +	struct hfi_cpu_info *info;
> > +	u64 new_timestamp;
> > +
> > +	if (!pkg_therm_status_msr_val)
> > +		return;
> > +
> > +	info = &per_cpu(hfi_cpu_info, cpu);
> > +	if (!info)
> > +		return;
> > +
> > +	/*
> > +	 * It is possible that we get an HFI thermal interrupt on this CPU
> > +	 * before its HFI instance is initialized.

Although this code can handle this situation, you can avoid it. You can
call intel_hfi_online(cpu) before

"
	l = apic_read(APIC_LVTTHMR);
	apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
"

in thermal_throttle_online(). In the same way, call intel_hfi_offline(cpu)
after

	/* Mask the thermal vector before draining evtl. pending work */
	l = apic_read(APIC_LVTTHMR);
	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);

in thermal_throttle_offline().

> > This is not a problem. The
> > +	 * CPU that enabled the interrupt for this package will also get the
> > +	 * interrupt and is fully initialized.
> > +	 */
> > +	hfi_instance = info->hfi_instance;
> > +	if (!hfi_instance)
> > +		return;
>
> Generally, a CPU whose info has been initialized can be offline, so
> this code may run on an offline CPU.
>
> I'm not actually sure if this is a concern, but just mentioning it in
> case it is.
>
It will not matter, as the handler of the message has to deal with the
case of a CPU going offline after the message anyway, even if the CPU was
online when the message was sent. But I think we can avoid this
situation.
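To make the suggested ordering concrete, here is a minimal sketch (not
the actual therm_throt.c code; everything unrelated to the ordering is
elided):

	static int thermal_throttle_online(unsigned int cpu)
	{
		u32 l;

		/* Set up the HFI instance for this CPU first... */
		intel_hfi_online(cpu);

		/* ...and only then unmask the thermal vector. */
		l = apic_read(APIC_LVTTHMR);
		apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);

		/* ... remaining online work ... */
		return 0;
	}

	static int thermal_throttle_offline(unsigned int cpu)
	{
		u32 l;

		/* Mask the thermal vector before draining evtl. pending work */
		l = apic_read(APIC_LVTTHMR);
		apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);

		/* This CPU cannot take a thermal/HFI interrupt past this point. */
		intel_hfi_offline(cpu);

		/* ... remaining offline work ... */
		return 0;
	}

With that ordering, the per-CPU thermal vector stays masked until the HFI
instance for the CPU has been set up, and is masked again before the
instance is torn down.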
> > +
> > +	/*
> > +	 * On most systems, all CPUs in the package receive a package-level
> > +	 * thermal interrupt when there is an HFI update. It is sufficient to
> > +	 * let a single CPU to acknowledge the update and schedule work to
> > +	 * process it. The remaining CPUs can resume their work.
> > +	 */
> > +	if (!raw_spin_trylock(&hfi_instance->event_lock))
> > +		return;
> > +
> > +	/* Skip duplicated updates. */
> > +	new_timestamp = *(u64 *)hfi_instance->hw_table;
> > +	if (*hfi_instance->timestamp == new_timestamp) {
> > +		raw_spin_unlock(&hfi_instance->event_lock);
> > +		return;
> > +	}
> > +
> > +	raw_spin_lock(&hfi_instance->table_lock);
> > +
> > +	/*
> > +	 * Copy the updated table into our local copy. This includes the new
> > +	 * timestamp.
> > +	 */
> > +	memcpy(hfi_instance->local_table, hfi_instance->hw_table,
> > +	       hfi_features.nr_table_pages << PAGE_SHIFT);
> > +
> > +	raw_spin_unlock(&hfi_instance->table_lock);
> > +	raw_spin_unlock(&hfi_instance->event_lock);
> > +
> > +	/*
> > +	 * Let hardware know that we are done reading the HFI table and it is
> > +	 * free to update it again.
> > +	 */
> > +	pkg_therm_status_msr_val &= THERM_STATUS_CLEAR_PKG_MASK &
> > +				    ~PACKAGE_THERM_STATUS_HFI_UPDATED;
> > +	wrmsrl(MSR_IA32_PACKAGE_THERM_STATUS, pkg_therm_status_msr_val);
> > +
> > +	schedule_delayed_work(&hfi_instance->update_work, HFI_UPDATE_INTERVAL);
>
> AFAICS, if update_work has been scheduled already, but is not pending
> yet, the delay will be set to the current time plus
> HFI_UPDATE_INTERVAL, but shouldn't it actually run earlier in that
> case?
>
For my understanding: queue_delayed_work_on() is called as

"
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_delayed_work(cpu, wq, dwork, delay);
		ret = true;
	}
"

The pending bit is set only once, so the delay is taken from the first
call of queue_delayed_work_on() plus HFI_UPDATE_INTERVAL. On subsequent
calls of schedule_delayed_work() the delay is always with reference to
the first call.

> Also it looks like the processing introduced in the next patch can
> take quite a bit of time if there is a sufficiently large number of
> CPUs in the package, so is it suitable for system_wq in all cases?
>
Good question. What is the threshold for not using system_wq? With the
current maximum number of CPUs per package, I did experiments on busy
systems with a test function on several workqueues and checked whether
their expiry drifted. But I think we can move away from system_wq. Can
this be an add-on patch?
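As a rough sketch of what such an add-on patch could look like (names
such as hfi_updates_wq are illustrative only; this is not part of the
posted series), the driver could allocate its own ordered workqueue and
queue the update work there instead of on system_wq:

	/* Illustrative only; not part of the posted series. */
	static struct workqueue_struct *hfi_updates_wq;

	/* Allocated once, e.g. from intel_hfi_init(): */
	hfi_updates_wq = create_singlethread_workqueue("hfi-updates");
	if (!hfi_updates_wq)
		return;

	/* intel_hfi_process_event() then queues on it instead of system_wq: */
	queue_delayed_work(hfi_updates_wq, &hfi_instance->update_work,
			   HFI_UPDATE_INTERVAL);

That keeps potentially long table processing off system_wq while still
rate-limiting updates with the same HFI_UPDATE_INTERVAL delay.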
Thanks,
Srinivas

> > +}
> > +
> >  static void init_hfi_cpu_index(struct hfi_cpu_info *info)
> >  {
> >  	union cpuid6_edx edx;
> > @@ -259,8 +350,20 @@ void intel_hfi_online(unsigned int cpu)
> >
> >  	init_hfi_instance(hfi_instance);
> >
> > +	INIT_DELAYED_WORK(&hfi_instance->update_work, hfi_update_work_fn);
> > +	raw_spin_lock_init(&hfi_instance->table_lock);
> > +	raw_spin_lock_init(&hfi_instance->event_lock);
> > +
> >  	cpumask_set_cpu(cpu, hfi_instance->cpus);
> >
> > +	/*
> > +	 * Enable the hardware feedback interface and never disable it. See
> > +	 * comment on programming the address of the table.
> > +	 */
> > +	rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
> > +	msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
> > +	wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
> > +
> >  unlock:
> >  	mutex_unlock(&hfi_instance_lock);
> >  	return;
> > diff --git a/drivers/thermal/intel/intel_hfi.h b/drivers/thermal/intel/intel_hfi.h
> > index 56c6b2d75202..325aa78b745c 100644
> > --- a/drivers/thermal/intel/intel_hfi.h
> > +++ b/drivers/thermal/intel/intel_hfi.h
> > @@ -6,10 +6,12 @@
> >  void __init intel_hfi_init(void);
> >  void intel_hfi_online(unsigned int cpu);
> >  void intel_hfi_offline(unsigned int cpu);
> > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val);
> >  #else
> >  static inline void intel_hfi_init(void) { }
> >  static inline void intel_hfi_online(unsigned int cpu) { }
> >  static inline void intel_hfi_offline(unsigned int cpu) { }
> > +static inline void intel_hfi_process_event(__u64 pkg_therm_status_msr_val) { }
> >  #endif /* CONFIG_INTEL_HFI_THERMAL */
> >
> >  #endif /* _INTEL_HFI_H */
> > diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> > index 2a79598a7f7a..930e19eeeac6 100644
> > --- a/drivers/thermal/intel/therm_throt.c
> > +++ b/drivers/thermal/intel/therm_throt.c
> > @@ -619,6 +619,10 @@ void intel_thermal_interrupt(void)
> >  					PACKAGE_THERM_STATUS_POWER_LIMIT,
> >  					POWER_LIMIT_EVENT,
> >  					PACKAGE_LEVEL);
> > +
> > +		if (this_cpu_has(X86_FEATURE_HFI))
> > +			intel_hfi_process_event(msr_val &
> > +						PACKAGE_THERM_STATUS_HFI_UPDATED);
> >  	}
> >  }
> >
> > @@ -728,6 +732,12 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> >  		wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
> >  		      l | (PACKAGE_THERM_INT_LOW_ENABLE
> >  			| PACKAGE_THERM_INT_HIGH_ENABLE), h);
> > +
> > +		if (cpu_has(c, X86_FEATURE_HFI)) {
> > +			rdmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, l, h);
> > +			wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
> > +			      l | PACKAGE_THERM_INT_HFI_ENABLE, h);
> > +		}
> >  	}
> >
> >  	rdmsr(MSR_IA32_MISC_ENABLE, l, h);
> > --
> > 2.17.1
> >