On Wed, Jan 12, 2022 at 03:26:41PM -0800, Srinivas Pandruvada wrote: > On Wed, 2022-01-12 at 20:47 +0100, Rafael J. Wysocki wrote: > > On Sat, Jan 8, 2022 at 4:46 AM Ricardo Neri > > <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote: > > > When hardware wants to inform the operating system about updates in > > > the HFI > > > table, it issues a package-level thermal event interrupt. For this, > > > hardware has new interrupt and status bits in the > > > IA32_PACKAGE_THERM_ > > > INTERRUPT and IA32_PACKAGE_THERM_STATUS registers. The existing > > > thermal > > > throttle driver already handles thermal event interrupts: it > > > initializes > > > the thermal vector of the local APIC as well as per-CPU and > > > package-level > > > interrupt reporting. It also provides routines to service such > > > interrupts. > > > Extend its functionality to also handle HFI interrupts. > > > > > > The frequency of the thermal HFI interrupt is specific to each > > > processor > > > model. On some processors, a single interrupt happens as soon as > > > the HFI is > > > enabled and hardware will never update HFI capabilities afterwards. > > > On > > > other processors, thermal and power constraints may cause thermal > > > HFI > > > interrupts every tens of milliseconds. > > > > > > To not overwhelm consumers of the HFI data, use delayed work to > > > throttle > > > the rate at which HFI updates are processed. > > > > > > > > [...] > > > > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val) > > > +{ > > > + struct hfi_instance *hfi_instance; > > > + int cpu = smp_processor_id(); > > > + struct hfi_cpu_info *info; > > > + u64 new_timestamp; > > > + > > > + if (!pkg_therm_status_msr_val) > > > + return; > > > + > > > + info = &per_cpu(hfi_cpu_info, cpu); > > > + if (!info) > > > + return; > > > + > > > + /* > > > + * It is possible that we get an HFI thermal interrupt on > > > this CPU > > > + * before its HFI instance is initialized. > Although this code can handle this situation, you can avoid this. > > You can call intel_hfi_online(cpu) before > " > l = apic_read(APIC_LVTTHMR); > apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED); > " Indeed this could work. intel_hfi_online() also enables the HFI. It is possible to get an HFI interrupt as soon as we enable it. I was concerned about getting the interrupt with APIC_LVTTHMR masked and miss it. This should not be a problem since we will get it as soon as we unmask it. > in thermal_throttle_online() > > In the same way call intel_hfi_offline(cpu) > after > > /* Mask the thermal vector before draining evtl. pending work */ > l = apic_read(APIC_LVTTHMR); > apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED); > > in thermal_throttle_offline() Sure, I can do this. > > > > > This is not a problem. The > > > + * CPU that enabled the interrupt for this package will > > > also get the > > > + * interrupt and is fully initialized. > > > + */ > > > + hfi_instance = info->hfi_instance; > > > + if (!hfi_instance) > > > + return; > > > > Generally, a CPU whose info has been initialized can be offline, so > > this code may run on an offline CPU. > > > > I'm not actually sure if this is a concern, but just mentioning it in > > case it is. > > > It will not matter as the handler of the message should be handle case > as CPU can go offline later after the message even if the CPU was > offline. > But I think we can avoid this situation. > > > > + > > > + /* > > > + * On most systems, all CPUs in the package receive a > > > package-level > > > + * thermal interrupt when there is an HFI update. It is > > > sufficient to > > > + * let a single CPU to acknowledge the update and schedule > > > work to > > > + * process it. The remaining CPUs can resume their work. > > > + */ > > > + if (!raw_spin_trylock(&hfi_instance->event_lock)) > > > + return; > > > + > > > + /* Skip duplicated updates. */ > > > + new_timestamp = *(u64 *)hfi_instance->hw_table; > > > + if (*hfi_instance->timestamp == new_timestamp) { > > > + raw_spin_unlock(&hfi_instance->event_lock); > > > + return; > > > + } > > > + > > > + raw_spin_lock(&hfi_instance->table_lock); > > > + > > > + /* > > > + * Copy the updated table into our local copy. This > > > includes the new > > > + * timestamp. > > > + */ > > > + memcpy(hfi_instance->local_table, hfi_instance->hw_table, > > > + hfi_features.nr_table_pages << PAGE_SHIFT); > > > + > > > + raw_spin_unlock(&hfi_instance->table_lock); > > > + raw_spin_unlock(&hfi_instance->event_lock); > > > + > > > + /* > > > + * Let hardware know that we are done reading the HFI table > > > and it is > > > + * free to update it again. > > > + */ > > > + pkg_therm_status_msr_val &= THERM_STATUS_CLEAR_PKG_MASK & > > > + ~PACKAGE_THERM_STATUS_HFI_UPDAT > > > ED; > > > + wrmsrl(MSR_IA32_PACKAGE_THERM_STATUS, > > > pkg_therm_status_msr_val); > > > + > > > + schedule_delayed_work(&hfi_instance->update_work, > > > HFI_UPDATE_INTERVAL); > > > > AFAICS, > For my understanding: > > > if update_work has been scheduled already, > queue_delayed_work_on is called > > > but is not pending > > yet, the delay will be set to the current time plus > > HFI_UPDATE_INTERVAL, but shouldn't it actually run earlier in that > > case? > > > " > if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, > work_data_bits(work))) { > __queue_delayed_work(cpu, wq, dwork, delay); > ret = true; > } > " > Pending bit will be set only one time, so delay will be from > the first call of queue_delayed_work_on() + HFI_UPDATE_INTERVAL. > > So on subsequent calls of schedule_delayed_work() the delay is always > with reference to first call. > > > > > Also it looks like the processing introduced in the next patch can > > take quite a bit of time if there is a sufficiently large number of > > CPUs in the package, so is it suitable for system_wq in all cases? > > > Good question. What is the threshold of not using system_wq? > With current set of max cpus/package, I did experiments with busy > systems with a test functions with several workqueues and check if they > drift in expiry. But I think we can move away from system_wq. Can this > be add on patch? Sure, I can do this. Thanks and BR, Ricardo