On Sun, Aug 6, 2023 at 8:22 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote: > > On Fri, Aug 04, 2023 at 04:46:11PM -0700, David Dai wrote: > > Hi Pavan, > > > > Thanks for reviewing! > > > > On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote: > > > > > > On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote: > > > > Introduce a virtualized cpufreq driver for guest kernels to improve > > > > performance and power of workloads within VMs. > > > > > > > > This driver does two main things: > > > > > > > > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the > > > > hint to schedule the vCPU threads and decide physical CPU frequency. > > > > > > > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the > > > > host CPU frequency by reading a MMIO region of a virtual cpufreq device > > > > to update the guest's frequency scaling factor periodically. This enables > > > > accurate Per-Entity Load Tracking for tasks running in the guest. > > > > > > > > Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx> > > > > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx> > > > > Signed-off-by: David Dai <davidai@xxxxxxxxxx> > > > > > > [...] > > > > > > > +static void virt_scale_freq_tick(void) > > > > +{ > > > > + struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id()); > > > > + struct virt_cpufreq_drv_data *data = policy->driver_data; > > > > + u32 max_freq = (u32)policy->cpuinfo.max_freq; > > > > + u64 cur_freq; > > > > + u64 scale; > > > > + > > > > + cpufreq_cpu_put(policy); > > > > + > > > > + cur_freq = (u64)data->ops->get_freq(policy); > > > > + cur_freq <<= SCHED_CAPACITY_SHIFT; > > > > + scale = div_u64(cur_freq, max_freq); > > > > + > > > > + this_cpu_write(arch_freq_scale, (unsigned long)scale); > > > > +} > > > > + > > > > > > We expect the host to provide the frequency in kHz, can you please add a > > > comment about it. It is not very obvious when you look at the > > > REG_CUR_FREQ_OFFSET register name. > > > > I’ll include a KHZ in the offset names. > > > Hi Pavan, Apologies for the slow responses, I was out on vacation for the week prior to last week. > Sure, that would help. Also, can you limit the scale to > SCHED_CAPACITY_SCALE? It may be possible that host may be running at a > higher frequency than max_freq advertised on this guest. Good catch, will include a check to limit freq_scale to SCHED_CAPACITY_SCALE. > > > > > > > > + > > > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy, > > > > + unsigned int target_freq) > > > > +{ > > > > + virt_cpufreq_set_perf(policy); > > > > + return target_freq; > > > > +} > > > > + > > > > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy, > > > > + unsigned int index) > > > > +{ > > > > + return virt_cpufreq_set_perf(policy); > > > > +} > > > > + > > > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy) > > > > +{ > > > > + struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data(); > > > > + struct cpufreq_frequency_table *table; > > > > + struct device *cpu_dev; > > > > + int ret; > > > > + > > > > + cpu_dev = get_cpu_device(policy->cpu); > > > > + if (!cpu_dev) > > > > + return -ENODEV; > > > > + > > > > + ret = dev_pm_opp_of_add_table(cpu_dev); > > > > + if (ret) > > > > + return ret; > > > > + > > > > + ret = dev_pm_opp_get_opp_count(cpu_dev); > > > > + if (ret <= 0) { > > > > + dev_err(cpu_dev, "OPP table can't be empty\n"); > > > > + return -ENODEV; > > > > + } > > > > + > > > > + ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table); > > > > + if (ret) { > > > > + dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret); > > > > + return ret; > > > > + } > > > > + > > > > + policy->freq_table = table; > > > > + policy->dvfs_possible_from_any_cpu = false; > > > > + policy->fast_switch_possible = true; > > > > + policy->driver_data = drv_data; > > > > + > > > > + /* > > > > + * Only takes effect if another FIE source such as AMUs > > > > + * have not been registered. > > > > + */ > > > > + topology_set_scale_freq_source(&virt_sfd, policy->cpus); > > > > + > > > > + return 0; > > > > + > > > > +} > > > > + > > > > > > Do we need to register as FIE source even with the below commit? By > > > registering as a source, we are not supplying any accurate metric. We > > > still fallback on the same source that cpufreq implements it. > > > > The arch_set_freq_scale() done at cpufreq driver’s frequency updates > > at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only > > represent the guest’s frequency request. However, this does not > > accurately represent the physical CPU’s frequency that the vCPU is > > running on. E.g. There may be other processes sharing the same > > physical CPU that results in a much higher CPU frequency than what’s > > requested by the vCPU. > > > > understood that policy->cur may not reflect the actual frequency. Is this > something needs to be advertised to cpufreq core so that it query the > underlying cpufreq driver and use it for frequency scale updates. This > also gives userspace to read the actual frequency when read from sysfs. > Adding a ->get() callback in the driver to advertise to the cpufreq core does not actually update the freq_scale if fast_switch is enabled. Since fast_switch is required for performance reasons, I don’t see value in adding ->get(). > In fact, cpufreq_driver_fast_switch() comment says that > cpufreq_driver::fast_switch() should return the *actual* frequency and > the same is used to update frequency scale updates. I understand that it > depends on other things like if host defer the frequency switch, the > value read from REG_CUR_FREQ_OFFSET may reflect the old value.. > > May be a comment in code would help. > Sounds good, I'll include a comment. Thanks, David > Thanks, > Pavan > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxx. >