Re: Why does kgd2kfd_interrupt() have to schedule work on a specific CPU?

Alex Deucher <alexdeucher@xxxxxxxxx> · Tue, 27 Jun 2023 11:42:09 -0400



+Felix, Philip

On Tue, Jun 27, 2023 at 4:42 AM Philipp Stanner <pstanner@xxxxxxxxxx> wrote:
>
> Hello folks,
>
> I'm currently trying to learn more about DRM and discovered the
> following code sequence:
>
>
> drivers/gpu/drm/amd/amdkfd/kfd_device.c, Line 824 on 6.4-rc7
>
> static inline void kfd_queue_work(struct workqueue_struct *wq,
>                                   struct work_struct *work)
> {
>         int cpu, new_cpu;
>
>         cpu = new_cpu = smp_processor_id();
>         do {
>                 new_cpu = cpumask_next(new_cpu, cpu_online_mask) % nr_cpu_ids;
>                 if (cpu_to_node(new_cpu) == numa_node_id())
>                         break;
>         } while (cpu != new_cpu);
>
>         queue_work_on(new_cpu, wq, work);
> }
>
> /* This is called directly from KGD at ISR. */
> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry)
> {
>         uint32_t patched_ihre[KFD_MAX_RING_ENTRY_SIZE];
>         bool is_patched = false;
>         unsigned long flags;
>
>         if (!kfd->init_complete)
>                 return;
>
>         if (kfd->device_info.ih_ring_entry_size > sizeof(patched_ihre)) {
>                 dev_err_once(kfd_device, "Ring entry too small\n");
>                 return;
>         }
>
>         spin_lock_irqsave(&kfd->interrupt_lock, flags);
>
>         if (kfd->interrupts_active
>             && interrupt_is_wanted(kfd, ih_ring_entry,
>                                    patched_ihre, &is_patched)
>             && enqueue_ih_ring_entry(kfd,
>                                      is_patched ? patched_ihre : ih_ring_entry))
>                 kfd_queue_work(kfd->ih_wq, &kfd->interrupt_work);
>
>         spin_unlock_irqrestore(&kfd->interrupt_lock, flags);
> }
>
>
> These functions seem to be exclusively invoked by amdgpu_irq_dispatch()
> in amdgpu_irq.c
> At first glance it seems to me that it's just a typical scenario taking
> place here: Interrupt arises, interrupt submits work to wq, then jumps
> back to sleep / former process execution context again.
>
> What I don't understand is why it's apparently important to schedule
> the work on a particular CPU.
>
> It seems that the do-while in kfd_queue_work() is searching for a CPU
> within the same NUMA-Node. Thus I suspect that this is done because
> either
> a) performance requires it or
> b) the work-function needs access to something that's only available
>    within the same node.
>
> I suspect there is an interrupt-related reason why that particular work
> should be enqueued on a specific CPU. Just by reading the code alone I
> can't really figure out why precisely that's necessary, though.
>
> Does someone have any hints for me? :)
>
> Cheers,
> Philipp
>
>