Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Fri, 25 May 2012 01:01:46 +0200 (CEST)

On Thu, 24 May 2012, Alex Williamson wrote:
> On Thu, 2012-05-24 at 18:53 -0300, Jan Kiszka wrote:
> > On 2012-05-24 18:39, Thomas Gleixner wrote:
> > > On Thu, 24 May 2012, Alex Williamson wrote:
> > >> On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > >>> +            if (address == msi_start + PCI_MSI_DATA_32)
> > >>> +                handle_cfg_write_msi(pci_dev, assigned_dev);
> > >>
> > >> Why didn't we just use range_covers_byte(address, len, pci_dev->msi_cap
> > >> + PCI_MSI_DATA_32) to start with?  But how does this handle the enable
> > >> bit?
> > > 
> > > The problem with the current implementation is that it only changes
> > > the routing if the msi entry goes from masked to unmasked state.
> > > 
> > > Linux does not mask the entries on affinity changes and never did,
> > > neither for MSI nor for MSI-X.
> > > 
> > > I know it's probably not according to the spec, but we can't fix that
> > > retroactively.
> > 
> > For MSI, this is allowed. For MSI-X, this would clearly be a Linux bug,
> > waiting for hardware to dislike this spec violation.
> > 
> > However, if this is the current behavior of such a prominent guest, I
> > guess we have to stop optimizing the QEMU MSI-X code that it only
> > updates routings on mask changes. Possibly other OSes get this wrong too...
> > 
> > Thanks, for the clarification. Should go into the changelog.
> 
> Hmm, if Linux didn't mask MSIX before updating vectors it'd not only be
> a spec violation, but my testing of the recent changes to fix MSIX
> vector updates for exactly this would have failed...
> 
>         } else if (msix_masked(&orig) && !msix_masked(entry)) {
>             ... update vector...
> 
> So I'm not entirely sure I believe that.  Thanks,

What happens is:

A write to /proc/irq/$N/smp_affinity calls into irq_set_affinity()
which does:

	if (irq_can_move_pcntxt(data)) {
		ret = chip->irq_set_affinity(data, mask, false);
	} else {
		irqd_set_move_pending(data);
		irq_copy_pending(desc, mask);
	}

MSI and MSI-X fall into the !irq_can_move_pcntxt() code path unless
the irq is remapped, which is not the case in a guest. That means that
we merily copy the new mask and set the move pending bit. 

MSI/MSI-X use the edge handler so on the next incoming interrupt, we
do

  irq_desc->chip->irq_ack()

which ends up in ack_apic_edge() which does:

static void ack_apic_edge(struct irq_data *data)
{
	irq_complete_move(data->chip_data);
	irq_move_irq(data);
	ack_APIC_irq();
}

irq_move_irq() is the interesting function. And that does

  irq_desc->chip->irq_mask()

before calling the irq_set_affinity() function which actually changes
the masks.

->irq_mask() ends up in mask_msi_irq(). 

Now that calls msi_set_mask_bit() and for MSI-X that actually masks
the irq. So ignore my MSI-X comment.

But for MSI this ends up in msi_mask_irq() which does:

        if (!desc->msi_attrib.maskbit)
	   return 0;

So in case desc->msi_attrib.maskbit is 0 we do not write anything out
and then the masked/unmasked logic in qemu fails.

Sorry, that I did not decode that down to this level before, but I was
in a hurry and assumed correctly that qemu is doing something
wrong. Not being familiar with that code did not help either :)

So the proper fix is that qemu tells the guest that mask bit is
supported and catches the mask bit toggling before writing it out to
the hardware for those devices which do not support it.

We'll have another look.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html