Hi Prarit, (2010/12/10 9:54), Prarit Bhargava wrote: > I have several systems which have the same problem: The PCIe AER corrected and > uncorrected masks have all the error bits set. This results in the inablility > to test with the aer_inject module & utility on those systems. > > Add the 'aer_mask_override' module parameter which will override the corrected > or uncorrected masks for a PCI device. The mask will have the bit > corresponding to the status passed into the aer_inject() function. > > After this patch it is possible to successfully use the aer_inject utility > on those PCI slots. > > Successfully tested by me on a Dell and Intel whitebox which exhibited the > mask problem. > > Signed-off-by: Prarit Bhargava <prarit@xxxxxxxxxx> > Which is your goal?: 1) Allow changing mask while test, to inject errors originally masked 2) Allow setting an error bit even if it is masked Did you refer commit b49bfd32901625e4adcfee011d2b32a43b4db67d ? > diff --git a/drivers/pci/pcie/aer/aer_inject.c b/drivers/pci/pcie/aer/aer_inject.c > index b3cf622..b4f1a71 100644 > --- a/drivers/pci/pcie/aer/aer_inject.c > +++ b/drivers/pci/pcie/aer/aer_inject.c > @@ -27,6 +27,10 @@ > #include <linux/stddef.h> > #include "aerdrv.h" > > +/* Override the existing corrected and uncorrected error masks */ > +static int aer_mask_override; > +module_param(aer_mask_override, bool, 0); > + > struct aer_error_inj { > u8 bus; > u8 dev; > @@ -322,7 +326,7 @@ static int aer_inject(struct aer_error_inj *einj) > unsigned long flags; > unsigned int devfn = PCI_DEVFN(einj->dev, einj->fn); > int pos_cap_err, rp_pos_cap_err; > - u32 sever, cor_mask, uncor_mask; > + u32 sever, cor_mask, uncor_mask, cor_mask_orig, uncor_mask_orig; > int ret = 0; > > dev = pci_get_domain_bus_and_slot((int)einj->domain, einj->bus, devfn); > @@ -344,6 +348,18 @@ static int aer_inject(struct aer_error_inj *einj) > pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, > &uncor_mask); > > + if (aer_mask_override) { > + cor_mask_orig = cor_mask; > + cor_mask &= !(einj->cor_status); > + pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, > + cor_mask); > + > + uncor_mask_orig = uncor_mask; > + uncor_mask &= !(einj->uncor_status); > + pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, > + uncor_mask); > + } > + > rp_pos_cap_err = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ERR); > if (!rp_pos_cap_err) { > ret = -ENOTTY; This config write actually changes masks of the real device, while injected error bits like einj->cor_status are just data on memory to be accessed from hooked ops, pci_read/write_aer(). I'm not sure whether it is always safe to revert the preset masks of real devices, even if it is only for a test in a short time. One of other approaches is making new fields like einj->cor_mask to "fake" mask registers like other aer registers. > @@ -378,14 +394,16 @@ static int aer_inject(struct aer_error_inj *einj) > err->header_log2 = einj->header_log2; > err->header_log3 = einj->header_log3; > > - if (einj->cor_status && !(einj->cor_status & ~cor_mask)) { > + if (!aer_mask_override && einj->cor_status && > + !(einj->cor_status & ~cor_mask)) { > ret = -EINVAL; > printk(KERN_WARNING "The correctable error(s) is masked " > "by device\n"); > spin_unlock_irqrestore(&inject_lock, flags); > goto out_put; > } > - if (einj->uncor_status && !(einj->uncor_status & ~uncor_mask)) { > + if (!aer_mask_override && einj->uncor_status && > + !(einj->uncor_status & ~uncor_mask)) { > ret = -EINVAL; > printk(KERN_WARNING "The uncorrectable error(s) is masked " > "by device\n"); If your answer for my previous question is 2), isn't it enough for you to have a flag like aer_allow_inject_masked_error or so for these blocks? > @@ -425,6 +443,13 @@ static int aer_inject(struct aer_error_inj *einj) > } > spin_unlock_irqrestore(&inject_lock, flags); > > + if (aer_mask_override) { > + pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, > + cor_mask_orig); > + pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, > + uncor_mask_orig); > + } > + > ret = pci_bus_set_aer_ops(dev->bus); > if (ret) > goto out_put; The aer_irq() at few lines after here will send an aer event to root port, as like as an error message interrupt from a dangling device which detected an unmasked error. Since here you restored masks of the testing device before processing event by root port, the root port might doubt the device is right source of error event if it have no unmasked error. Also the error handler of the testing device might not be called, or be called with no unmasked errors. ... Is this what you have expected? Thanks, H.Seto -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html