On Thu, Sep 30, 2010 at 04:09:39PM -0700, Tom Lyon wrote: > On Monday, September 27, 2010 04:54:21 am Michael S. Tsirkin wrote: > > On Wed, Sep 22, 2010 at 02:18:24PM -0700, Tom Lyon wrote: > > > Signed-off-by: Tom Lyon <pugs@xxxxxxxxx> > > > > Some comments on the pci bits: > > > > After going over them for the Nth time - something needs to be done > > with the rvirt/write tables. I doubt anyone besides me and you > > has gone over them: one has to pull > > up the spec just to understand which bits are set here. Why can't we > > have a module init routine and use the macros in pci_regs.h for this > > purpose, as all other drivers do? We will also get to use the nice > > endian-ness macros linux has for portability ... > > I tried a couple of approaches before settling on this one. Yes, its dense, > perhaps its ugly, but a table approach beats open coding of rules. Sure. But fill the table in an init routine in code. > And I absolutely don't expect it to verifiable to someone without a PCI spec; > I had to have the PCI spec to write it! This is an example of where > correctness and readability pull to different places. And the defines in > pci_regs are incomplete and not real self-consistent anyways. > > Perhaps its time for someone to code up a different approach? Code talks! I'll try to find the time to show the basic idea if I fail to convey the meaning by mail. > > > > > +static struct perm_bits pci_cap_basic_perm[] = { > > > > What endian-ness is this? Native? > > You pass this to user as is but pci config is little endian... > > Also -are all these readonly? So const? read mostly? > Yes, I need some consts. > Endianness - I haven't really thought much and I don't have any big-endian > machines to test with. However, for now everything is byte-at-a-time which > makes things easier. The tables are built up of 32 bit integers though. > I'll take a pass at cleaning it up. > > > > > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */ > > > + { 0x00000003, 0xFFFFFFFF, }, /* 0x04 cmd - mem & io bits virt */ > > > > Interrupt disable bit is under host control. > Yes? But you make it writeable from guest. So the value will change under the feet of host/guest. > > > > > + { 0, 0, }, /* 0x08 class code & revision id */ > > > + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */ > > > + { 0, 0, }, /* 0x28 cardbus - not yet */ > > > + { 0, 0, }, /* 0x2c subsys vendor & dev */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */ > > > + { 0, 0, }, /* 0x34 capability ptr & resv */ > > > + { 0, 0, }, /* 0x38 resv */ > > > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */ > > > +}; > > > + > > > +static struct perm_bits pci_cap_pm_perm[] = { > > > + { 0, 0, }, /* 0x00 PM capabilities */ > > > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */ > > > +}; > > > + > > > +static struct perm_bits pci_cap_vpd_perm[] = { > > > + { 0, 0xFFFF0000, }, /* 0x00 address */ > > > + { 0, 0xFFFFFFFF, }, /* 0x04 data */ > > > +}; > > > + > > > +static struct perm_bits pci_cap_slotid_perm[] = { > > > + { 0, 0, }, /* 0x00 all read only */ > > > +}; > > > + > > > +/* 4 different possible layouts of MSI capability */ > > > +static struct perm_bits pci_cap_msi_10_perm[] = { > > > + { 0x00FF0000, 0x00FF0000, }, /* 0x00 MSI message control */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */ > > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */ > > > +}; > > > +static struct perm_bits pci_cap_msi_14_perm[] = { > > > + { 0x00FF0000, 0x00FF0000, }, /* 0x00 MSI message control */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */ > > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */ > > > +}; > > > +static struct perm_bits pci_cap_msi_20_perm[] = { > > > + { 0x00FF0000, 0x00FF0000, }, /* 0x00 MSI message control */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */ > > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */ > > > + { 0, 0xFFFFFFFF, }, /* 0x0c MSI mask bits */ > > > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI pending bits */ > > > +}; > > > +static struct perm_bits pci_cap_msi_24_perm[] = { > > > + { 0x00FF0000, 0x00FF0000, }, /* 0x00 MSI message control */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */ > > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */ > > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */ > > > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */ > > > + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */ > > > +}; > > > > You let guest control MSI mask and do not virtualize, > > but the host expects to control this register - it is > > used for masking during interrupt rebalancing. > > This will break the device, or even the host with > > 'unsafe interrupts' enabled. > > > > Note that if you virtualize it, you will have to virtualize > > the pending bits too. > I can't find any interaction between MSIs and irqbalance. I thought it was all > magic in the APIC. Please point me at counter-example. > Otherwise I don't see a problem with the guest controlling the mask. See mask_msi_irq/unmask_msi_irq. > > > + > > > +static struct perm_bits pci_cap_pcix_perm[] = { > > > + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */ > > > + { 0, 0, }, /* 0x04 PCI_X_STATUS */ > > > + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */ > > > + { 0, 0, }, /* 0x0c ECC first addr */ > > > + { 0, 0, }, /* 0x10 ECC second addr */ > > > + { 0, 0, }, /* 0x14 ECC attr */ > > > +}; > > > + > > > +/* pci express capabilities */ > > > +static struct perm_bits pci_cap_exp_perm[] = { > > > + { 0, 0, }, /* 0x00 PCIe capabilities */ > > > + { 0, 0, }, /* 0x04 PCIe device capabilities */ > > > + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */ > > > + { 0, 0, }, /* 0x0c PCIe link capabilities */ > > > + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */ > > > + { 0, 0, }, /* 0x14 PCIe slot capabilities */ > > > + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */ > > > + { 0, 0, }, /* 0x1c PCIe root port stuff */ > > > + { 0, 0, }, /* 0x20 PCIe root port stuff */ > > > +}; > > > + > > > +static struct perm_bits pci_cap_msix_perm[] = { > > > + { 0, 0, }, /* 0x00 MSI-X Enable */ > > > + { 0, 0, }, /* 0x04 table offset & bir */ > > > + { 0, 0, }, /* 0x08 pba offset & bir */ > > > +}; > > > + > > > +static struct perm_bits pci_cap_af_perm[] = { > > > + { 0, 0, }, /* 0x00 af capability */ > > > + { 0, 0x0001, }, /* 0x04 af flr bit */ > > > +}; > > > + > > > +static struct perm_bits *pci_cap_perms[] = { > > > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm, > > > + [PCI_CAP_ID_PM] = pci_cap_pm_perm, > > > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm, > > > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm, > > > + [PCI_CAP_ID_MSI] = NULL, /* special */ > > > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm, > > > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm, > > > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm, > > > + [PCI_CAP_ID_AF] = pci_cap_af_perm, > > > +}; > > > > > > > > > > > > > > > +int vfio_build_config_map(struct vfio_dev *vdev) > > > +{ > > > + struct pci_dev *pdev = vdev->pdev; > > > + u8 *map; > > > + int i, len; > > > + u8 pos, cap, tmp; > > > + u16 flags; > > > + int ret; > > > +#ifndef PCI_FIND_CAP_TTL > > > +#define PCI_FIND_CAP_TTL 48 > > > +#endif > > > > Could you add a comment to explain what is the TTL for? > > Where do you get the number 48? > > Why do you have the ifndef here? > > Why does it start with PCI_? > The TTL just keeps the code from looping forever if the chain in the pci > device is looped. The name and constant are copied from pci/pci.c Better to simply use pci_find_cap rather than copy code. > > > + int loops = PCI_FIND_CAP_TTL; > > > + > > > + map = kmalloc(pdev->cfg_size, GFP_KERNEL); > > > > So this will allocate a 4K for express, but you > > never even look at express capabilities, so > > you will never even touch them? > Someday this'll probably be extended for PCIe capabilities. > > > > > + if (map == NULL) > > > + return -ENOMEM; > > > + for (i = 0; i < pdev->cfg_size; i++) > > > + map[i] = 0xFF; > > > + vdev->pci_config_map = map; > > > + > > > + /* default config space */ > > > + for (i = 0; i < pci_capability_length[0]; i++) > > > + map[i] = 0; > > > + > > > + /* any capabilities? */ > > > + ret = pci_read_config_word(pdev, PCI_STATUS, &flags); > > > + if (ret < 0) > > > + return ret; > > > + if ((flags & PCI_STATUS_CAP_LIST) == 0) > > > + return 0; > > > + > > > + ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos); > > > + if (ret < 0) > > > + return ret; > > > + while (pos && --loops > 0) { > > > + ret = pci_read_config_byte(pdev, pos, &cap); > > > + if (ret < 0) > > > + return ret; > > > + if (cap == 0) { > > > + printk(KERN_WARNING "%s: cap 0\n", __func__); > > > + break; > > > + } > > > + if (cap > PCI_CAP_ID_MAX) { > > > + printk(KERN_WARNING "%s: unknown pci capability id %x\n", > > > + __func__, cap); > > > + len = 0; > > > + } else > > > + len = pci_capability_length[cap]; > > > + if (len == 0) { > > > + printk(KERN_WARNING "%s: unknown length for pci cap %x\n", > > > + __func__, cap); > > > + len = 4; > > > + } > > > + if (len == 0xFF) { > > > + switch (cap) { > > > + case PCI_CAP_ID_MSI: > > > + len = vfio_msi_cap_len(vdev, pos); > > > + if (len < 0) > > > + return len; > > > + break; > > > + case PCI_CAP_ID_PCIX: > > > + ret = pci_read_config_word(pdev, pos + 2, > > > + &flags); > > > + if (ret < 0) > > > + return ret; > > > + if (flags & 0x3000) > > > + len = 24; > > > + else > > > + len = 8; > > > + break; > > > + case PCI_CAP_ID_VNDR: > > > + /* length follows next field */ > > > + ret = pci_read_config_byte(pdev, pos + 2, &tmp); > > > + if (ret < 0) > > > + return ret; > > > + len = tmp; > > > + break; > > > + default: > > > + len = 0; > > > + break; > > > + } > > > + } > > > + > > > + for (i = 0; i < len; i++) { > > > + if (map[pos+i] != 0xFF) > > > + printk(KERN_WARNING > > > + "%s: pci config conflict at %x, " > > > + "caps %x %x\n", > > > + __func__, i, map[pos+i], cap); > > > + map[pos+i] = cap; > > > + } > > > + ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos); > > > + if (ret < 0) > > > + return ret; > > > + } > > > + if (loops <= 0) > > > + printk(KERN_ERR "%s: config space loop!\n", __func__); > > > > If you just want to find all predefined capabilities, > > you should porobably just use pci_find_cap. > But this is a lot quicker. Who cares? Come on, this is a ton of duplicated code. Using standard APIs you will get e.g. express support almost for free. > > > > > + return 0; > > > +} > > > + > > > +static int vfio_virt_init(struct vfio_dev *vdev) > > > +{ > > > + struct pci_dev *pdev = vdev->pdev; > > > + u32 *lp; > > > + int i; > > > + > > > + vdev->vconfig = kmalloc(256, GFP_KERNEL); > > > + if (vdev->vconfig == NULL) > > > + return -ENOMEM; > > > > This one is only 256. It looks like it's sometimes > > indexed by position which is up to cfg_size, > > so it can crash for pci express. > No, because the capability maps don't do pcie yet, but I'll clean this up. > > > > > + > > > + lp = (u32 *)vdev->vconfig; > > > + for (i = 0; i < 256/sizeof(u32); i++, lp++) > > > + pci_read_config_dword(pdev, i * sizeof(u32), lp); > > > + vdev->bardirty = 1; > > > + > > > + vdev->rbar[0] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0]; > > > + vdev->rbar[1] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_1]; > > > + vdev->rbar[2] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_2]; > > > + vdev->rbar[3] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_3]; > > > + vdev->rbar[4] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_4]; > > > + vdev->rbar[5] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_5]; > > > + vdev->rbar[6] = *(u32 *)&vdev->vconfig[PCI_ROM_ADDRESS]; > > > + > > > + /* for sr-iov devices */ > > > + vdev->vconfig[PCI_VENDOR_ID] = pdev->vendor & 0xFF; > > > + vdev->vconfig[PCI_VENDOR_ID+1] = pdev->vendor >> 8; > > > + vdev->vconfig[PCI_DEVICE_ID] = pdev->device & 0xFF; > > > + vdev->vconfig[PCI_DEVICE_ID+1] = pdev->device >> 8; > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > + * Restore the *real* BARs after we detect a backdoor reset. > > > + * (backdoor = some device specific technique that we didn't catch) > > > + */ > > > +static void vfio_bar_restore(struct vfio_dev *vdev) > > > +{ > > > + printk(KERN_WARNING "%s: restoring real bars\n", __func__); > > > + > > > +#define do_bar(off, which) \ > > > + pci_user_write_config_dword(vdev->pdev, off, vdev->rbar[which]) > > > + > > > + do_bar(PCI_BASE_ADDRESS_0, 0); > > > + do_bar(PCI_BASE_ADDRESS_1, 1); > > > + do_bar(PCI_BASE_ADDRESS_2, 2); > > > + do_bar(PCI_BASE_ADDRESS_3, 3); > > > + do_bar(PCI_BASE_ADDRESS_4, 4); > > > + do_bar(PCI_BASE_ADDRESS_5, 5); > > > + do_bar(PCI_ROM_ADDRESS, 6); > > > +#undef do_bar > > > +} > > > + > > > +/* > > > + * Pretend we're hardware and tweak the values > > > + * of the *virtual* pci BARs to reflect the hardware > > > + * capabilities > > > + */ > > > +static void vfio_bar_fixup(struct vfio_dev *vdev) > > > +{ > > > > So you do this on each read? > > Why don't you mask the appropriate bits on write? > > This is what real hardware does, after all. > > Then you won't need the bardirty field. > > > > > + struct pci_dev *pdev = vdev->pdev; > > > + int bar; > > > + u32 *lp; > > > + u64 mask; > > > + > > > + for (bar = 0; bar <= 5; bar++) { > > > + if (pci_resource_start(pdev, bar)) > > > + mask = ~(pci_resource_len(pdev, bar) - 1); > > > + else > > > + mask = 0; > > > + lp = (u32 *)vdev->vconfig + PCI_BASE_ADDRESS_0 + 4*bar; > > > + *lp &= (u32)mask; > > > + > > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) > > > + *lp |= PCI_BASE_ADDRESS_SPACE_IO; > > > + else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) { > > > + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY; > > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH) > > > + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH; > > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) { > > > + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64; > > > + lp++; > > > + *lp &= (u32)(mask >> 32); > > > + bar++; > > > + } > > > + } > > > + } > > > + > > > + if (pci_resource_start(pdev, PCI_ROM_RESOURCE)) > > > > Not if (pci_resource_len)? > > > > > + mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1); > > > + else > > > + mask = 0; > > > + lp = (u32 *)vdev->vconfig + PCI_ROM_ADDRESS; > > > + *lp &= (u32)mask; > > > + > > > + vdev->bardirty = 0; > > > > Aren't the pci values in little endian format? > > If so doing operations on them in native endianness is wrong. > > sparse generally is good at catching these, but you will > > have to avoid so many type casts and annotate endian-ness > > for it to be of any use. > Will clean. > > > > > +ssize_t vfio_config_readwrite(int write, > > > + struct vfio_dev *vdev, > > > + char __user *buf, > > > + size_t count, > > > + loff_t *ppos) > > > +{ > > > + struct pci_dev *pdev = vdev->pdev; > > > + int done = 0; > > > + int ret; > > > + u16 pos; > > > + > > > + > > > + if (vdev->pci_config_map == NULL) { > > > + ret = vfio_build_config_map(vdev); > > > + if (ret) > > > + goto out; > > > + } > > > + if (vdev->vconfig == NULL) { > > > + ret = vfio_virt_init(vdev); > > > + if (ret) > > > + goto out; > > > + } > > > > Why don't you init all this on open? > Keeps all the hooks in the same code file. Make it a function and call from open then. Less state -> less bugs. > > > > > + > > > + while (count > 0) { > > > + pos = *ppos; > > > + if (pos == pdev->cfg_size) > > > + break; > > > + if (pos > pdev->cfg_size) { > > > + ret = -EINVAL; > > > + goto out; > > > + } > > > + > > > + ret = vfio_config_rwbyte(write, vdev, pos, buf); > > > > So you split each access to 1 byte accesses. Hmm. > > This will break atomicity guarantees - maybe it does not matter, > > but this is why the _userspace accesses are there after all. > > > > Wouldn't it be easier to only support 1, 2 and 4 byte accesses? > > This is what all drivers do after all... > Future optimization. PCI is actually pretty good about not needing > atomics, so 8 and 16 bit processors are happy. Not atomic at the processor level, atomic at the PCI level. > I wanted to settle on the table format and contents before doing the > 2 and 4 byte accesses. > > > > > + > > > + if (ret < 0) > > > + goto out; > > > + buf++; > > > + done++; > > > + count--; > > > + (*ppos)++; > > > + } > > > + ret = done; > > > +out: > > > + return ret; > > > +} -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html