Re: [PATCH] Enable non page boundary BAR device assignment

Alexander Graf <agraf@xxxxxxx> · Thu, 10 Dec 2009 10:52:46 +0100

On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:

> On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
>> On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
>> 
>>> While trying to get device passthrough working with an emulex hba,
>>> kvm refused to pass it through because it has a BAR of 256 bytes:
>>> 
>>>        Region 0: Memory at d2100000 (64-bit, non-prefetchable) [size=4K]
>>>        Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
>>>        Region 4: I/O ports at b100 [size=256]
>>> 
>>> Since the page boundary is an arbitrary optimization to allow 1:1
>>> mapping of physical to virtual addresses, we can still take the old
>>> MMIO callback route.
>>> 
>>> So let's add a second code path that allows for size & 0xFFF != 0
>>> sized regions by looping it through userspace.
>> 
>> That makes sense in general *but* the 4K-aligned check isn't just an
>> optimization, it also has a security implication. Consider the
>> theoretical case where a multi-function device has BARs for two
>> functions on the same page (within a 4K boundary), and each function
>> is assigned to a different guest. With your current patch both guests
>> will be able to write to each other's BARs. Another case is where a
>> device has a bug and you must not write beyond the BAR or Bad Things
>> Happen. With this patch an *unprivileged* guest could exploit that bug
>> and make bad things happen.
>> 
>> This can be fixed if the slow userspace mmio path checks that all MMIO
>> accesses by a guest fall within the portion of the page that is
>> assigned to it.
> 
> This patch seems to implement range checks correctly,
> let me know if I am missing something.
> 
> One also notes that we currently link qemu with libpci
> which I think requires admin cap to work.
> However, in the future we might extend this to
> also support getting device fds over a unix socket
> from a more privileged process.
> 
> If or when this is done, we will have to be
> extra careful when passing a
> device file descriptor to an unprivileged qemu process if
> the BARs are less than a full page in size: mapping
> such a BAR will allow qemu access outside this BAR.
> 
> A possible solution to this problem
> if/when it arises would be adding yet another sysfs file
> for each resource, which would allow read/write but not
> mmap access, and perform range checks in the kernel.

Sounds like the best solution to this problem, yeah. Though we'd only need those for non-page-boundary BARs. So I guess the best approach would be to always export them from the kernel, but only use them when BAR & (PAGE_SIZE-1) is non-zero, i.e. for BARs that aren't page-granular.
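
To make the two checks under discussion concrete, here is a rough sketch in C. It is not taken from the patch or from any qemu/kernel API: the names bar_needs_slow_path, bar_access_ok and SKETCH_PAGE_SIZE are made up for illustration, 4K pages are assumed, and treating a misaligned base the same way as a misaligned size is my assumption as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4K pages, as in the examples in this thread. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: a BAR can only be mmap'ed safely if it starts on a
 * page boundary and covers whole pages. Anything else shares a page with
 * something the guest does not own and must go through read/write.
 */
static inline bool bar_needs_slow_path(uint64_t base, uint64_t size)
{
    return ((base | size) & (SKETCH_PAGE_SIZE - 1)) != 0;
}

/*
 * Hypothetical range check for the slow path: accept an access of `len`
 * bytes at `offset` only if it lies entirely inside the BAR. Written so
 * that offset + len can never overflow.
 */
static inline bool bar_access_ok(uint64_t bar_size, uint64_t offset,
                                 uint64_t len)
{
    return len != 0 && offset <= bar_size && len <= bar_size - offset;
}
```

With the regions quoted above, Region 0 (4K at d2100000) would stay on the mmap path, while Region 2 (256 bytes at d2101000) would take the slow path. There, a 4-byte access at offset 0xfc passes the check and one at 0xfe is rejected because it runs past the end of the BAR.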

Either way, FWIW the device assignment stuff needs to be completely rewritten for qemu upstream anyways. So while it's good to collect ideas for now, let's not put too much effort code-wise into the current code (unless it doesn't work).

Alex