Re: [PATCH] Enable non page boundary BAR device assignment

Alexander Graf <agraf@xxxxxxx> · Thu, 10 Dec 2009 11:31:54 +0100

On 10.12.2009, at 11:27, Michael S. Tsirkin wrote:

> On Thu, Dec 10, 2009 at 11:08:58AM +0100, Alexander Graf wrote:
>> 
>> On 10.12.2009, at 10:52, Alexander Graf wrote:
>> 
>>> 
>>> On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:
>>> 
>>>> On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
>>>>> On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
>>>>> 
>>>>>> While trying to get device passthrough working with an Emulex HBA,
>>>>>> KVM refused to pass it through because it has a BAR of 256 bytes:
>>>>>> 
>>>>>>      Region 0: Memory at d2100000 (64-bit, non-prefetchable) [size=4K]
>>>>>>      Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
>>>>>>      Region 4: I/O ports at b100 [size=256]
>>>>>> 
>>>>>> Since the page boundary is an arbitrary optimization to allow 1:1
>>>>>> mapping of physical to virtual addresses, we can still take the old
>>>>>> MMIO callback route.
>>>>>> 
>>>>>> So let's add a second code path that allows regions whose size is not
>>>>>> page-aligned, i.e. (size & 0xFFF) != 0, by routing accesses to them
>>>>>> through userspace.
>>>>> 
>>>>> That makes sense in general *but* the 4K-aligned check isn't just an
>>>>> optimization, it also has a security implication. Consider the
>>>>> theoretical case where a multi-function device has BARs for two
>>>>> functions on the same page (within a 4K boundary), and each function
>>>>> is assigned to a different guest. With your current patch both guests
>>>>> will be able to write to each other's BARs. Another case is where a
>>>>> device has a bug and you must not write beyond the BAR or Bad Things
>>>>> Happen. With this patch an *unprivileged* guest could exploit that bug
>>>>> and make bad things happen.
>>>>> 
>>>>> This can be fixed if the slow userspace mmio path checks that all MMIO
>>>>> accesses by a guest fall within the portion of the page that is
>>>>> assigned to it.
>>>> 
>>>> This patch seems to implement range checks correctly,
>>>> let me know if I am missing something.
>>>> 
>>>> One also notes that we currently link qemu with libpci
>>>> which I think requires admin cap to work.
>>>> However, in the future we might extend this to
>>>> also support getting device fds over a unix socket
>>>> from a more privileged process.
>>>> 
>>>> If or when this is done, we will have to be
>>>> extra careful when passing
>>>> a device file descriptor to an unprivileged qemu process if
>>>> the BARs are less than a full page in size: mapping
>>>> such a BAR will give qemu access outside this BAR.
>>>> 
>>>> A possible solution to this problem
>>>> if/when it arises would be adding yet another sysfs file
>>>> for each resource, which would allow read/write but not
>>>> mmap access, and perform range checks in the kernel.
>>> 
>>> Sounds like the best solution to this problem, yeah. Though we'd only need those for non-page-boundary BARs. So I guess the best would be to always export them from the kernel, but only use them when (size & (PAGE_SIZE-1)) != 0.
>> 
>> Hm, or add read/write fd functions that always do boundary checks to the existing interface and only allow mmap when (size & (PAGE_SIZE-1)) == 0. Or only allow non-aligned mmap when the admin cap is present.
>> 
>> Alex
> 
> This might break existing applications.
> We don't want that.

Well, currently you can't mmap the resource at all without at least r/w rights on the file, right?
But yeah, we'd probably be changing behavior in a way that could break someone - sigh.
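
For illustration, the two checks discussed in this thread could look roughly like the following in C. This is only a sketch: the helper names bar_is_page_aligned() and bar_access_ok() and the fixed 4K PAGE_SIZE are made up for this example and are not taken from the patch, from qemu, or from the kernel's sysfs code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4K pages, as in the 0xFFF mask used earlier in the thread. */
#define PAGE_SIZE 4096u

/*
 * Hypothetical helper: a BAR can be mapped 1:1 (mmap fast path) only if
 * both its base address and its size are multiples of the page size.
 */
static bool bar_is_page_aligned(uint64_t base, uint64_t size)
{
    return size != 0 && ((base | size) & (PAGE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper for the slow read/write path: accept an access only
 * if [offset, offset + len) lies entirely inside the BAR. The comparison
 * is written as a subtraction so that offset + len cannot overflow.
 */
static bool bar_access_ok(uint64_t bar_size, uint64_t offset, uint64_t len)
{
    return len != 0 && offset < bar_size && len <= bar_size - offset;
}
```

With the regions from the lspci output above, Region 0 (d2100000, 4K) passes the first check and Region 2 (d2101000, 256 bytes) does not; a 4-byte access at offset 254 into Region 2 is rejected by the second check.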

Alex