Re: large page size virtio issues

Hollis Blanchard <hollisb@xxxxxxxxxx> · Wed, 05 Nov 2008 10:32:18 -0600

On Wed, 2008-11-05 at 08:06 -0600, Anthony Liguori wrote:
> Rusty Russell wrote:
> > On Wednesday 05 November 2008 09:14:20 Hollis Blanchard wrote:
> >   
> >> Hi Rusty, I'm using a patch that changes the Linux base page size to
> >> 64K. (This is actually pretty common in ppc64 world, but I happen to be
> >> trying it on ppc32.)
> >>
> >> I'm seeing a problem with virtio. I think at least part of it can be
> >> explained by qemu's TARGET_PAGE_BITS==12, and the guest's
> >> PAGE_SHIFT==16. The guest allocates the queue, then passes the pfn (pa
> >> >> PAGE_SHIFT) to the virtio backend (vp_find_vq()). The backend then
> >> calculates the pa as pfn << TARGET_PAGE_BITS.
> >>
> >> I have to run right now, but quickly changing qemu TARGET_PAGE_BITS to
> >> 16 got me a little further but still didn't work. Any thoughts?
> >>     
> >
> > I see Anthony hardwired page size into the queue activation ABI for 
> > virtio_pci.
> 
> So did you, FWIW: virtio-balloon passes PFNs which are computed based on 
> PAGE_SHIFT.
> 
> >   I think that this should be an actual 4096 (or 12) rather than 
> > depending on guest page size:

I agree: it's simply a question of both sides of the interface agreeing
on the units used in the interface. As the interface is defined today,
both qemu and the guest virtio should use a fixed constant which is
neither (Linux) PAGE_SHIFT nor (qemu) TARGET_PAGE_BITS.
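
To make the mismatch concrete, here is a minimal sketch of what the two
sides compute today versus what they would compute with a shared
constant. The helper names and QUEUE_ADDR_SHIFT are made up for
illustration; this is not the actual qemu or Linux code.

```c
#include <stdint.h>

/* Illustrative constants, not taken from any real header. */
#define GUEST_PAGE_SHIFT  16  /* guest built with 64K base pages */
#define QEMU_PAGE_BITS    12  /* qemu's TARGET_PAGE_BITS today */
#define QUEUE_ADDR_SHIFT  12  /* hypothetical fixed interface constant */

/* Guest today: the pfn is in units of the guest's own page size. */
static uint32_t guest_pfn_today(uint64_t pa)
{
    return (uint32_t)(pa >> GUEST_PAGE_SHIFT);
}

/* Backend today: the pfn is interpreted in 4K units. */
static uint64_t backend_pa_today(uint32_t pfn)
{
    return (uint64_t)pfn << QEMU_PAGE_BITS;
}

/* Both sides agreeing on one fixed shift, independent of page size. */
static uint32_t guest_pfn_fixed(uint64_t pa)
{
    return (uint32_t)(pa >> QUEUE_ADDR_SHIFT);
}

static uint64_t backend_pa_fixed(uint32_t pfn)
{
    return (uint64_t)pfn << QUEUE_ADDR_SHIFT;
}
```

With a queue at pa 0x12340000, the "today" pair hands the backend
0x01234000, while the "fixed" pair round-trips to the original address.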

> So is the issue that PPC can support 4k or 16k pages, and the guest 
> happens to always use 16k pages?  Does the guest set any global flag 
> indicating it is using 16k pages?  Is there any way we could detect this 
> in QEMU?

To elaborate a little, I'm using a patch to PowerPC 440 Linux that
allows you to configure the base page size at build time; choices are
4K, 16K, and 64K. (The hardware supports more sizes, and with other
patches or other operating systems qemu would need to worry about 256K,
1M, 16M, and 256M pages.)

The page size is set per MMU mapping, and of course it's ridiculous to
walk the TLB to see if all the page sizes are the same to "detect" the
condition. In fact, regardless of the base page size, the (Linux) kernel
is always mapped with 256M pages, and if you consider hugetlbfs the
situation is even more fluid.

> I don't much like the idea of globally hard coding it to 4k. I'd rather 
> make it architecture specific.

Making the units architecture-specific doesn't solve the problem at all
AFAICS. It doesn't even solve my original problem on PowerPC 440 since
the guest page size can vary.

AFAIK the only reason to use a PFN in this interface in the first place
is to allow for physical addresses >32 bits. A hardcoded shift of 12
gives you 44 bits of physical address space (16 TB). This actually isn't
very big today, so using an architecture-specific hardcoded 4K size will
become an issue anyways, *even on x86*.
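
The arithmetic behind that limit, as a small sketch (the function names
are only for illustration): a 32-bit PFN shifted by N reaches 32+N bits
of physical address.

```c
#include <stdint.h>

/* Physical address bits reachable by a 32-bit PFN with the given shift. */
static unsigned reachable_bits(unsigned shift)
{
    return 32u + shift;
}

/*
 * The same limit in TB (units of 2^40 bytes).
 * Only meaningful for 8 <= shift <= 31, so the result is a whole
 * number of TB and the shift below stays in range.
 */
static uint64_t reachable_tb(unsigned shift)
{
    return 1ULL << (32u + shift - 40u);
}
```

So a shift of 12 gives 44 bits (16 TB), and a shift of 16 would give 48
bits (256 TB).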

Brainstorming backwards-compatible interface expansion possibilities:
     1. Rename the current interface to "4K_PFN", and add another, let's
        say "64K_PFN". Of course, if a guest with smaller pages uses the
        new interface, it must properly align its queue allocation.
     2. Rename the current interface to "4K_PFN". Use 64-bit writes to
        set VIRTIO_PCI_QUEUE_PFN. 32-bit architectures couldn't use
        this, which might be OK since practically speaking, I think
        32-bit architectures can address at most 36 bits of physical
        space. I also don't know what the semantics of 64-bit PCI
        writes are, or whether they are even allowed on physical
        hardware; it looks like Linux doesn't have an iowrite64, for
        example.
     3. Rename the current interface to "4K_PFN". Use multiple writes
        (high/low) to set VIRTIO_PCI_QUEUE_PFN. Not atomic. To simplify
        backend implementation, you could require that PFN_HIGH writes
        come before PFN_LOW.
     4. Use multiple writes (set page size, set PFN). SET_PAGE_SIZE must
        precede SET_PFN. Not atomic.
     5. Create a variable-sized interface (still 32-bit write), where
        the shift value is encoded in the value itself (I guess this is
        the FP mantissa+exponent approach). For example, if the low 8
        bits are the shift beyond 12, a write of 0x00000104 (page
        number 1, extra shift 4) would mean physical address
        1<<(12+4).
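
A rough sketch of how option 5 could be packed and unpacked, assuming
the page number sits in the upper 24 bits and the extra shift in the
low 8 bits. All names here are hypothetical, and returning 0 for an
out-of-range exponent is just a choice made for this sketch.

```c
#include <stdint.h>

#define BASE_SHIFT       12   /* shift implied by an exponent of 0 */
#define EXP_BITS         8    /* low 8 bits carry the extra shift */
#define EXP_MASK         ((1u << EXP_BITS) - 1)
/* Largest extra shift that keeps a 24-bit page number within 64 bits. */
#define MAX_EXTRA_SHIFT  (64 - BASE_SHIFT - (32 - EXP_BITS))

/* Guest side: pack page number (upper 24 bits) and extra shift (low 8). */
static uint32_t encode_queue_addr(uint32_t pfn, unsigned extra_shift)
{
    return (pfn << EXP_BITS) | (extra_shift & EXP_MASK);
}

/* Backend side: recover the physical address, or 0 if the shift is too big. */
static uint64_t decode_queue_addr(uint32_t val)
{
    unsigned extra_shift = val & EXP_MASK;
    uint64_t pfn = val >> EXP_BITS;

    if (extra_shift > MAX_EXTRA_SHIFT)
        return 0;
    return pfn << (BASE_SHIFT + extra_shift);
}
```

Under this packing, page number 1 with an extra shift of 4 is written
as 0x104 and decodes to 1<<16. The cost is that only 24 bits are left
for the page number itself.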

These solutions would solve both problems: a) making "guest page size"
explicit, and b) addressing more than 16TB of physical memory in the
future. I think I like #3 or #4 the best.
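
For #3, the backend side could look roughly like the following. This is
only a sketch under my own assumptions: the struct, the function names
and the "clear the latch after use" behaviour are invented here, not
existing qemu code.

```c
#include <stdint.h>

#define QUEUE_ADDR_SHIFT 12   /* fixed 4K units, as in "4K_PFN" */

/* Hypothetical per-queue state kept by the backend. */
struct vq_addr_state {
    uint32_t pfn_high;  /* latched by the PFN_HIGH write */
    uint64_t pa;        /* queue physical address, valid once active */
    int      active;
};

/* PFN_HIGH only latches the upper half; nothing is activated yet. */
static void write_pfn_high(struct vq_addr_state *s, uint32_t val)
{
    s->pfn_high = val;
}

/* PFN_LOW combines both halves and activates the queue. */
static void write_pfn_low(struct vq_addr_state *s, uint32_t val)
{
    uint64_t pfn = ((uint64_t)s->pfn_high << 32) | val;

    /* Only the low 52 bits of pfn survive the shift into a 64-bit pa. */
    s->pa = pfn << QUEUE_ADDR_SHIFT;
    s->active = 1;
    s->pfn_high = 0;  /* a later PFN_LOW-only write acts like today's interface */
}
```

Requiring PFN_HIGH first means the backend never sees a half-written
address: it acts only on the PFN_LOW write. Clearing the latch
afterwards is one way to let a guest that never writes PFN_HIGH keep
working unchanged.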

Hardcoding the current interface to mean "4K pages" (and updating qemu
and Linux to match) would solve my problem, and the 16TB limit could be
addressed in the future as needed.

-- 
Hollis Blanchard
IBM Linux Technology Center