Re: SWIOTLB allocates unneeded 64 MB buffer in guests

Igor Mammedov <imammedo@xxxxxxxxxx> · Tue, 13 Sep 2016 11:47:22 +0200

On Mon, 12 Sep 2016 10:14:55 -0700
Benjamin Serebrin <serebrin@xxxxxxxxxx> wrote:

> Sure, SWIOTLB is linux-specific but general bounce buffering isn't.
> 
> The idea is that the ACPI bit promises that the guest will not ever
> need [SWIOTLB] bounce buffering.  That means either no hotplugging at
> all, or no hotplugging of high-mem-incapable devices.  If our VMM ever
> _adds_ a device to its catalog that's capable of hotplug but not
> highmem, we'll clear the ACPI bit, for example.  I'm happy to discuss
> and iterate over what promises are made by the ACPI bit if you'd like.
The implication of the above is that you effectively push the kernel's
iommu=off option up the stack, where the VMM would have to be configured
to disable hotplug (which, for example, is enabled by default in QEMU).
Also, every existing and future device would have to be modified to expose
a highmem-capable property so that the emulator/firmware could decide whether
the above ACPI table is needed. That is doable if the emulator generates the
ACPI tables, but close to impossible (via standard interfaces)
if it is the firmware's job.

If hotplug is allowed by default and the SWIOTLB ACPI table is generated
whenever there aren't any low-mem-only devices present at boot,
then one would need to fix the kernel to try to allocate SWIOTLB dynamically
and to fail hotplug of a high-mem-incapable device if it's unable to do so.

Trying to save 64 MB out of more than 4 GB of memory at the above cost seems
a little bit excessive.

Another question:
why not run the emulator with an emulated IOMMU enabled? Linux then uses
the real IOMMU dma_ops (intel/amd), so the 64 MB for SWIOTLB is freed rather
than wasted, while 32-bit devices stay operational.
Last time I tested it, it worked just fine for both the coldplug and
hotplug cases, without any need to modify emulators or hardware
to provide a SWIOTLB ACPI table.

> The problem with dynamic allocation of the bounce buffer is that the
> SWIOTLB code seems to demand contiguous low memory, and allocating
> contiguous memory after boot is never guaranteed because of
> fragmentation and subsequent pinning.  The original code seems to be
> motivated by this: it does an early allocation of a contiguous low mem
> and then a late deallocation if it determines that SWIOTLB is not
> needed.  I imagine they wanted to cover cases where some high
> mem-incapable device needed a contiguous target buffer because it had
> no (or insufficient) scatter/gather capability.
> 
> One could tie hot plug of a bounce-buffer-requiring virtual device to
> causing SWIOTLB allocation, and fail the device initialization if the
> required buffer couldn't be allocated.  I don't know of any new
> virtual devices that require that, though, as high-mem-incapability is
> hopefully only a vestige of very old virtual or real devices.  And the
> plumbing complexity for doing this is much higher than seems
> justified.
It could possibly be done in a centralized manner in the kernel when a
device driver initializes the DMA API, for example in
 dma_set_mask_and_coherent().
Even if that's done, it would be a regression if the kernel were unable to
allocate the bounce buffer on demand and device init failed where it
used to work with a preallocated SWIOTLB.
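
To make the trade-off concrete, here is a rough userspace model of that
idea. It is only a sketch, not kernel code: every name in it (model_*,
struct low_mem, struct bounce_pool) is made up for illustration, and the
"largest free chunk" counter is just a stand-in for how fragmented low
memory is at the time the driver sets its DMA mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: largest contiguous free chunk of low memory. */
struct low_mem {
    size_t largest_free_chunk;
};

/* Hypothetical model of a bounce buffer that is allocated on demand. */
struct bounce_pool {
    bool   allocated;
    size_t size;
};

/* Build an n-bit address mask (a 32-bit-only device gets 0xffffffff). */
static uint64_t model_dma_bit_mask(unsigned int bits)
{
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* A device needs bouncing if some RAM lies above what its mask can reach. */
static bool model_needs_bounce(uint64_t dev_mask, uint64_t highest_ram_addr)
{
    return highest_ram_addr > dev_mask;
}

/*
 * Stand-in for the point where a driver sets its DMA mask.
 * Returns 0 on success, -1 if a bounce buffer is required but no
 * contiguous low-memory chunk of pool_size bytes is left.
 */
static int model_set_dma_mask(struct bounce_pool *pool, struct low_mem *mem,
                              uint64_t dev_mask, uint64_t highest_ram_addr,
                              size_t pool_size)
{
    assert(pool != NULL && mem != NULL);

    if (!model_needs_bounce(dev_mask, highest_ram_addr))
        return 0;               /* high-mem-capable device: nothing to do */
    if (pool->allocated)
        return 0;               /* an earlier device already set it up */
    if (mem->largest_free_chunk < pool_size)
        return -1;              /* fragmented: device init has to fail */

    mem->largest_free_chunk -= pool_size;
    pool->allocated = true;
    pool->size = pool_size;
    return 0;
}
```

The -1 path is exactly the regression case: with a buffer preallocated at
boot the same 32-bit device would have initialized fine.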

> 
> Thanks!
> Ben
> 
> On Mon, Sep 12, 2016 at 4:55 AM, Igor Mammedov <imammedo@xxxxxxxxxx> wrote:
> > On Sun, 28 Aug 2016 23:36:20 -0700
> > Benjamin Serebrin <serebrin@xxxxxxxxxx> wrote:
> >  
> >> Thanks, all,
> >>
> >> The general view from last week is to pursue an ACPI table that
> >> indicates that the SWIOTLB isn't needed.  I'll work with our local
> >> ACPI experts on table format.  
> > Isn't SWIOTLB a Linux-specific implementation detail?
> > Suppose a guest is started without SWIOTLB and the user later hotplugs
> > a device that is not capable of handling high mem; what then?
> >
> > Wouldn't it be better to have SWIOTLB created/allocated
> > on demand in the kernel (i.e. when devices that require it are present)
> > instead of making the hardware (hypervisor) provide some obscure
> > ACPI table quirk to fix a kernel issue?
> >  
> >>
> >> For existing guests, we'll work on language suggesting kernel command
> >> line options (iommu=off) if people are concerned, and will look into
> >> doing the command line setting in our own provided images.
> >>
> >> On Thu, Aug 25, 2016 at 7:45 PM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:  
> >> > 2016-08-26 9:16 GMT+08:00 Yang Zhang <yang.zhang.wz@xxxxxxxxx>:  
> >> >> On 2016/8/24 22:36, Benjamin Serebrin wrote:  
> >> >>>
> >> >>> iommu=off would kill the SWIOTLB as well, while swiotlb=1 consumes 1MB.
> >> >>>
> >> >>> However, maintaining guests' kernel commandlines is something we'd
> >> >>> like to stay away from if possible.  It's certainly a short-term  
> >> >>
> >> >>
> >> >> I don't quite understand why you'd stay away from the kernel command line. It provides
> >> >> more flexibility, allowing you to turn it on/off yourself.  
> >> >
> >> > I agree with Benjamin: it would result in customers having to tune their
> >> > guest OSes' kernel command lines, or in us supplying guest images with a
> >> > modified kernel command line.
> >> >
> >> > Regards,
> >> > Wanpeng Li
> >> >  
> >> >>
> >> >>  
> >> >>> answer, or something individual customers can choose to do today.  
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html  
> >  