Re: SWIOTLB allocates unneeded 64 MB buffer in guests

Benjamin Serebrin <serebrin@xxxxxxxxxx> · Mon, 19 Sep 2016 09:32:35 -0700

On Tue, Sep 13, 2016 at 2:47 AM, Igor Mammedov <imammedo@xxxxxxxxxx> wrote:
> On Mon, 12 Sep 2016 10:14:55 -0700
> Benjamin Serebrin <serebrin@xxxxxxxxxx> wrote:
>
>> Sure, SWIOTLB is Linux-specific but general bounce buffering isn't.
>>
>> The idea is that the ACPI bit promises that the guest will not ever
>> need [SWIOTLB] bounce buffering.  That means either no hotplugging at
>> all, or no hotplugging of high-mem-incapable devices.  If our VMM ever
>> _adds_ a device to its catalog that's capable of hotplug but not
>> highmem, we'll clear the ACPI bit, for example.  I'm happy to discuss
>> and iterate over what promises are made by the ACPI bit if you'd like.
> The implication of the above is that you effectively push the kernel's
> iommu=off option up the stack, where it would have to be configured to
> disable hotplug (which is, for example, enabled by default in QEMU).
> Also, every existing/future device has to be modified to provide a
> highmem-cap property so that the emulator/firmware can decide whether
> the above ACPI table is necessary. It's doable if an emulator generates
> the ACPI tables but close to impossible (via standard interfaces)
> if it's the firmware's job.
>
> If hotplug is allowed by default and the SWIOTLB ACPI table is generated
> when there aren't any low-mem devices at boot,
> then one would need to fix the kernel to try to allocate SWIOTLB
> dynamically and fail hotplug of a high-mem-incapable device if it's
> unable to do so.
>
> Trying to save 64MB out of more than 4GB of memory at the above cost
> seems a little bit excessive.

I don't recommend such complexity; I was proposing a hint bit in ACPI
as a simple promise from the hypervisor.
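
To make the intended promise concrete, here is a minimal sketch of the decision a VMM could make before setting such a hint. Everything in it is an assumption for illustration: the `Device` fields, the device names, and `may_set_no_bounce_hint()` are hypothetical and do not correspond to any existing ACPI field or emulator interface. It also simplifies "high-mem-capable" to "can address 64 bits".

```python
from dataclasses import dataclass


# Hypothetical catalog entry; field names are illustrative assumptions,
# not a real ACPI or emulator interface.
@dataclass(frozen=True)
class Device:
    name: str
    dma_bits: int          # widest DMA address the device can generate
    present_at_boot: bool  # attached when the guest starts
    hotpluggable: bool     # may be added by the VMM after boot


def may_set_no_bounce_hint(catalog):
    """Return True only if no device the guest could ever see needs bouncing.

    Simplification: any device with fewer than 64 DMA address bits is
    treated as high-mem-incapable.
    """
    reachable = (d for d in catalog if d.present_at_boot or d.hotpluggable)
    return all(d.dma_bits >= 64 for d in reachable)


if __name__ == "__main__":
    catalog = [
        Device("paravirt-net", 64, present_at_boot=True, hotpluggable=True),
        Device("legacy-nic", 32, present_at_boot=False, hotpluggable=False),
    ]
    print(may_set_no_bounce_hint(catalog))
```

In this sketch, a 32-bit device that is neither present nor hotpluggable does not block the hint; adding one hotpluggable 32-bit device to the catalog clears it.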

64MB is about 1.6% of a 4GB machine.  We wanted an easy way to give it
back to the guest.
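
The share of guest memory taken by the buffer can be checked with simple arithmetic. The sketch below assumes binary units (64 MiB, 4 GiB); the function name is hypothetical, and the other guest sizes are only illustrative.

```python
MIB = 1 << 20
GIB = 1 << 30


def bounce_buffer_share(guest_ram_bytes, buffer_bytes=64 * MIB):
    """Return the bounce buffer size as a percentage of guest RAM."""
    return 100.0 * buffer_bytes / guest_ram_bytes


if __name__ == "__main__":
    # Guest sizes other than 4 GiB are illustrative, not from the thread.
    for ram_gib in (1, 4, 16):
        share = bounce_buffer_share(ram_gib * GIB)
        print(f"{ram_gib:>2} GiB guest: {share:.4f}% of RAM")
```

For a 4 GiB guest this works out to 1.5625%; the fixed cost weighs more heavily on smaller guests.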

>
> Another question:
> why not run the emulator with an emulated IOMMU enabled? Then Linux uses
> the real IOMMU dma_ops (intel/amd), and the 64MB for SWIOTLB is freed
> rather than wasted, while 32-bit devices stay operational.
> Last time I tested it, it worked just fine for both the coldplug and
> hotplug cases, without any need to mess with emulators or hardware
> to provide a SWIOTLB ACPI table.
>

IOMMU comes with its own overheads; for example, until kernel v4.7,
where the speedup in the Intel IOMMU ops was merged, the guest
intel-iommu.c code had significant performance scalability issues.  I
would be more willing to try to get distros to backport a simple
no-SWIOTLB change than the fairly invasive IOMMU optimizations.  We'll
be living with many pre-4.7 guests for quite a while.

>
>> The problem with dynamic allocation of the bounce buffer is that the
>> SWIOTLB code seems to demand contiguous low memory, and allocating
>> contiguous memory after boot is never guaranteed because of
>> fragmentation and subsequent pinning.  The original code seems to be
>> motivated by this: it does an early allocation of a contiguous low mem
>> and then a late deallocation if it determines that SWIOTLB is not
>> needed.  I imagine they wanted to cover cases where some
>> high-mem-incapable device needed a contiguous target buffer because it had
>> no (or insufficient) scatter/gather capability.
>>
>> One could tie hot plug of a bounce-buffer-requiring virtual device to
>> causing SWIOTLB allocation, and fail the device initialization if the
>> required buffer couldn't be allocated.  I don't know of any new
>> virtual devices that require that, though, as high-mem-incapability is
>> hopefully only a vestige of very old virtual or real devices.  And the
>> plumbing complexity for doing this is much higher than seems
>> justified.
> It could possibly be done in a centralized manner in the kernel when the
> device driver initializes the DMA API, for example in
>  dma_set_mask_and_coherent().
> Even if that's done, it would be a regression if the kernel is unable to
> allocate the bounce buffer on demand and device init fails where it was
> working with a preallocated SWIOTLB.
>
>
>>
>> Thanks!
>> Ben
>>
>> On Mon, Sep 12, 2016 at 4:55 AM, Igor Mammedov <imammedo@xxxxxxxxxx> wrote:
>> > On Sun, 28 Aug 2016 23:36:20 -0700
>> > Benjamin Serebrin <serebrin@xxxxxxxxxx> wrote:
>> >
>> >> Thanks, all,
>> >>
>> >> The general view from last week is to pursue an ACPI table that
>> >> indicates that the SWIOTLB isn't needed.  I'll work with our local
>> >> ACPI experts on table format.
>> > Isn't SWIOTLB a Linux-specific implementation detail?
>> > Suppose a guest is started without SWIOTLB and the user later hotplugs
>> > a device that is not capable of handling high mem; what then?
>> >
>> > Wouldn't it be better to have SWIOTLB created/allocated
>> > on demand in the kernel (i.e. in the presence of devices that require it)
>> > instead of making the hardware (hypervisor) provide some obscure
>> > ACPI table quirk to fix a kernel issue?
>> >
>> >>
>> >> For existing guests, we'll work on language suggesting kernel command
>> >> line options (iommu=off) if people are concerned, and will look into
>> >> doing the command line setting in our own provided images.
>> >>
>> >> On Thu, Aug 25, 2016 at 7:45 PM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>> >> > 2016-08-26 9:16 GMT+08:00 Yang Zhang <yang.zhang.wz@xxxxxxxxx>:
>> >> >> On 2016/8/24 22:36, Benjamin Serebrin wrote:
>> >> >>>
>> >> >>> iommu=off would kill the SWIOTLB as well, while swiotlb=1 consumes 1MB.
>> >> >>>
>> >> >>> However, maintaining guests' kernel commandlines is something we'd
>> >> >>> like to stay away from if possible.  It's certainly a short-term
>> >> >>
>> >> >>
>> >> >> I don't quite understand why you want to stay away from the kernel
>> >> >> command line. It provides more flexibility, allowing you to turn it
>> >> >> on or off yourself.
>> >> >
>> >> > I agree with Benjamin; it would mean customers have to tune their
>> >> > guest OSes' kernel command lines, or we have to supply guest images
>> >> > with a modified kernel command line.
>> >> >
>> >> > Regards,
>> >> > Wanpeng Li
>> >> >
>> >> >>
>> >> >>
>> >> >>> answer, or something individual customers can choose to do today.
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>