From: Petr Tesařík <petr@xxxxxxxxxxx> Sent: Monday, July 10, 2023 2:36 AM
>
> On Sat, 8 Jul 2023 15:18:32 +0000
> "Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> wrote:
>
> > From: Petr Tesařík <petr@xxxxxxxxxxx> Sent: Friday, July 7, 2023 3:22 AM
> > >
> > > On Fri, 7 Jul 2023 10:29:00 +0100
> > > Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, July 6, 2023 1:07 AM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > > > From: Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx> Sent: Tuesday, June 27, 2023 2:54 AM
> > > > > > > >
> > > > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > > > big enough for this one bounce buffer. It is inserted into a per-device
> > > > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > > > buffer is unmapped.
> > > > > > > >
> > > > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > > > required after adding a new entry, because any address within a transient
> > > > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > > > it is passed to another CPU.
> > > > > > > >
> > > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > > > allocated again, then it happens only after an RCU quiescent state.
> > > > > > > >
> > > > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > > > a simple boundary check is no longer sufficient.
> > > > > > > >
> > > > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > > >
> > > > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > > > an expensive operation.
> > > > > > >
> > > > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > > > > memory size must be overly large, which wastes memory.
> > > > > >
> > > > > > If "mission critical workloads" are in a vm that allows overcommit and
> > > > > > no control over other vms in that same system, then you have worse
> > > > > > problems, sorry.
> > > > > >
> > > > > > Just don't do that.
> > > > > >
> > > > >
> > > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > > >
> > > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > > > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > > > memory at boot time for this purpose.
> > > > > But it's hard to know how much
> > > > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > > > which can help avoid needing to configure an overly large fixed size at boot.
> > > >
> > > > But, as you point out, memory allocation can fail at runtime, so how can
> > > > you "guarantee" that this will work properly anymore if you are going to
> > > > make it dynamic?
> > >
> > > In general, there is no guarantee, of course, because bounce buffers
> > > may be requested from interrupt context. I believe Michael is looking
> > > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > > possible, and then there is no need to dip into the coherent pool.
> > >
> > > Well, I have deliberately removed all complexities from my v3 series,
> > > but I have more WIP local topic branches in my local repo:
> > >
> > > - allow blocking allocations if possible
> > > - allocate a new pool before existing pools are full
> > > - free unused memory pools
> > >
> > > I can make a bigger series, or I can send another series as RFC if this
> > > is desired. ATM I don't feel confident enough that my v3 series will be
> > > accepted without major changes, so I haven't invested time into
> > > finalizing the other topic branches.
> > >
> > > @Michael: If you know that my plan is to introduce blocking allocations
> > > with a follow-up patch series, is the present approach acceptable?
> > >
> >
> > Yes, I think the present approach is acceptable as a first step. But
> > let me elaborate a bit on my thinking.
> >
> > I was originally wondering if it is possible for swiotlb_map() to detect
> > whether it is called from a context that allows sleeping, without the use
> > of SWIOTLB_MAY_SLEEP. This would get the benefits without having to
> > explicitly update drivers to add the flag.
> > But maybe that's too risky.
>
> This is a recurring topic and it has been discussed several times in
> the mailing lists. If you ask me, the best answer is this one by Andrew
> Morton, albeit a bit dated:
>
> https://lore.kernel.org/lkml/20080320201723.b87b3732.akpm@xxxxxxxxxxxxxxxxxxxx/

Thanks. That's useful context.

>
> > For
> > the CoCo VM scenario that I'm most interested in, being a VM implicitly
> > reduces the set of drivers that are being used, and so it's not that hard
> > to add the flag in the key drivers that generate most of the bounce
> > buffer traffic.
>
> Yes, that's my thinking as well.
>
> > Then I was thinking about a slightly different usage for the flag than what
> > you implemented in v2 of the series. In the case where swiotlb_map()
> > can't allocate slots because of the swiotlb pool being full (or mostly full),
> > kick the background thread (if it is not already awake) to allocate a
> > dynamic pool and grow the total size of the swiotlb. Then if
> > SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> > have implemented in this v3 of the series. But if SWIOTLB_MAY_SLEEP
> > *is* set, swiotlb_map() should sleep until the background thread has
> > completed the memory allocation and grown the size of the swiotlb.
> > After the sleep, retry the slot allocation. Maybe what I'm describing
> > is what you mean by "allow blocking allocations". :-)
>
> Not really, but I like the idea. After all, the only reason to have
> transient pools is when something is needed immediately while the
> background allocation is running.

You can also take the thinking one step further: For bounce buffer
requests that allow blocking, you could decide not to grow the pool
above a specified maximum. If the max has been reached and space is
not available, sleep until space is released by some other in-progress
request. This could be a valid way to handle peak demand while capping
the memory allocated to the bounce buffer pool.
There would be a latency hit because of the waiting, but that could be
a valid tradeoff for rare peaks. Of course, for requests that can't
block, you'd still need to allocate a transient pool.

Michael

>
> > This approach effectively throttles incoming swiotlb requests when space
> > is exhausted, and gives the dynamic sizing mechanism a chance to catch
> > up in an efficient fashion. Limiting transient pools to requests that can't
> > sleep will reduce the likelihood of exhausting the coherent memory
> > pools. And as you mentioned above, kicking the background thread at the
> > 90% full mark (or some such heuristic) also helps the dynamic sizing
> > mechanism keep up with demand.
>
> FWIW I did some testing, and my systems were not able to survive a
> sudden I/O peak without transient pools, no matter how low I set the
> threshold for kicking the background thread. OTOH I always tested with the
> smallest possible SWIOTLB (256 KiB * rounded up number of CPUs, e.g. 16
> MiB on my VM with 48 CPUs). Other sizes may lead to different results.
>
> As a matter of fact, the size of the initial SWIOTLB memory pool and the
> size(s) of additional pool(s) sound like interesting tunable parameters
> that I haven't explored in depth yet.
>
> Petr T
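
For reference, the allocation policy discussed in this thread could be
modelled roughly as below. This is only a user-space sketch of the decision
logic, not kernel code and not what any posted version of the series
implements: kick the background grower at a fill threshold or on failure,
fall back to a transient pool when the caller cannot sleep, and otherwise
wait either for growth or (once a cap is reached) for another request to
release space. Every name here (struct pool_model, bounce_request(),
background_grow(), bounce_release(), the enum values) is hypothetical, and
the slot accounting is deliberately simplistic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a single bounce-buffer request ends up doing in this model. */
enum bounce_result {
	BOUNCE_FROM_POOL,       /* slots were available in the regular pool */
	BOUNCE_FROM_TRANSIENT,  /* caller cannot sleep: one-off transient pool */
	BOUNCE_WAIT_FOR_GROWTH, /* caller may sleep: wait for the grower, retry */
	BOUNCE_WAIT_FOR_FREE,   /* pool is at its cap: wait for an unmap, retry */
};

/* Hypothetical bookkeeping; the real swiotlb structures look different. */
struct pool_model {
	size_t total_slots;    /* current size of the regular pool */
	size_t used_slots;     /* slots handed out and not yet released */
	size_t max_slots;      /* cap that background growth never exceeds */
	size_t grow_step;      /* slots added per background growth */
	unsigned kick_percent; /* fill level that wakes the grower, e.g. 90 */
	bool grow_pending;     /* grower has been kicked and has not finished */
};

/* Kick the modelled background thread unless it is busy or the cap is hit. */
static void maybe_kick(struct pool_model *p, bool force)
{
	bool above_mark = p->used_slots * 100 >= p->total_slots * p->kick_percent;

	if (!p->grow_pending && p->total_slots < p->max_slots &&
	    (force || above_mark))
		p->grow_pending = true;
}

/* Decide how one request for nslots slots is served. */
static enum bounce_result bounce_request(struct pool_model *p, size_t nslots,
					 bool may_sleep)
{
	if (p->total_slots - p->used_slots >= nslots) {
		p->used_slots += nslots;
		maybe_kick(p, false); /* grow early, before the pool is full */
		return BOUNCE_FROM_POOL;
	}

	maybe_kick(p, true); /* allocation failed: always try to grow */
	if (!may_sleep)
		return BOUNCE_FROM_TRANSIENT;
	return p->grow_pending ? BOUNCE_WAIT_FOR_GROWTH : BOUNCE_WAIT_FOR_FREE;
}

/* Stand-in for the background thread finishing one growth step. */
static void background_grow(struct pool_model *p)
{
	size_t room = p->max_slots - p->total_slots;

	p->total_slots += p->grow_step < room ? p->grow_step : room;
	p->grow_pending = false;
}

/* Stand-in for unmapping a buffer that came from the regular pool. */
static void bounce_release(struct pool_model *p, size_t nslots)
{
	assert(nslots <= p->used_slots);
	p->used_slots -= nslots;
}
```

A caller that gets one of the two WAIT results would sleep and then call
bounce_request() again. In this model transient pools are reached only by
requests that cannot sleep, which is the property argued for above to avoid
draining the coherent pools.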