On Sat, Aug 20, 2022 at 12:42:50AM -0700, Dongli Zhang wrote:
> Hello,
>
> I previously sent out RFC v1 to introduce an extra io_tlb_mem (created
> with SWIOTLB_ANY) in addition to the default io_tlb_mem (32-bit). The
> dev->dma_io_tlb_mem is set to either the default or the extra io_tlb_mem,
> depending on the dma mask. However, that is not good for setting
> dev->dma_io_tlb_mem transparently at the swiotlb layer, as suggested by
> Christoph Hellwig.
>
> https://lore.kernel.org/all/20220609005553.30954-1-dongli.zhang@xxxxxxxxxx/
>
> Therefore, this is another RFC v2 implementation following a different
> direction. The core ideas are:
>
> 1. The swiotlb is split into two zones, io_tlb_mem->zone[0] (32-bit) and
>    io_tlb_mem->zone[1] (64-bit).
>
> struct io_tlb_mem {
> 	struct io_tlb_zone zone[SWIOTLB_NR];
> 	struct dentry *debugfs;
> 	bool late_alloc;
> 	bool force_bounce;
> 	bool for_alloc;
> 	bool has_extra;
> };
>
> struct io_tlb_zone {
> 	phys_addr_t start;
> 	phys_addr_t end;
> 	void *vaddr;
> 	unsigned long nslabs;
> 	unsigned long used;
> 	unsigned int nareas;
> 	unsigned int area_nslabs;
> 	struct io_tlb_area *areas;
> 	struct io_tlb_slot *slots;
> };
>
> 2. By default, only io_tlb_mem->zone[0] is available. The
>    io_tlb_mem->zone[1] is allocated conditionally if:
>
>    - "swiotlb=" is configured to allocate the extra buffer, and
>    - SWIOTLB_EXTRA is set in the flags (this is to make sure arch(s)
>      other than x86/sev/xen will not enable it until it is fully tested
>      by each arch, e.g., mips/powerpc). Currently it is enabled for x86
>      and xen.
>
> 3. During swiotlb map, whether zone[0] (32-bit) or zone[1] (64-bit,
>    SWIOTLB_ANY) is used depends on
>    min_not_zero(*dev->dma_mask, dev->bus_dma_limit).
>
> To test the RFC v2, here is the QEMU command line.
>
> qemu-system-x86_64 -smp 8 -m 32G -enable-kvm -vnc :5 -hda disk.img \
> -kernel path-to-linux/arch/x86_64/boot/bzImage \
> -append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 swiotlb=32768,4194304,force" \
> -net nic -net user,hostfwd=tcp::5025-:22 \
> -device nvme,drive=nvme01,serial=helloworld -drive file=test.qcow2,if=none,id=nvme01 \
> -serial stdio
>
> There is the below in syslog. The extra 8GB buffer is allocated.
>
> [    0.152251] software IO TLB: area num 8.
> ... ...
> [    3.706088] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [    3.707334] software IO TLB: mapped default [mem 0x00000000bbfd7000-0x00000000bffd7000] (64MB)
> [    3.708585] software IO TLB: mapped extra [mem 0x000000061cc00000-0x000000081cc00000] (8192MB)
>
> After FIO is triggered over NVMe, the 64-bit buffer is used.
>
> $ cat /sys/kernel/debug/swiotlb/io_tlb_nslabs_extra
> 4194304
> $ cat /sys/kernel/debug/swiotlb/io_tlb_used_extra
> 327552
>
> Would you mind helping to confirm whether this is the right direction to
> go?
>
> Thank you very much!
>
> Cc: Konrad Wilk <konrad.wilk@xxxxxxxxxx>
> Cc: Joe Jin <joe.jin@xxxxxxxxxx>
> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> ---
>  arch/arm/xen/mm.c                      |  2 +-
>  arch/mips/pci/pci-octeon.c             |  5 +-
>  arch/x86/include/asm/xen/swiotlb-xen.h |  2 +-
>  arch/x86/kernel/pci-dma.c              |  6 +-
>  drivers/xen/swiotlb-xen.c              | 18 +-
>  include/linux/swiotlb.h                | 73 +++--
>
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 3d826c0..4edfa42 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -125,7 +125,7 @@ static int __init xen_mm_init(void)
>  		return 0;
>
>  	/* we can work with the default swiotlb */
> -	if (!io_tlb_default_mem.nslabs) {
> +	if (!io_tlb_default_mem.zone[SWIOTLB_DF].nslabs) {
>  		rc = swiotlb_init_late(swiotlb_size_or_default(),
>  				       xen_swiotlb_gfp(), NULL);
>  		if (rc < 0)

The first thing we need before doing anything about multiple default
pools is to get all the knowledge of these details hidden inside
swiotlb.c.
For swiotlb_init_late that seems easy, as we can just move the check
into it.

> diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
> index e457a18..0bf0859 100644
> --- a/arch/mips/pci/pci-octeon.c
> +++ b/arch/mips/pci/pci-octeon.c
> @@ -654,6 +654,9 @@ static int __init octeon_pci_setup(void)
>  		octeon_pci_mem_resource.end =
>  			octeon_pci_mem_resource.start + (1ul << 30);
>  	} else {
> +		struct io_tlb_mem *mem = &io_tlb_default_mem;
> +		struct io_tlb_zone *zone = &mem->zone[SWIOTLB_DF];
> +
>  		/* Remap the Octeon BAR 0 to map 128MB-(128MB+4KB) */
>  		octeon_npi_write32(CVMX_NPI_PCI_CFG04, 128ul << 20);
>  		octeon_npi_write32(CVMX_NPI_PCI_CFG05, 0);
> @@ -664,7 +667,7 @@ static int __init octeon_pci_setup(void)
>
>  		/* BAR1 movable regions contiguous to cover the swiotlb */
>  		octeon_bar1_pci_phys =
> -			io_tlb_default_mem.start & ~((1ull << 22) - 1);
> +			zone->start & ~((1ull << 22) - 1);

But we'll need to do something about this mess. I'll need help from the
octeon maintainer on it.

> -	x86_swiotlb_flags |= SWIOTLB_ANY;
> +	x86_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_EXTRA;

I don't think this is how it is supposed to be. SWIOTLB_ANY already says
"give me a pool with no addressing constraints". We don't need two pools
in that case. EXTRA is also not exactly a very helpful name here.

> #ifdef CONFIG_X86
> -int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
> +int xen_swiotlb_fixup(void *buf, unsigned long nslabs, unsigned int flags)
> {
> 	int rc;
> 	unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
> 	unsigned int i, dma_bits = order + PAGE_SHIFT;
> 	dma_addr_t dma_handle;
> 	phys_addr_t p = virt_to_phys(buf);
> +	unsigned int max_dma_bits = 32;

I think life will be a lot simpler if the addressing bits are passed to
this function instead of some kind of flags.

> +#define SWIOTLB_DF 0
> +#define SWIOTLB_EX 1
> +#define SWIOTLB_NR 2

These names are not very descriptive.