Patch "nvme-pci: clamp max_hw_sectors based on DMA optimized limitation" has been added to the 6.3-stable tree

This is a note to let you know that I've just added the patch titled

    nvme-pci: clamp max_hw_sectors based on DMA optimized limitation

to the 6.3-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     nvme-pci-clamp-max_hw_sectors-based-on-dma-optimized.patch
and it can be found in the queue-6.3 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 2e72417634ad93a7d39dfb78bb2fd94ba6ae13f0
Author: Adrian Huang <ahuang12@xxxxxxxxxx>
Date:   Fri Apr 21 16:08:00 2023 +0800

    nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
    
    [ Upstream commit 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd ]
    
    When running a fio test on a 448-core AMD server with an NVMe disk,
    a soft lockup or a hard lockup call trace is shown:
    
    [soft lockup]
    watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
    RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
    ...
    Call Trace:
     <IRQ>
     fq_flush_timeout+0x7d/0xd0
     ? __pfx_fq_flush_timeout+0x10/0x10
     call_timer_fn+0x2e/0x150
     run_timer_softirq+0x48a/0x560
     ? __pfx_fq_flush_timeout+0x10/0x10
     ? clockevents_program_event+0xaf/0x130
     __do_softirq+0xf1/0x335
     irq_exit_rcu+0x9f/0xd0
     sysvec_apic_timer_interrupt+0xb4/0xd0
     </IRQ>
     <TASK>
     asm_sysvec_apic_timer_interrupt+0x1f/0x30
    ...
    
    Obviously, fq_flush_timeout spends over 20 seconds. Here is the ftrace log:
    
                   |  fq_flush_timeout() {
                   |    fq_ring_free() {
                   |      put_pages_list() {
       0.170 us    |        free_unref_page_list();
       0.810 us    |      }
                   |      free_iova_fast() {
                   |        free_iova() {
     * 85622.66 us |          _raw_spin_lock_irqsave();
       2.860 us    |          remove_iova();
       0.600 us    |          _raw_spin_unlock_irqrestore();
       0.470 us    |          lock_info_report();
       2.420 us    |          free_iova_mem.part.0();
     * 85638.27 us |        }
     * 85638.84 us |      }
                   |      put_pages_list() {
       0.230 us    |        free_unref_page_list();
       0.470 us    |      }
       ...            ...
     $ 31017069 us |  }
    
    Most of the cores are contending for iova_rbtree_lock due to the iova
    flush queue mechanism.
    
    [hard lockup]
    NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
    RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
    
    Call Trace:
     <IRQ>
     _raw_spin_lock_irqsave+0x4f/0x60
     free_iova+0x27/0xd0
     free_iova_fast+0x4d/0x1d0
     fq_ring_free+0x9b/0x150
     iommu_dma_free_iova+0xb4/0x2e0
     __iommu_dma_unmap+0x10b/0x140
     iommu_dma_unmap_sg+0x90/0x110
     dma_unmap_sg_attrs+0x4a/0x50
     nvme_unmap_data+0x5d/0x120 [nvme]
     nvme_pci_complete_batch+0x77/0xc0 [nvme]
     nvme_irq+0x2ee/0x350 [nvme]
     ? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
     __handle_irq_event_percpu+0x53/0x1a0
     handle_irq_event_percpu+0x19/0x60
     handle_irq_event+0x3d/0x60
     handle_edge_irq+0xb3/0x210
     __common_interrupt+0x7f/0x150
     common_interrupt+0xc5/0xf0
     </IRQ>
     <TASK>
     asm_common_interrupt+0x2b/0x40
    ...
    
    ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
    the cores are contending for iova_rbtree_lock due to the iova flush
    queue mechanism.
    
    [Root Cause]
    The root cause is that max_hw_sectors_kb of the NVMe disk (mdts=10)
    is 4096kb, so its streaming DMA mappings cannot benefit from the
    scalable IOVA mechanism introduced by commit 9257b4a206fc
    ("iommu/iova: introduce per-cpu caching to iova allocation"), which
    only covers mappings no larger than 128kb.
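    
    (Quick illustrative arithmetic, not code from the driver; it assumes
    the controller's minimum memory page size is 4kb (MPSMIN = 0), which
    the report does not state:)
    
        /* MDTS is a power of two in units of the minimum memory page. */
        unsigned int mps_shift = 12;                  /* 4kb page (assumed)  */
        unsigned int mdts = 10;                       /* from the report     */
        size_t max_xfer = 1UL << (mps_shift + mdts);  /* 4mb, i.e. 4096kb    */
        u32 max_hw_sectors = max_xfer >> 9;           /* 8192 512-byte sectors */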
    
    To fix the lock contention issue, clamp max_hw_sectors based on the
    DMA optimized limitation in order to leverage the scalable IOVA
    mechanism.
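    
    The clamping itself is the one-line change in the diff below; the
    comment here only summarizes the two DMA API helpers involved and is
    not text from the driver:
    
        /*
         * dma_max_mapping_size() is the hard upper bound the DMA layer can
         * map at all.  dma_opt_mapping_size() is the largest mapping that
         * is still cheap; with IOMMU DMA it matches the per-cpu IOVA cache
         * limit (128kb), so frees stay off the iova_rbtree_lock slow path.
         */
        dev->ctrl.max_hw_sectors = min_t(u32,
                NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);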
    
    Note: The issue does not happen with another NVMe disk (mdts = 5 and
    max_hw_sectors_kb = 128).
    
    [1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4
    
    Suggested-by: Keith Busch <kbusch@xxxxxxxxxx>
    Reported-and-tested-by: Jiwei Sun <sunjw10@xxxxxxxxxx>
    Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
    Reviewed-by: Keith Busch <kbusch@xxxxxxxxxx>
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7772c0194d5a..a389f1ea0b151 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2960,7 +2960,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * over a single page.
 	 */
 	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
+		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 
 	/*


