On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > revealed the long living issue in dw_mmc.c driver, existing since the
> > time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> > Add Synopsys DesignWare mmc host driver."), also making kernel boot
> > broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> > with this message in dmesg:
> >
> >     mmcblk: probe of mmc0:0001 failed with error -22
> >
> > That's happening because mmc_blk_probe() fails when it calls
> > blk_validate_limits() consequently, which returns the error due to
> > failed max_segment_size check in this code:
> >
> >     /*
> >      * The maximum segment size has an odd historic 64k default that
> >      * drivers probably should override.  Just like the I/O size we
> >      * require drivers to at least handle a full page per segment.
> >      */
> >     ...
> >     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> >         return -EINVAL;
> >
> > In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> > sets .max_seg_size to 4 KiB:
> >
> >     mmc->max_seg_size = 0x1000;
> >
> > The comment in the code above explains why it's incorrect. Arnd
> > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > also what some other drivers are doing:
> >
> >    $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> >      wc -l
> >    18
>
> Nice summary!
>
> > This change is not only fixing the boot with 16K/64K pages, but also
> > leads to a better MMC performance. The linear write performance was
> > tested on E850-96 board (eMMC only), before commit [1] (where it's
> > possible to boot with 16K/64K pages without this fix, to be able to do
> > a comparison).
> > It was tested with this command:
> >
> >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> >
> > Test results are as follows:
> >
> >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
>
> Thanks for sharing these results. From what I can see here, the
> performance changes significantly with the page size, but barely
> with the max_seg_size, so this does not have the effect I was
> hoping for. On a more positive note this likely means that we
> don't have to urgently backport your fix.
>
> This could mean that either there is not much coalescing across
> pages after all, or that the bottleneck is somewhere else.
>
> > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > index 8e2d676b9239..cccd5633ff40 100644
> > --- a/drivers/mmc/host/dw_mmc.c
> > +++ b/drivers/mmc/host/dw_mmc.c
> > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> >         if (host->use_dma == TRANS_MODE_IDMAC) {
> >                 mmc->max_segs = host->ring_size;
> >                 mmc->max_blk_size = 65535;
> > -               mmc->max_seg_size = 0x1000;
> > -               mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > +               mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > +               mmc->max_seg_size = mmc->max_req_size;
>
> The change looks good to me.
>
> I see that the host->ring_size depends on PAGE_SIZE as well:
>
>     #define DESC_RING_BUF_SZ PAGE_SIZE
>     host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
>     host->sg_cpu = dmam_alloc_coherent(host->dev,
>                    DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
>
> I don't see any reason for the ring buffer size to be tied to
> PAGE_SIZE at all, it was probably picked as a reasonable
> default in the initial driver but isn't necessarily ideal.
>
> From what I can see, the number of 4KB elements in the
> ring can be as small as 128 (4KB pages, 64-bit addresses)
> or as big as 4096 (64KB pages, 32-bit addresses), which is
> quite a difference. If you are still motivated to drill
> down into this, could you try changing DESC_RING_BUF_SZ
> to a fixed size of either 4KB or 64KB and test again
> with the opposite page size, to see if that changes the
> throughput?
>

Hi Arnd,

Sorry for the late reply. I'm a bit busy with something else right
now (trying to enable this same driver for Exynos850 in U-Boot, hehe),
but I'll try to carve out some time later and tinker with
DESC_RING_BUF_SZ. For now, can we just apply this patch as is? As I
understand it, it fixes quite a major issue (at least from what I
heard), so it would be nice to have it in -next and -stable. Does that
sound reasonable?

Thanks!

> If a larger ring buffer gives us significantly better
> throughput, we may want to always use a higher number
> independent of page size. On the other hand, if the
> 64KB number (the 138MB/s) does not change with a smaller
> ring, we may as well reduce that in order to limit the
> maximum latency that is caused by a single I/O operation.
>
>      Arnd
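
For reference, the ring-size arithmetic discussed in this thread (128 to
4096 descriptors, and the 512 KiB / 2 MiB / 8 MiB request sizes in the
test results) can be checked with a small user-space sketch. This is not
driver code. The descriptor sizes of 32 and 16 bytes are assumptions
inferred from the figures quoted above (4 KiB / 128 and 64 KiB / 4096),
the 0x1000 per-descriptor data length is taken from the value the patch
removes, and all names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed descriptor sizes, inferred from the numbers quoted in the
 * thread rather than copied from the driver source. */
#define DESC_SIZE_64ADDR 32u    /* 4 KiB ring buffer  -> 128 descriptors  */
#define DESC_SIZE_32ADDR 16u    /* 64 KiB ring buffer -> 4096 descriptors */

/* Assumed data length covered by one descriptor (4 KiB), matching the
 * 0x1000 value that the patch replaces. */
#define DESC_DATA_LENGTH 0x1000u

/* Number of descriptors that fit into a ring buffer of the given size. */
static size_t ring_size(size_t ring_buf_sz, size_t desc_size)
{
    return ring_buf_sz / desc_size;
}

/* Largest request the ring can describe: one 4 KiB chunk per descriptor. */
static size_t max_req_size(size_t ring_buf_sz, size_t desc_size)
{
    return DESC_DATA_LENGTH * ring_size(ring_buf_sz, desc_size);
}

/* Cross-check against the figures mentioned in the thread. */
static void check_thread_figures(void)
{
    /* Smallest and largest rings: 4K pages/64-bit, 64K pages/32-bit. */
    assert(ring_size(4096, DESC_SIZE_64ADDR) == 128);
    assert(ring_size(65536, DESC_SIZE_32ADDR) == 4096);

    /* max_req_size for 4K, 16K and 64K pages with 64-bit descriptors. */
    assert(max_req_size(4096, DESC_SIZE_64ADDR) == 512u * 1024u);
    assert(max_req_size(16384, DESC_SIZE_64ADDR) == 2u * 1024u * 1024u);
    assert(max_req_size(65536, DESC_SIZE_64ADDR) == 8u * 1024u * 1024u);
}
```

Calling check_thread_figures() from a main() should pass silently if the
assumed sizes hold. Passing a fixed ring_buf_sz (4 KiB or 64 KiB) instead
of the page size models the experiment Arnd suggests.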