On Wed, 3 Apr 2024 at 00:43, Sam Protsenko <semen.protsenko@xxxxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
> >
> > On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > > revealed a long-standing issue in the dw_mmc.c driver, present since it
> > > was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc: Add Synopsys
> > > DesignWare mmc host driver."). It also breaks the kernel boot on
> > > platforms using the dw_mmc driver with 16K or 64K pages enabled, with
> > > this message in dmesg:
> > >
> > >     mmcblk: probe of mmc0:0001 failed with error -22
> > >
> > > That happens because mmc_blk_probe() fails when it subsequently calls
> > > blk_validate_limits(), which returns an error due to the failed
> > > max_segment_size check in this code:
> > >
> > >     /*
> > >      * The maximum segment size has an odd historic 64k default that
> > >      * drivers probably should override. Just like the I/O size we
> > >      * require drivers to at least handle a full page per segment.
> > >      */
> > >     ...
> > >     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> > >             return -EINVAL;
> > >
> > > When IDMAC (Internal DMA Controller) is used, dw_mmc.c always sets
> > > .max_seg_size to 4 KiB:
> > >
> > >     mmc->max_seg_size = 0x1000;
> > >
> > > The comment in the code above explains why that is incorrect. Arnd
> > > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > > also what some other drivers are doing:
> > >
> > >     $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> > >           wc -l
> > >     18
> >
> > Nice summary!
> >
> > > This change not only fixes the boot with 16K/64K pages, but also
> > > leads to better MMC performance. The linear write performance was
> > > tested on the E850-96 board (eMMC only), before commit [1] (so that
> > > it is possible to boot with 16K/64K pages without this fix and do a
> > > comparison). It was tested with this command:
> > >
> > >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> > >
> > > Test results are as follows:
> > >
> > >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> > >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> > >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> > >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> > >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> > >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
> >
> > Thanks for sharing these results. From what I can see here, the
> > performance changes significantly with the page size, but barely
> > with the max_seg_size, so this does not have the effect I was
> > hoping for. On a more positive note, this likely means that we
> > don't have to urgently backport your fix.
> >
> > This could mean that either there is not much coalescing across
> > pages after all, or that the bottleneck is somewhere else.
> >
> > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > > index 8e2d676b9239..cccd5633ff40 100644
> > > --- a/drivers/mmc/host/dw_mmc.c
> > > +++ b/drivers/mmc/host/dw_mmc.c
> > > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> > >         if (host->use_dma == TRANS_MODE_IDMAC) {
> > >                 mmc->max_segs = host->ring_size;
> > >                 mmc->max_blk_size = 65535;
> > > -               mmc->max_seg_size = 0x1000;
> > > -               mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > > +               mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > > +               mmc->max_seg_size = mmc->max_req_size;
> >
> > The change looks good to me.
> >
> > I see that host->ring_size depends on PAGE_SIZE as well:
> >
> >     #define DESC_RING_BUF_SZ PAGE_SIZE
> >     host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> >     host->sg_cpu = dmam_alloc_coherent(host->dev,
> >                        DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
> >
> > I don't see any reason for the ring buffer size to be tied to
> > PAGE_SIZE at all; it was probably picked as a reasonable default
> > in the initial driver but isn't necessarily ideal.
> >
> > From what I can see, the number of 4KB elements in the ring can be
> > as small as 128 (4KB pages, 64-bit addresses) or as big as 4096
> > (64KB pages, 32-bit addresses), which is quite a difference. If you
> > are still motivated to drill down into this, could you try changing
> > DESC_RING_BUF_SZ to a fixed size of either 4KB or 64KB and test
> > again with the opposite page size, to see if that changes the
> > throughput?
> >
>
> Hi Arnd,
>
> Sorry for the late reply. I'm a bit busy with something else right now
> (trying to enable this same driver for Exynos850 in U-Boot, hehe), so
> I'll try to carve out some time later and tinker with DESC_RING_BUF_SZ.
> But for now, can we just apply this patch as is? As I understand, it
> fixes quite a major issue (at least from what I heard), so it would be
> nice to have it in -next and -stable. Does that sound reasonable?

Ideally, I would prefer it if you could try out Arnd's proposal, unless
you think that's too far ahead for you, of course? The point is that I
would rather avoid us having to mess around with these kinds of
configurations.

> Thanks!
>
> > If a larger ring buffer gives us significantly better throughput,
> > we may want to always use a higher number independent of page size.
> > On the other hand, if the 64KB number (the 138MB/s) does not change
> > with a smaller ring, we may as well reduce that in order to limit
> > the maximum latency that is caused by a single I/O operation.
> >
> > Arnd

Kind regards
Uffe
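
For reference, a rough sketch of the experiment Arnd suggests (pinning
DESC_RING_BUF_SZ to a fixed size instead of PAGE_SIZE) could look like the
hunk below. It is untested, and the 64 KiB value is only one of the two
sizes he proposes (0x1000 would be the 4 KiB variant):

    --- a/drivers/mmc/host/dw_mmc.c
    +++ b/drivers/mmc/host/dw_mmc.c
    -#define DESC_RING_BUF_SZ	PAGE_SIZE
    +/* Fixed descriptor ring size, no longer tied to PAGE_SIZE (use 0x1000 to test 4 KiB) */
    +#define DESC_RING_BUF_SZ	0x10000

With a 64 KiB ring this gives 0x10000 / sizeof(struct idmac_desc_64addr) =
65536 / 32 = 2048 descriptors with 64-bit addressing, or 65536 / 16 = 4096
with 32-bit descriptors, compared to the 128 descriptors a 4K-page, 64-bit
configuration gets today.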