On Wed, 3 Apr 2024 at 00:43, Sam Protsenko <semen.protsenko@xxxxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
> >
> > On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > > revealed a long-standing issue in the dw_mmc.c driver, present since it
> > > was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc: Add Synopsys
> > > DesignWare mmc host driver."). It also breaks the kernel boot on
> > > platforms using the dw_mmc driver with 16K or 64K pages enabled, with
> > > this message in dmesg:
> > >
> > >     mmcblk: probe of mmc0:0001 failed with error -22
> > >
> > > That happens because mmc_blk_probe() fails when it subsequently calls
> > > blk_validate_limits(), which returns an error due to the failed
> > > max_segment_size check in this code:
> > >
> > >     /*
> > >      * The maximum segment size has an odd historic 64k default that
> > >      * drivers probably should override. Just like the I/O size we
> > >      * require drivers to at least handle a full page per segment.
> > >      */
> > >     ...
> > >     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> > >             return -EINVAL;
> > >
> > > When IDMAC (Internal DMA Controller) is used, dw_mmc.c always sets
> > > .max_seg_size to 4 KiB:
> > >
> > >     mmc->max_seg_size = 0x1000;
> > >
> > > The comment in the code above explains why that is incorrect. Arnd
> > > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > > also what some other drivers are doing:
> > >
> > >     $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> > >           wc -l
> > >     18
> >
> > Nice summary!
> >
> > > This change not only fixes the boot with 16K/64K pages, but also
> > > leads to better MMC performance. The linear write performance was
> > > tested on the E850-96 board (eMMC only), before commit [1] (so that
> > > it is possible to boot with 16K/64K pages without this fix and do a
> > > comparison). It was tested with this command:
> > >
> > >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> > >
> > > Test results are as follows:
> > >
> > >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> > >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> > >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> > >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> > >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> > >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
> >
> > Thanks for sharing these results. From what I can see here, the
> > performance changes significantly with the page size, but barely
> > with the max_seg_size, so this does not have the effect I was
> > hoping for. On a more positive note, this likely means that we
> > don't have to urgently backport your fix.
> >
> > This could mean that either there is not much coalescing across
> > pages after all, or that the bottleneck is somewhere else.
> >
> > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > > index 8e2d676b9239..cccd5633ff40 100644
> > > --- a/drivers/mmc/host/dw_mmc.c
> > > +++ b/drivers/mmc/host/dw_mmc.c
> > > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> > >         if (host->use_dma == TRANS_MODE_IDMAC) {
> > >                 mmc->max_segs = host->ring_size;
> > >                 mmc->max_blk_size = 65535;
> > > -               mmc->max_seg_size = 0x1000;
> > > -               mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > > +               mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > > +               mmc->max_seg_size = mmc->max_req_size;
> >
> > The change looks good to me.
> >
> > I see that host->ring_size depends on PAGE_SIZE as well:
> >
> >     #define DESC_RING_BUF_SZ PAGE_SIZE
> >     host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> >     host->sg_cpu = dmam_alloc_coherent(host->dev,
> >                        DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
> >
> > I don't see any reason for the ring buffer size to be tied to
> > PAGE_SIZE at all; it was probably picked as a reasonable default
> > in the initial driver but isn't necessarily ideal.
> >
> > From what I can see, the number of 4KB elements in the ring can be
> > as small as 128 (4KB pages, 64-bit addresses) or as big as 4096
> > (64KB pages, 32-bit addresses), which is quite a difference. If you
> > are still motivated to drill down into this, could you try changing
> > DESC_RING_BUF_SZ to a fixed size of either 4KB or 64KB and test
> > again with the opposite page size, to see if that changes the
> > throughput?
> >
>
> Hi Arnd,
>
> Sorry for the late reply. I'm a bit busy with something else right now
> (trying to enable this same driver for Exynos850 in U-Boot, hehe), so
> I'll try to carve out some time later and tinker with DESC_RING_BUF_SZ.
> But for now, can we just apply this patch as is? As I understand, it
> fixes quite a major issue (at least from what I heard), so it would be
> nice to have it in -next and -stable. Does that sound reasonable?

Ideally, I would prefer it if you could try out Arnd's proposal, unless
you think that's too far ahead for you, of course? The point is that I
would rather avoid us having to mess around with these kinds of
configurations.

> Thanks!
>
> > If a larger ring buffer gives us significantly better throughput,
> > we may want to always use a higher number independent of page size.
> > On the other hand, if the 64KB number (the 138MB/s) does not change
> > with a smaller ring, we may as well reduce that in order to limit
> > the maximum latency that is caused by a single I/O operation.
> >
> > Arnd

Kind regards
Uffe
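
For reference, a rough sketch of the experiment Arnd suggests (pinning
DESC_RING_BUF_SZ to a fixed size instead of PAGE_SIZE) could look like the
hunk below. It is untested, and the 64 KiB value is only one of the two
sizes he proposes (0x1000 would be the 4 KiB variant):

    --- a/drivers/mmc/host/dw_mmc.c
    +++ b/drivers/mmc/host/dw_mmc.c
    -#define DESC_RING_BUF_SZ	PAGE_SIZE
    +/* Fixed descriptor ring size, no longer tied to PAGE_SIZE (use 0x1000 to test 4 KiB) */
    +#define DESC_RING_BUF_SZ	0x10000

With a 64 KiB ring this gives 0x10000 / sizeof(struct idmac_desc_64addr) =
65536 / 32 = 2048 descriptors with 64-bit addressing, or 65536 / 16 = 4096
with 32-bit descriptors, compared to the 128 descriptors a 4K-page, 64-bit
configuration gets today.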