On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 10/03/23 19:06, Christian Löhle wrote:
> >
> >>>
> >>> I have benchmarked the FUA/Cache behavior a bit.
> >>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target, so I used:
> >>>
> >>> # call with
> >>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
> >>> mkfs.ext4 -F /dev/mmcblk2
> >>> mount /dev/mmcblk2 /mnt
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> rm -r /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>>
> >>>
> >>> I found a couple of DUTs that I can link, and I also tested one industrial card.
> >>>
> >>> DUT1: blue PCB Foresee eMMC
> >>> https://pine64.com/product/32gb-emmc-module/
> >>> DUT2: green PCB SiliconGo eMMC
> >>> Couldn't find that one online anymore unfortunately
> >>> DUT3: orange hardkernel PCB 8GB
> >>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> >>> DUT4: orange hardkernel PCB white dot
> >>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
> >>> DUT5: Industrial card
> >>
> >> Thanks a lot for helping out with testing! Much appreciated!
> >
> > No problem, glad to be of help.
> >
> >>
> >>>
> >>>
> >>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
> >>>
> >>> DUT1:
> >>> Cache, no FUA:
> >>> 13:04.49
> >>> 13:13.82
> >>> 13:30.59
> >>> 13:28.13
> >>> 13:20.64
> >>> FUA:
> >>> 13:30.32
> >>> 13:36.26
> >>> 13:10.86
> >>> 13:32.52
> >>> 13:48.59
> >>>
> >>> DUT2:
> >>> FUA:
> >>> 8:11.24
> >>> 7:47.73
> >>> 7:48.00
> >>> 7:48.18
> >>> 7:47.38
> >>> Cache, no FUA:
> >>> 8:10.30
> >>> 7:48.97
> >>> 7:48.47
> >>> 7:47.93
> >>> 7:44.18
> >>>
> >>> DUT3:
> >>> Cache, no FUA:
> >>> 7:02.82
> >>> 6:58.94
> >>> 7:03.20
> >>> 7:00.27
> >>> 7:00.88
> >>> FUA:
> >>> 7:05.43
> >>> 7:03.44
> >>> 7:04.82
> >>> 7:03.26
> >>> 7:04.74
> >>>
> >>> DUT4:
> >>> FUA:
> >>> 7:23.92
> >>> 7:20.15
> >>> 7:20.52
> >>> 7:19.10
> >>> 7:20.71
> >>> Cache, no FUA:
> >>> 7:20.23
> >>> 7:20.48
> >>> 7:19.94
> >>> 7:18.90
> >>> 7:19.88
> >>
> >> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there are good reasons why we should move forward with the $subject patch.
> >>
> >> Do you agree?
> >
> > That is a good question; that's why I just posted the data without further comment from my side.
> > I was honestly expecting the difference to be much higher, given the original patch.
> > If this is representative of most cards, you would require quite an unusual workload to actually notice the difference IMO.
> > If there are cards where the difference is much more significant, then of course a quirk would be nicer.
> > On the other hand, I don't see why not, and any improvement is a good one?
> >
> >>
> >>>
> >>> Cache, no FUA:
> >>> 7:19.36
> >>> 7:02.11
> >>> 7:01.53
> >>> 7:01.35
> >>> 7:00.37
> >>> Cache, no FUA CQE:
> >>> 7:17.55
> >>> 7:00.73
> >>> 6:59.25
> >>> 6:58.44
> >>> 6:58.60
> >>> FUA:
> >>> 7:15.10
> >>> 6:58.99
> >>> 6:58.94
> >>> 6:59.17
> >>> 6:60.00
> >>> FUA CQE:
> >>> 7:11.03
> >>> 6:58.04
> >>> 6:56.89
> >>> 6:56.43
> >>> 6:56.28
> >>>
> >>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
> >>
> >> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
> >
> > That is correct. I specifically tested with this card under the assumption that reliable write comes without much additional cost, so the DCMD would be slightly worse for performance and SYNC a bit worse.
> >
> >>
> >> For CQE, it seems like FUA could be even slightly better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> >
> > It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems; it could of course be worse if the workload was less sequential.
> >
> >>
> >> When it comes to CQE, maybe Adrian has some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> >
> > Sure, I'm also interested in Adrian's take on this.
>
> Testing an arbitrary system and looking only at individual I/Os,
> which may not be representative of any use-case, resulted in
> FUA always winning, see below.
>
> All values are approximate and in microseconds.
>
>                  With FUA                  Without FUA
>
> With CQE         Reliable Write  350       Write  125
>                                            Flush  300
>                  Total           350              425
>
> Without CQE      Reliable Write  350       Write  125
>                  CMD13           100       CMD13  100
>                                            Flush  300
>                                            CMD13  100
>                  Total           450              625
>
> FYI the test I was doing was:
>
> # cat test.sh
> #!/bin/sh
>
> echo "hi" > /mnt/mmc/hi.txt
>
> sync
>
>
> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
> # perf script --ns --deltatime
>
>
> The conclusion in this case would seem to be that CQE
> makes the case for removing FUA less bad.
>
> Perhaps CQE is more common in newer eMMCs, which in turn
> have better FUA implementations.

Very interesting data - and thanks for helping out with the tests!

A few observations and thoughts from the above.

1) A "normal" use case would probably include additional writes (regular
writes) and I guess that could impact the flushing behavior. Maybe the
flushing becomes less heavy if the device internally/occasionally needs to
flush its cache anyway? Or maybe it doesn't matter at all, because the
reliable writes trigger the cache to be flushed too.

2) Assuming that a reliable write triggers the internal cache to be flushed
too, then fewer commands need to be sent/acked to the eMMC compared to not
using FUA. This means less protocol overhead when using FUA - and perhaps
that's what your tests are actually telling us?

Kind regards
Uffe
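
For anyone following the DCMD vs SYNC distinction discussed above: the issue-type decision is made in mmc_cqe_issue_type() in drivers/mmc/core/queue.c. Below is a condensed paraphrase, not the verbatim in-tree code; some request types are omitted and details vary between kernel versions, so treat it as a sketch.

/* Condensed paraphrase of the CQE issue-type selection. */
static inline bool mmc_cqe_can_dcmd(struct mmc_host *host)
{
	/* DCMD is only used if the host driver advertises support for it. */
	return host->caps2 & MMC_CAP2_CQE_DCMD;
}

static enum mmc_issue_type mmc_cqe_issue_type(struct mmc_host *host,
					      struct request *req)
{
	switch (req_op(req)) {
	case REQ_OP_FLUSH:
		/*
		 * A cache flush goes out as a DCMD through the command queue
		 * when the host supports it; otherwise it is issued
		 * synchronously, outside the queue (MMC_ISSUE_SYNC).
		 */
		return mmc_cqe_can_dcmd(host) ? MMC_ISSUE_DCMD : MMC_ISSUE_SYNC;
	case REQ_OP_DRV_IN:
	case REQ_OP_DRV_OUT:
	case REQ_OP_DISCARD:
		return MMC_ISSUE_SYNC;
	default:
		/* Normal reads and writes, including FUA/reliable writes,
		 * are queued asynchronously through the CQE. */
		return MMC_ISSUE_ASYNC;
	}
}

So on a host that does not set MMC_CAP2_CQE_DCMD (as reported above for sdhci-of-arasan on rk3399), REQ_OP_FLUSH takes the synchronous path, which matches the "It is SYNC" answer in the thread.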