On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 10/03/23 19:06, Christian Löhle wrote:
> >
> >>>
> >>> I have benchmarked the FUA/Cache behavior a bit.
> >>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target, so I used:
> >>>
> >>> # call with
> >>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
> >>> mkfs.ext4 -F /dev/mmcblk2
> >>> mount /dev/mmcblk2 /mnt
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> rm -r /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>>
> >>>
> >>> I found a couple of DUTs that I can link, and I also tested one industrial card.
> >>>
> >>> DUT1: blue PCB Foresee eMMC
> >>> https://pine64.com/product/32gb-emmc-module/
> >>> DUT2: green PCB SiliconGo eMMC
> >>> Couldn't find that one online anymore unfortunately
> >>> DUT3: orange hardkernel PCB 8GB
> >>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> >>> DUT4: orange hardkernel PCB white dot
> >>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
> >>> DUT5: Industrial card
> >>
> >> Thanks a lot for helping out with testing! Much appreciated!
> >
> > No problem, glad to be of help.
> >
> >>
> >>>
> >>>
> >>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
> >>>
> >>> DUT1:
> >>> Cache, no FUA:
> >>> 13:04.49
> >>> 13:13.82
> >>> 13:30.59
> >>> 13:28.13
> >>> 13:20.64
> >>> FUA:
> >>> 13:30.32
> >>> 13:36.26
> >>> 13:10.86
> >>> 13:32.52
> >>> 13:48.59
> >>>
> >>> DUT2:
> >>> FUA:
> >>> 8:11.24
> >>> 7:47.73
> >>> 7:48.00
> >>> 7:48.18
> >>> 7:47.38
> >>> Cache, no FUA:
> >>> 8:10.30
> >>> 7:48.97
> >>> 7:48.47
> >>> 7:47.93
> >>> 7:44.18
> >>>
> >>> DUT3:
> >>> Cache, no FUA:
> >>> 7:02.82
> >>> 6:58.94
> >>> 7:03.20
> >>> 7:00.27
> >>> 7:00.88
> >>> FUA:
> >>> 7:05.43
> >>> 7:03.44
> >>> 7:04.82
> >>> 7:03.26
> >>> 7:04.74
> >>>
> >>> DUT4:
> >>> FUA:
> >>> 7:23.92
> >>> 7:20.15
> >>> 7:20.52
> >>> 7:19.10
> >>> 7:20.71
> >>> Cache, no FUA:
> >>> 7:20.23
> >>> 7:20.48
> >>> 7:19.94
> >>> 7:18.90
> >>> 7:19.88
> >>
> >> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there are good reasons why we should move forward with the $subject patch.
> >>
> >> Do you agree?
> >
> > That is a good question; that's why I just posted the data without further comment from my side.
> > I was honestly expecting the difference to be much higher, given the original patch.
> > If this is representative of most cards, you would require quite an unusual workload to actually notice the difference IMO.
> > If there are cards where the difference is much more significant, then of course a quirk would be nicer.
> > On the other hand, I don't see why not, and any improvement is a good one?
> >
> >>
> >>>
> >>> Cache, no FUA:
> >>> 7:19.36
> >>> 7:02.11
> >>> 7:01.53
> >>> 7:01.35
> >>> 7:00.37
> >>> Cache, no FUA CQE:
> >>> 7:17.55
> >>> 7:00.73
> >>> 6:59.25
> >>> 6:58.44
> >>> 6:58.60
> >>> FUA:
> >>> 7:15.10
> >>> 6:58.99
> >>> 6:58.94
> >>> 6:59.17
> >>> 6:60.00
> >>> FUA CQE:
> >>> 7:11.03
> >>> 6:58.04
> >>> 6:56.89
> >>> 6:56.43
> >>> 6:56.28
> >>>
> >>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
> >>
> >> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
> >
> > That is correct. I specifically tested with this card under the assumption that reliable write comes without much additional cost, so the DCMD would be slightly worse for performance and SYNC a bit worse.
> >
> >>
> >> For CQE, it seems like FUA could be even slightly better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> >
> > It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems; it could of course be worse if the workload was less sequential.
> >
> >>
> >> When it comes to CQE, maybe Adrian has some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> >
> > Sure, I'm also interested in Adrian's take on this.
>
> Testing an arbitrary system and looking only at individual I/Os,
> which may not be representative of any use-case, resulted in
> FUA always winning, see below.
>
> All values are approximate and in microseconds.
>
>                  With FUA                  Without FUA
>
> With CQE         Reliable Write  350       Write  125
>                                            Flush  300
>                  Total           350              425
>
> Without CQE      Reliable Write  350       Write  125
>                  CMD13           100       CMD13  100
>                                            Flush  300
>                                            CMD13  100
>                  Total           450              625
>
> FYI the test I was doing was:
>
> # cat test.sh
> #!/bin/sh
>
> echo "hi" > /mnt/mmc/hi.txt
>
> sync
>
>
> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
> # perf script --ns --deltatime
>
>
> The conclusion in this case would seem to be that CQE
> makes the case for removing FUA less bad.
>
> Perhaps CQE is more common in newer eMMCs, which in turn
> have better FUA implementations.

Very interesting data - and thanks for helping out with the tests!

A few observations and thoughts from the above.

1) A "normal" use case would probably include additional writes (regular
writes) and I guess that could impact the flushing behavior. Maybe the
flushing becomes less heavy if the device internally/occasionally needs to
flush its cache anyway? Or maybe it doesn't matter at all, because the
reliable writes trigger the cache to be flushed too.

2) Assuming that a reliable write triggers the internal cache to be flushed
too, then fewer commands need to be sent/acked to the eMMC compared to not
using FUA. This means less protocol overhead when using FUA - and perhaps
that's what your tests are actually telling us?

Kind regards
Uffe
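
For anyone following the DCMD vs SYNC distinction discussed above: the issue-type decision is made in mmc_cqe_issue_type() in drivers/mmc/core/queue.c. Below is a condensed paraphrase, not the verbatim in-tree code; some request types are omitted and details vary between kernel versions, so treat it as a sketch.

/* Condensed paraphrase of the CQE issue-type selection. */
static inline bool mmc_cqe_can_dcmd(struct mmc_host *host)
{
	/* DCMD is only used if the host driver advertises support for it. */
	return host->caps2 & MMC_CAP2_CQE_DCMD;
}

static enum mmc_issue_type mmc_cqe_issue_type(struct mmc_host *host,
					      struct request *req)
{
	switch (req_op(req)) {
	case REQ_OP_FLUSH:
		/*
		 * A cache flush goes out as a DCMD through the command queue
		 * when the host supports it; otherwise it is issued
		 * synchronously, outside the queue (MMC_ISSUE_SYNC).
		 */
		return mmc_cqe_can_dcmd(host) ? MMC_ISSUE_DCMD : MMC_ISSUE_SYNC;
	case REQ_OP_DRV_IN:
	case REQ_OP_DRV_OUT:
	case REQ_OP_DISCARD:
		return MMC_ISSUE_SYNC;
	default:
		/* Normal reads and writes, including FUA/reliable writes,
		 * are queued asynchronously through the CQE. */
		return MMC_ISSUE_ASYNC;
	}
}

So on a host that does not set MMC_CAP2_CQE_DCMD (as reported above for sdhci-of-arasan on rk3399), REQ_OP_FLUSH takes the synchronous path, which matches the "It is SYNC" answer in the thread.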