On 10/03/23 19:06, Christian Löhle wrote:
>
>>>
>>> I have benchmarked the FUA/Cache behavior a bit.
>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>>>
>>> # call with
>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
>>> mkfs.ext4 -F /dev/mmcblk2
>>> mount /dev/mmcblk2 /mnt
>>> for i in {1..3}
>>> do
>>> cp -r linux-6.2.2 /mnt/$i
>>> done
>>> for i in {1..3}
>>> do
>>> rm -r /mnt/$i
>>> done
>>> for i in {1..3}
>>> do
>>> cp -r linux-6.2.2 /mnt/$i
>>> done
>>>
>>>
>>> I found a couple of DUTs that I can link, I also tested one industrial card.
>>>
>>> DUT1: blue PCB Foresee eMMC
>>> https://pine64.com/product/32gb-emmc-module/
>>> DUT2: green PCB SiliconGo eMMC
>>> Couldn't find that one online anymore unfortunately
>>> DUT3: orange hardkernel PCB 8GB
>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>> DUT4: orange hardkernel PCB white dot
>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
>>> DUT5: Industrial card
>>
>> Thanks a lot for helping out with testing! Much appreciated!
>
> No problem, glad to be of help.
>
>>
>>>
>>>
>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>>>
>>> DUT1:
>>> Cache, no FUA:
>>> 13:04.49
>>> 13:13.82
>>> 13:30.59
>>> 13:28:13
>>> 13:20:64
>>> FUA:
>>> 13:30.32
>>> 13:36.26
>>> 13:10.86
>>> 13:32.52
>>> 13:48.59
>>>
>>> DUT2:
>>> FUA:
>>> 8:11.24
>>> 7:47.73
>>> 7:48.00
>>> 7:48.18
>>> 7:47.38
>>> Cache, no FUA:
>>> 8:10.30
>>> 7:48.97
>>> 7:48.47
>>> 7:47.93
>>> 7:44.18
>>>
>>> DUT3:
>>> Cache, no FUA:
>>> 7:02.82
>>> 6:58.94
>>> 7:03.20
>>> 7:00.27
>>> 7:00.88
>>> FUA:
>>> 7:05.43
>>> 7:03.44
>>> 7:04.82
>>> 7:03.26
>>> 7:04.74
>>>
>>> DUT4:
>>> FUA:
>>> 7:23.92
>>> 7:20.15
>>> 7:20.52
>>> 7:19.10
>>> 7:20.71
>>> Cache, no FUA:
>>> 7:20.23
>>> 7:20.48
>>> 7:19.94
>>> 7:18.90
>>> 7:19.88
>>
>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there are good reasons why we should move forward with $subject patch.
>>
>> Do you agree?
>
> That is a good question, that's why I just posted the data without further comment from my side.
> I was honestly expecting the difference to be much higher, given the original patch.
> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
> If there are cards where the difference is much more significant then of course a quirk would be nicer.
> On the other side I don't see why not and any improvement is a good one?
>
>>
>>>
>>> Cache, no FUA:
>>> 7:19.36
>>> 7:02.11
>>> 7:01.53
>>> 7:01.35
>>> 7:00.37
>>> Cache, no FUA CQE:
>>> 7:17.55
>>> 7:00.73
>>> 6:59.25
>>> 6:58.44
>>> 6:58.60
>>> FUA:
>>> 7:15.10
>>> 6:58.99
>>> 6:58.94
>>> 6:59.17
>>> 6:60.00
>>> FUA CQE:
>>> 7:11.03
>>> 6:58.04
>>> 6:56.89
>>> 6:56.43
>>> 6:56:28
>>>
>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>
>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>
> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>
>>
>> For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
>
>>
>> When it comes to CQE, maybe Adrian has some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> Sure, I'm also interested in Adrian's take on this.

Testing an arbitrary system and looking only at individual I/Os,
which may not be representative of any use-case, resulted in
FUA always winning, see below.

All values are approximate and in microseconds.

                With FUA                Without FUA

With CQE        Reliable Write  350     Write   125
                                        Flush   300
                Total           350             425

Without CQE     Reliable Write  350     Write   125
                CMD13           100     CMD13   100
                                        Flush   300
                                        CMD13   100
                Total           450             625

FYI the test I was doing was:

  # cat test.sh
  #!/bin/sh

  echo "hi" > /mnt/mmc/hi.txt

  sync

  # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
  # perf script --ns --deltatime

The conclusion in this case would seem to be that CQE
makes the case for removing FUA less bad.

Perhaps CQE is more common in newer eMMCs which in turn
have better FUA implementations.
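
For reference, the totals in the latency table above are plain sums of the per-command figures. The sketch below only redoes that arithmetic with the approximate microsecond values quoted in the table. The variable names and the sum_us helper are made up for illustration and are not taken from the kernel or from perf output.

```shell
#!/bin/sh
# Approximate per-command latencies in microseconds, copied from the table.
RELIABLE_WRITE=350
WRITE=125
FLUSH=300
CMD13=100   # listed after each command in the "Without CQE" rows

# sum_us (hypothetical helper): add up integer latencies, print the total.
sum_us() {
    total=0
    for v in "$@"; do
        total=$((total + v))
    done
    echo "$total"
}

# With CQE: no CMD13 entries appear in the table.
cqe_fua=$(sum_us "$RELIABLE_WRITE")
cqe_nofua=$(sum_us "$WRITE" "$FLUSH")

# Without CQE: one CMD13 follows every command.
nocqe_fua=$(sum_us "$RELIABLE_WRITE" "$CMD13")
nocqe_nofua=$(sum_us "$WRITE" "$CMD13" "$FLUSH" "$CMD13")

printf 'With CQE:    FUA %s us, no FUA %s us, difference %s us\n' \
    "$cqe_fua" "$cqe_nofua" "$((cqe_nofua - cqe_fua))"
printf 'Without CQE: FUA %s us, no FUA %s us, difference %s us\n' \
    "$nocqe_fua" "$nocqe_nofua" "$((nocqe_nofua - nocqe_fua))"
```

With these numbers the gap in favour of FUA is 75 us with CQE and 175 us without, which is the sense in which CQE makes removing FUA "less bad". This holds only for the single small write plus sync measured here and says nothing about the filesystem benchmark earlier in the thread.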