On 16/03/23 14:12, Ulf Hansson wrote:
> On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> On 14/03/23 09:56, Ulf Hansson wrote:
>>> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>>>
>>>> On 10/03/23 19:06, Christian Löhle wrote:
>>>>>
>>>>>>>
>>>>>>> I have benchmarked the FUA/Cache behavior a bit.
>>>>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target, so I used:
>>>>>>>
>>>>>>> # call with
>>>>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
>>>>>>> mkfs.ext4 -F /dev/mmcblk2
>>>>>>> mount /dev/mmcblk2 /mnt
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     rm -r /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>>
>>>>>>> I found a couple of DUTs that I can link; I also tested one industrial card.
>>>>>>>
>>>>>>> DUT1: blue PCB Foresee eMMC
>>>>>>> https://pine64.com/product/32gb-emmc-module/
>>>>>>> DUT2: green PCB SiliconGo eMMC
>>>>>>> Couldn't find that one online anymore, unfortunately
>>>>>>> DUT3: orange hardkernel PCB 8GB
>>>>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>>>>>> DUT4: orange hardkernel PCB white dot
>>>>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
>>>>>>> DUT5: Industrial card
>>>>>>
>>>>>> Thanks a lot for helping out with testing! Much appreciated!
>>>>>
>>>>> No problem, glad to be of help.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5.
>>>>>>>
>>>>>>> DUT1:
>>>>>>> Cache, no FUA:
>>>>>>> 13:04.49
>>>>>>> 13:13.82
>>>>>>> 13:30.59
>>>>>>> 13:28.13
>>>>>>> 13:20.64
>>>>>>> FUA:
>>>>>>> 13:30.32
>>>>>>> 13:36.26
>>>>>>> 13:10.86
>>>>>>> 13:32.52
>>>>>>> 13:48.59
>>>>>>>
>>>>>>> DUT2:
>>>>>>> FUA:
>>>>>>> 8:11.24
>>>>>>> 7:47.73
>>>>>>> 7:48.00
>>>>>>> 7:48.18
>>>>>>> 7:47.38
>>>>>>> Cache, no FUA:
>>>>>>> 8:10.30
>>>>>>> 7:48.97
>>>>>>> 7:48.47
>>>>>>> 7:47.93
>>>>>>> 7:44.18
>>>>>>>
>>>>>>> DUT3:
>>>>>>> Cache, no FUA:
>>>>>>> 7:02.82
>>>>>>> 6:58.94
>>>>>>> 7:03.20
>>>>>>> 7:00.27
>>>>>>> 7:00.88
>>>>>>> FUA:
>>>>>>> 7:05.43
>>>>>>> 7:03.44
>>>>>>> 7:04.82
>>>>>>> 7:03.26
>>>>>>> 7:04.74
>>>>>>>
>>>>>>> DUT4:
>>>>>>> FUA:
>>>>>>> 7:23.92
>>>>>>> 7:20.15
>>>>>>> 7:20.52
>>>>>>> 7:19.10
>>>>>>> 7:20.71
>>>>>>> Cache, no FUA:
>>>>>>> 7:20.23
>>>>>>> 7:20.48
>>>>>>> 7:19.94
>>>>>>> 7:18.90
>>>>>>> 7:19.88
>>>>>>
>>>>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there are good reasons why we should move forward with the $subject patch.
>>>>>>
>>>>>> Do you agree?
>>>>>
>>>>> That is a good question; that's why I just posted the data without further comment from my side.
>>>>> I was honestly expecting the difference to be much higher, given the original patch.
>>>>> If this is representative of most cards, you would need quite an unusual workload to actually notice the difference, IMO.
>>>>> If there are cards where the difference is much more significant, then of course a quirk would be nicer.
>>>>> On the other hand, I don't see why not - any improvement is a good one?
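[For benchmark runs like the above, it can be worth first confirming how the kernel currently advertises the device's cache and FUA support. A minimal check, assuming the standard block-layer sysfs attributes and the /dev/mmcblk2 device name from the benchmark script:

# "write back" means the kernel treats the device as having a volatile
# cache and will issue flushes for it; "write through" means it won't.
cat /sys/block/mmcblk2/queue/write_cache
# 1 if the driver advertises FUA support (reliable write on eMMC), else 0.
cat /sys/block/mmcblk2/queue/fua

This makes it easy to verify that toggling FUA between test runs actually took effect.]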
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cache, no FUA:
>>>>>>> 7:19.36
>>>>>>> 7:02.11
>>>>>>> 7:01.53
>>>>>>> 7:01.35
>>>>>>> 7:00.37
>>>>>>> Cache, no FUA CQE:
>>>>>>> 7:17.55
>>>>>>> 7:00.73
>>>>>>> 6:59.25
>>>>>>> 6:58.44
>>>>>>> 6:58.60
>>>>>>> FUA:
>>>>>>> 7:15.10
>>>>>>> 6:58.99
>>>>>>> 6:58.94
>>>>>>> 6:59.17
>>>>>>> 6:60.00
>>>>>>> FUA CQE:
>>>>>>> 7:11.03
>>>>>>> 6:58.04
>>>>>>> 6:56.89
>>>>>>> 6:56.43
>>>>>>> 6:56.28
>>>>>>>
>>>>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>>>>>
>>>>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>>>>>
>>>>> That is correct. I specifically tested with this card because, under the assumption that reliable write comes without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>>>>>
>>>>>>
>>>>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here, it seems; it could of course be worse if the workload were less sequential.
>>>>>
>>>>>>
>>>>>> When it comes to CQE, maybe Adrian has some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>>>> Sure, I'm also interested in Adrian's take on this.
>>>>
>>>> Testing an arbitrary system and looking only at individual I/Os,
>>>> which may not be representative of any use-case, resulted in
>>>> FUA always winning, see below.
>>>>
>>>> All values are approximate and in microseconds.
>>>>
>>>>                With FUA             Without FUA
>>>>
>>>> With CQE      Reliable Write  350   Write  125
>>>>                                     Flush  300
>>>>               Total           350   Total  425
>>>>
>>>> Without CQE   Reliable Write  350   Write  125
>>>>               CMD13           100   CMD13  100
>>>>                                     Flush  300
>>>>                                     CMD13  100
>>>>               Total           450   Total  625
>>>>
>>>> FYI the test I was doing was:
>>>>
>>>> # cat test.sh
>>>> #!/bin/sh
>>>>
>>>> echo "hi" > /mnt/mmc/hi.txt
>>>>
>>>> sync
>>>>
>>>>
>>>> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>>>> # perf script --ns --deltatime
>>>>
>>>>
>>>> The conclusion in this case would seem to be that CQE
>>>> makes the case for removing FUA less bad.
>>>>
>>>> Perhaps CQE is more common in newer eMMCs, which in turn
>>>> have better FUA implementations.
>>>
>>> Very interesting data - and thanks for helping out with tests!
>>>
>>> A few observations and thoughts from the above.
>>>
>>> 1)
>>> A "normal" use case would probably include additional writes (regular
>>> writes) and I guess that could impact the flushing behavior. Maybe the
>>> flushing becomes less heavy, if the device internally/occasionally
>>> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
>>> because the reliable writes are triggering the cache to be flushed
>>> too.
>>
>> The sync is presumably causing an EXT4 journal commit, which
>> seems to use REQ_PREFLUSH and REQ_FUA. That is:
>>     Flush (the journal to media)
>>     Write (the commit record) (FUA)
>> So it does a flush anyway. The no-FUA case is:
>>     Flush (the journal to media)
>>     Write (the commit record)
>>     Flush (the commit record)
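[The flush/FUA pattern described here can also be watched at the block layer while test.sh runs. A sketch using blktrace, assuming the /dev/mmcblk2 device from the earlier benchmark; in the blkparse RWBS column a leading 'F' marks a PREFLUSH and an 'F' after the op letter marks FUA, e.g. "WFS" for a sync FUA write:

# Trace block requests on the eMMC; run ./test.sh from another shell
# and look for the flush ("F...") and FUA write ("WF...") entries.
blktrace -d /dev/mmcblk2 -o - | blkparse -i -

This shows the journal commit either as one FUA write, or as a plain write followed by a separate flush, matching the two sequences above.]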
>>>
>>> 2)
>>> Assuming that a reliable write is triggering the internal cache to be
>>> flushed too, then we need fewer commands to be sent/acked to the
>>> eMMC - compared to not using FUA. This means less protocol
>>> overhead when using FUA - and perhaps that's what your tests are
>>> actually telling us?
>>
>> There is definitely less protocol overhead, because the no-FUA
>> case has to do an extra CMD6 (flush) and CMD13.
>>
>> Note also, in this case auto-CMD23 is being used, which is why
>> it is not listed.
>>
>> Using an older system (no CQE, but also auto-CMD23) resulted
>> in a win for no-FUA:
>>
>>           With FUA               Without FUA
>>
>>           Reliable Write  1200   Write   850
>>           CMD13            100   CMD13   100
>>                                  Flush   120
>>                                  CMD13    65
>>           Total           1300   Total  1135
>>
>
> Alright, so it seems like just checking whether the cache control
> feature is available isn't sufficient when deciding to avoid FUA.
>
> That said, in the next version I am going to add a card quirk, which
> needs to be set too, to avoid FUA. Then we can see which cards it
> should actually be set for.
>
> What eMMC cards did you use in your tests above?

I don't know the part numbers.

          Manu ID          Name    Date     Machine
Newer:    0x11 (Toshiba)   064G30  11/2019  Jasper Lake
Older:    0x45             SEM128  07/2014  Lenovo Thinkpad 10
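[For anyone reporting the same identifiers for the proposed quirk list, the CID fields are exported through sysfs. A sketch, assuming the card enumerates as mmc0:0001; the host number and RCA in the path vary per board:

# Manufacturer ID, OEM ID, product name and manufacturing date,
# as parsed from the card's CID register.
cd /sys/class/mmc_host/mmc0/mmc0:0001
cat manfid oemid name date

These are the same manfid/name/date values quoted above, so results from different boards can be matched against any eventual quirk entries.]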