On 16/03/23 14:12, Ulf Hansson wrote:
> On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> On 14/03/23 09:56, Ulf Hansson wrote:
>>> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>>>
>>>> On 10/03/23 19:06, Christian Löhle wrote:
>>>>>
>>>>>>>
>>>>>>> I have benchmarked the FUA/Cache behavior a bit.
>>>>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target, so I used:
>>>>>>>
>>>>>>> # call with
>>>>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
>>>>>>> mkfs.ext4 -F /dev/mmcblk2
>>>>>>> mount /dev/mmcblk2 /mnt
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     rm -r /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>>
>>>>>>> I found a couple of DUTs that I can link; I also tested one industrial card.
>>>>>>>
>>>>>>> DUT1: blue PCB Foresee eMMC
>>>>>>> https://pine64.com/product/32gb-emmc-module/
>>>>>>> DUT2: green PCB SiliconGo eMMC
>>>>>>> Couldn't find that one online anymore, unfortunately
>>>>>>> DUT3: orange hardkernel PCB 8GB
>>>>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>>>>>> DUT4: orange hardkernel PCB white dot
>>>>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
>>>>>>> DUT5: Industrial card
>>>>>>
>>>>>> Thanks a lot for helping out with testing! Much appreciated!
>>>>>
>>>>> No problem, glad to be of help.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5.
>>>>>>>
>>>>>>> DUT1:
>>>>>>> Cache, no FUA:
>>>>>>> 13:04.49
>>>>>>> 13:13.82
>>>>>>> 13:30.59
>>>>>>> 13:28.13
>>>>>>> 13:20.64
>>>>>>> FUA:
>>>>>>> 13:30.32
>>>>>>> 13:36.26
>>>>>>> 13:10.86
>>>>>>> 13:32.52
>>>>>>> 13:48.59
>>>>>>>
>>>>>>> DUT2:
>>>>>>> FUA:
>>>>>>> 8:11.24
>>>>>>> 7:47.73
>>>>>>> 7:48.00
>>>>>>> 7:48.18
>>>>>>> 7:47.38
>>>>>>> Cache, no FUA:
>>>>>>> 8:10.30
>>>>>>> 7:48.97
>>>>>>> 7:48.47
>>>>>>> 7:47.93
>>>>>>> 7:44.18
>>>>>>>
>>>>>>> DUT3:
>>>>>>> Cache, no FUA:
>>>>>>> 7:02.82
>>>>>>> 6:58.94
>>>>>>> 7:03.20
>>>>>>> 7:00.27
>>>>>>> 7:00.88
>>>>>>> FUA:
>>>>>>> 7:05.43
>>>>>>> 7:03.44
>>>>>>> 7:04.82
>>>>>>> 7:03.26
>>>>>>> 7:04.74
>>>>>>>
>>>>>>> DUT4:
>>>>>>> FUA:
>>>>>>> 7:23.92
>>>>>>> 7:20.15
>>>>>>> 7:20.52
>>>>>>> 7:19.10
>>>>>>> 7:20.71
>>>>>>> Cache, no FUA:
>>>>>>> 7:20.23
>>>>>>> 7:20.48
>>>>>>> 7:19.94
>>>>>>> 7:18.90
>>>>>>> 7:19.88
>>>>>>
>>>>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there are good reasons why we should move forward with the $subject patch.
>>>>>>
>>>>>> Do you agree?
>>>>>
>>>>> That is a good question; that's why I just posted the data without further comment from my side.
>>>>> I was honestly expecting the difference to be much higher, given the original patch.
>>>>> If this is representative of most cards, you would need quite an unusual workload to actually notice the difference, IMO.
>>>>> If there are cards where the difference is much more significant, then of course a quirk would be nicer.
>>>>> On the other hand, I don't see why not - any improvement is a good one?
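[For benchmark runs like the above, it can be worth first confirming how the kernel currently advertises the device's cache and FUA support. A minimal check, assuming the standard block-layer sysfs attributes and the /dev/mmcblk2 device name from the benchmark script:

# "write back" means the kernel treats the device as having a volatile
# cache and will issue flushes for it; "write through" means it won't.
cat /sys/block/mmcblk2/queue/write_cache
# 1 if the driver advertises FUA support (reliable write on eMMC), else 0.
cat /sys/block/mmcblk2/queue/fua

This makes it easy to verify that toggling FUA between test runs actually took effect.]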
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cache, no FUA:
>>>>>>> 7:19.36
>>>>>>> 7:02.11
>>>>>>> 7:01.53
>>>>>>> 7:01.35
>>>>>>> 7:00.37
>>>>>>> Cache, no FUA CQE:
>>>>>>> 7:17.55
>>>>>>> 7:00.73
>>>>>>> 6:59.25
>>>>>>> 6:58.44
>>>>>>> 6:58.60
>>>>>>> FUA:
>>>>>>> 7:15.10
>>>>>>> 6:58.99
>>>>>>> 6:58.94
>>>>>>> 6:59.17
>>>>>>> 6:60.00
>>>>>>> FUA CQE:
>>>>>>> 7:11.03
>>>>>>> 6:58.04
>>>>>>> 6:56.89
>>>>>>> 6:56.43
>>>>>>> 6:56.28
>>>>>>>
>>>>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>>>>>
>>>>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>>>>>
>>>>> That is correct. I specifically tested with this card because, under the assumption that reliable write comes without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>>>>>
>>>>>>
>>>>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here, it seems; it could of course be worse if the workload were less sequential.
>>>>>
>>>>>>
>>>>>> When it comes to CQE, maybe Adrian has some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>>>> Sure, I'm also interested in Adrian's take on this.
>>>>
>>>> Testing an arbitrary system and looking only at individual I/Os,
>>>> which may not be representative of any use-case, resulted in
>>>> FUA always winning, see below.
>>>>
>>>> All values are approximate and in microseconds.
>>>>
>>>>                With FUA             Without FUA
>>>>
>>>> With CQE      Reliable Write  350   Write  125
>>>>                                     Flush  300
>>>>               Total           350   Total  425
>>>>
>>>> Without CQE   Reliable Write  350   Write  125
>>>>               CMD13           100   CMD13  100
>>>>                                     Flush  300
>>>>                                     CMD13  100
>>>>               Total           450   Total  625
>>>>
>>>> FYI the test I was doing was:
>>>>
>>>> # cat test.sh
>>>> #!/bin/sh
>>>>
>>>> echo "hi" > /mnt/mmc/hi.txt
>>>>
>>>> sync
>>>>
>>>>
>>>> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>>>> # perf script --ns --deltatime
>>>>
>>>>
>>>> The conclusion in this case would seem to be that CQE
>>>> makes the case for removing FUA less bad.
>>>>
>>>> Perhaps CQE is more common in newer eMMCs, which in turn
>>>> have better FUA implementations.
>>>
>>> Very interesting data - and thanks for helping out with tests!
>>>
>>> A few observations and thoughts from the above.
>>>
>>> 1)
>>> A "normal" use case would probably include additional writes (regular
>>> writes) and I guess that could impact the flushing behavior. Maybe the
>>> flushing becomes less heavy, if the device internally/occasionally
>>> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
>>> because the reliable writes are triggering the cache to be flushed
>>> too.
>>
>> The sync is presumably causing an EXT4 journal commit, which
>> seems to use REQ_PREFLUSH and REQ_FUA. That is:
>>     Flush (the journal to media)
>>     Write (the commit record) (FUA)
>> So it does a flush anyway. The no-FUA case is:
>>     Flush (the journal to media)
>>     Write (the commit record)
>>     Flush (the commit record)
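[The flush/FUA pattern described here can also be watched at the block layer while test.sh runs. A sketch using blktrace, assuming the /dev/mmcblk2 device from the earlier benchmark; in the blkparse RWBS column a leading 'F' marks a PREFLUSH and an 'F' after the op letter marks FUA, e.g. "WFS" for a sync FUA write:

# Trace block requests on the eMMC; run ./test.sh from another shell
# and look for the flush ("F...") and FUA write ("WF...") entries.
blktrace -d /dev/mmcblk2 -o - | blkparse -i -

This shows the journal commit either as one FUA write, or as a plain write followed by a separate flush, matching the two sequences above.]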
>>>
>>> 2)
>>> Assuming that a reliable write is triggering the internal cache to be
>>> flushed too, then we need fewer commands to be sent/acked to the
>>> eMMC - compared to not using FUA. This means less protocol
>>> overhead when using FUA - and perhaps that's what your tests are
>>> actually telling us?
>>
>> There is definitely less protocol overhead, because the no-FUA
>> case has to do an extra CMD6 (flush) and CMD13.
>>
>> Note also, in this case auto-CMD23 is being used, which is why
>> it is not listed.
>>
>> Using an older system (no CQE, but also auto-CMD23) resulted
>> in a win for no-FUA:
>>
>>           With FUA               Without FUA
>>
>>           Reliable Write  1200   Write   850
>>           CMD13            100   CMD13   100
>>                                  Flush   120
>>                                  CMD13    65
>>           Total           1300   Total  1135
>>
>
> Alright, so it seems like just checking whether the cache control
> feature is available isn't sufficient when deciding to avoid FUA.
>
> That said, in the next version I am going to add a card quirk, which
> needs to be set too, to avoid FUA. Then we can see which cards it
> should actually be set for.
>
> What eMMC cards did you use in your tests above?

I don't know the part numbers.

          Manu ID          Name    Date     Machine
Newer:    0x11 (Toshiba)   064G30  11/2019  Jasper Lake
Older:    0x45             SEM128  07/2014  Lenovo Thinkpad 10
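[For anyone reporting the same identifiers for the proposed quirk list, the CID fields are exported through sysfs. A sketch, assuming the card enumerates as mmc0:0001; the host number and RCA in the path vary per board:

# Manufacturer ID, OEM ID, product name and manufacturing date,
# as parsed from the card's CID register.
cd /sys/class/mmc_host/mmc0/mmc0:0001
cat manfid oemid name date

These are the same manfid/name/date values quoted above, so results from different boards can be matched against any eventual quirk entries.]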