On 14/03/23 09:56, Ulf Hansson wrote:
> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> On 10/03/23 19:06, Christian Löhle wrote:
>>>
>>>>>
>>>>> I have benchmarked the FUA/cache behavior a bit.
>>>>> I don't have an actual filesystem benchmark that does what I wanted
>>>>> and is easy to port to the target, so I used:
>>>>>
>>>>> # call with
>>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done
>>>>> # for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
>>>>> mkfs.ext4 -F /dev/mmcblk2
>>>>> mount /dev/mmcblk2 /mnt
>>>>> for i in {1..3}
>>>>> do
>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>> done
>>>>> for i in {1..3}
>>>>> do
>>>>>     rm -r /mnt/$i
>>>>> done
>>>>> for i in {1..3}
>>>>> do
>>>>>     cp -r linux-6.2.2 /mnt/$i
>>>>> done
>>>>>
>>>>>
>>>>> I found a couple of DUTs that I can link; I also tested one industrial card.
>>>>>
>>>>> DUT1: blue PCB Foresee eMMC
>>>>> https://pine64.com/product/32gb-emmc-module/
>>>>> DUT2: green PCB SiliconGo eMMC
>>>>> Couldn't find that one online anymore, unfortunately
>>>>> DUT3: orange hardkernel PCB 8GB
>>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>>>> DUT4: orange hardkernel PCB white dot
>>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
>>>>> DUT5: Industrial card
>>>>
>>>> Thanks a lot for helping out with testing! Much appreciated!
>>>
>>> No problem, glad to be of help.
>>>
>>>>
>>>>>
>>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5.
>>>>>
>>>>> DUT1:
>>>>> Cache, no FUA:
>>>>> 13:04.49
>>>>> 13:13.82
>>>>> 13:30.59
>>>>> 13:28.13
>>>>> 13:20.64
>>>>> FUA:
>>>>> 13:30.32
>>>>> 13:36.26
>>>>> 13:10.86
>>>>> 13:32.52
>>>>> 13:48.59
>>>>>
>>>>> DUT2:
>>>>> FUA:
>>>>> 8:11.24
>>>>> 7:47.73
>>>>> 7:48.00
>>>>> 7:48.18
>>>>> 7:47.38
>>>>> Cache, no FUA:
>>>>> 8:10.30
>>>>> 7:48.97
>>>>> 7:48.47
>>>>> 7:47.93
>>>>> 7:44.18
>>>>>
>>>>> DUT3:
>>>>> Cache, no FUA:
>>>>> 7:02.82
>>>>> 6:58.94
>>>>> 7:03.20
>>>>> 7:00.27
>>>>> 7:00.88
>>>>> FUA:
>>>>> 7:05.43
>>>>> 7:03.44
>>>>> 7:04.82
>>>>> 7:03.26
>>>>> 7:04.74
>>>>>
>>>>> DUT4:
>>>>> FUA:
>>>>> 7:23.92
>>>>> 7:20.15
>>>>> 7:20.52
>>>>> 7:19.10
>>>>> 7:20.71
>>>>> Cache, no FUA:
>>>>> 7:20.23
>>>>> 7:20.48
>>>>> 7:19.94
>>>>> 7:18.90
>>>>> 7:19.88
>>>>
>>>> Without going into the details of the above, it seems like for DUT1,
>>>> DUT2, DUT3 and DUT4 there are good reasons why we should move forward
>>>> with the $subject patch.
>>>>
>>>> Do you agree?
>>>
>>> That is a good question; that's why I just posted the data without
>>> further comment from my side.
>>> I was honestly expecting the difference to be much bigger, given the
>>> original patch.
>>> If this is representative of most cards, you would need quite an
>>> unusual workload to actually notice the difference, IMO.
>>> If there are cards where the difference is much more significant, then
>>> of course a quirk would be nicer.
>>> On the other hand, I don't see why not - any improvement is a good one?
>>>
>>>>
>>>>> DUT5:
>>>>> Cache, no FUA:
>>>>> 7:19.36
>>>>> 7:02.11
>>>>> 7:01.53
>>>>> 7:01.35
>>>>> 7:00.37
>>>>> Cache, no FUA, CQE:
>>>>> 7:17.55
>>>>> 7:00.73
>>>>> 6:59.25
>>>>> 6:58.44
>>>>> 6:58.60
>>>>> FUA:
>>>>> 7:15.10
>>>>> 6:58.99
>>>>> 6:58.94
>>>>> 6:59.17
>>>>> 6:60.00
>>>>> FUA, CQE:
>>>>> 7:11.03
>>>>> 6:58.04
>>>>> 6:56.89
>>>>> 6:56.43
>>>>> 6:56.28
>>>>>
>>>>> If anyone has any comments or disagrees with the benchmark, or has a
>>>>> specific eMMC to test, let me know.
>>>>
>>>> If I understand correctly, for DUT5, it seems like using FUA may be
>>>> slightly better than just cache-flushing, right?
>>>
>>> That is correct. I specifically tested with this card because, under
>>> the assumption that reliable write comes without much additional cost,
>>> the DCMD would be slightly worse for performance, and SYNC a bit worse
>>> still.
>>>
>>>>
>>>> For CQE, it seems like FUA could be even slightly better, at least
>>>> for DUT5. Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD
>>>> or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC
>>> is not too bad here, it seems; it could of course be worse if the
>>> workload were less sequential.
>>>
>>>>
>>>> When it comes to CQE, maybe Adrian has some additional thoughts
>>>> around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>> Sure, I'm also interested in Adrian's take on this.
>>
>> Testing an arbitrary system and looking only at individual I/Os,
>> which may not be representative of any use-case, resulted in
>> FUA always winning, see below.
>>
>> All values are approximate and in microseconds.
>>
>>                                With FUA           Without FUA
>>
>> With CQE       Reliable Write   350        Write   125
>>                                            Flush   300
>>                Total            350        Total   425
>>
>> Without CQE    Reliable Write   350        Write   125
>>                CMD13            100        CMD13   100
>>                                            Flush   300
>>                                            CMD13   100
>>                Total            450        Total   625
>>
>> FYI the test I was doing was:
>>
>> # cat test.sh
>> #!/bin/sh
>>
>> echo "hi" > /mnt/mmc/hi.txt
>>
>> sync
>>
>>
>> # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>> # perf script --ns --deltatime
>>
>>
>> The conclusion in this case would seem to be that CQE
>> makes the case for removing FUA less bad.
>>
>> Perhaps CQE is more common in newer eMMCs, which in turn
>> have better FUA implementations.
>
> Very interesting data - and thanks for helping out with tests!
>
> A few observations and thoughts from the above.
>
> 1)
> A "normal" use case would probably include additional writes (regular
> writes), and I guess that could impact the flushing behavior. Maybe the
> flushing becomes less heavy if the device internally/occasionally
> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
> because the reliable writes are triggering the cache to be flushed
> too.

The sync is presumably causing an EXT4 journal commit, which seems to
use REQ_PREFLUSH and REQ_FUA.  That is:

  Flush (the journal to media)
  Write (the commit record) (FUA)

So it does a flush anyway.  The no-FUA case is:

  Flush (the journal to media)
  Write (the commit record)
  Flush (the commit record)

(There is a rough bio-level sketch of these two cases at the end of
this mail.)

> 2)
> Assuming that a reliable write is triggering the internal cache to be
> flushed too, then we need fewer commands to be sent/acked to the eMMC
> - compared to not using FUA. This means less protocol overhead when
> using FUA - and perhaps that's what your tests are actually telling
> us?

There is definitely less protocol overhead, because the no-FUA case has
to do an extra CMD6 (flush) and CMD13.  Note also that, in this case,
auto-CMD23 is being used, which is why it is not listed.

Using an older system (no CQE, but also with auto-CMD23) resulted in a
win for no-FUA:

                   With FUA           Without FUA

  Reliable Write    1200        Write    850
  CMD13              100        CMD13    100
                                Flush    120
                                CMD13     65
  Total             1300        Total   1135
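
For anyone following along, the two commit sequences above can be
written down at the bio level. This is only a minimal sketch, not code
from any filesystem: submit_commit_record() is a made-up helper and
completion handling is omitted. REQ_PREFLUSH asks the block layer to
flush the device cache before the write; REQ_FUA asks for the write
itself to reach media. If the driver did not advertise FUA via
blk_queue_write_cache(), the block layer emulates REQ_FUA with a
post-flush - which is exactly the extra Flush in the no-FUA case.

  #include <linux/bio.h>
  #include <linux/blkdev.h>

  /* Hypothetical helper: write one journal-commit-style record. */
  static void submit_commit_record(struct block_device *bdev,
                                   struct page *page)
  {
          /* Flush the cache, write the record, force it to media. */
          struct bio *bio = bio_alloc(bdev, 1,
                                      REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA,
                                      GFP_NOIO);

          __bio_add_page(bio, page, PAGE_SIZE, 0);
          submit_bio(bio);        /* endio/completion handling omitted */
  }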
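
Regarding the earlier DCMD vs SYNC question: from memory, the mapping
in mmc_cqe_issue_type() (drivers/mmc/core/queue.c) looks roughly like
the sketch below - simplified, with the non-flush special cases elided,
so check the actual source. Hosts without MMC_CAP2_CQE_DCMD, like the
rk3399 case above, fall back to MMC_ISSUE_SYNC for flushes.

  static enum mmc_issue_type mmc_cqe_issue_type(struct mmc_host *host,
                                                struct request *req)
  {
          switch (req_op(req)) {
          case REQ_OP_FLUSH:
                  /* DCMD only if the host advertises MMC_CAP2_CQE_DCMD */
                  return mmc_cqe_can_dcmd(host) ? MMC_ISSUE_DCMD :
                                                  MMC_ISSUE_SYNC;
          default:
                  /* normal reads/writes (other cases elided here) */
                  return MMC_ISSUE_ASYNC;
          }
  }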
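
And for completeness, the knob the $subject patch is about: mmc block
decides what to advertise to the block layer with
blk_queue_write_cache(q, cache_enabled, fua_enabled). Roughly, from
memory of drivers/mmc/core/block.c (simplified, details may differ):

          /*
           * If reliable write is supported, advertise a write cache
           * plus FUA, so REQ_FUA becomes a reliable write.  Otherwise,
           * if the card has a cache, advertise the cache only and let
           * the block layer emulate REQ_FUA with a post-flush.
           */
          if (md->flags & MMC_BLK_CMD23 &&
              ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
               card->ext_csd.rel_sectors)) {
                  md->flags |= MMC_BLK_REL_WR;
                  blk_queue_write_cache(md->queue.queue, true, true);
          } else if (mmc_cache_enabled(card->host)) {
                  blk_queue_write_cache(md->queue.queue, true, false);
          }

Dropping REQ_FUA would essentially mean always taking the second branch
for cards with a cache, so the numbers in this thread are really
measuring an emulated post-flush against a reliable write.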