On 10/23/23 19:28, Ming Lei wrote:
> On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
>> Performance of UFS devices is reduced significantly by the fair tag sharing
>> algorithm. This is because UFS devices have multiple logical units and a
>> limited queue depth (32 for UFS 3.1 devices) and also because it takes time to
>> give tags back after activity on a request queue has stopped. This patch series
>> addresses this issue by introducing a flag that allows block drivers to
>> disable fair sharing.
>> Please consider this patch series for the next merge window.
> In the previous post [1], you mentioned that the issue is caused by the
> non-I/O queue of a WLUN, but that explanation no longer appears in this
> version. IMO, it isn't reasonable to account a non-I/O LUN for tag
> fairness, so one solution could be to not take non-I/O queues into account
> for fair tag sharing. Disabling fair tag sharing for the whole tag set
> seems like overkill.
> And if you mean normal I/O LUNs, can you share more details about the
> performance drop, such as the test case, how many I/O LUNs are involved,
> and how to observe the drop? It is no longer simple, because the
> performance of multiple LUNs has to be considered.
> [1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@xxxxxxx/
Hi Ming,
Submitting I/O to a WLUN is only one example of a use case that
activates the fair sharing algorithm for UFS devices. Another use
case is simultaneous activity for multiple data LUNs. Conventional
UFS devices typically have four data LUNs and zoned UFS devices
typically have five data LUNs. From an Android device with a zoned UFS
device:
$ adb shell ls /sys/class/scsi_device
0:0:0:0
0:0:0:1
0:0:0:2
0:0:0:3
0:0:0:4
0:0:0:49456
0:0:0:49476
0:0:0:49488
The first five are data logical units. The last three are WLUNs.
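As a rough illustration of how the listing above splits into data LUNs and WLUNs, here is a minimal Python sketch. It assumes that well-known LUNs are the ones whose upper address byte is 0xC1 (the constant name SCSI_W_LUN_BASE and the 0xFF00 mask are assumptions modeled on the SCSI well-known LUN address format), and the helper names are hypothetical:

```python
# Assumption: well-known LUN addresses have 0xC1 in the upper byte.
SCSI_W_LUN_BASE = 0xC100


def is_wlun(lun: int) -> bool:
    """Return True if the LUN number looks like a well-known LUN."""
    return (lun & 0xFF00) == SCSI_W_LUN_BASE


def split_luns(entries):
    """Split host:channel:target:lun strings into (data LUNs, WLUNs)."""
    data, wluns = [], []
    for entry in entries:
        # The LUN is the last colon-separated field.
        lun = int(entry.rsplit(":", 1)[1])
        (wluns if is_wlun(lun) else data).append(lun)
    return data, wluns


# The entries from the scsi_device listing quoted above.
listing = [
    "0:0:0:0", "0:0:0:1", "0:0:0:2", "0:0:0:3", "0:0:0:4",
    "0:0:0:49456", "0:0:0:49476", "0:0:0:49488",
]

data_luns, wluns = split_luns(listing)
print("data LUNs:", data_luns)
print("WLUNs:", [hex(lun) for lun in wluns])
```

With this listing the sketch reports five data LUNs (0 through 4) and three WLUNs (0xc130, 0xc144 and 0xc150).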
For a block size of 4 KiB, I see 144 K IOPS for queue depth 31 and
107 K IOPS for queue depth 15 (the queue depth is reduced from 31 to 15
when I/O is submitted to two LUNs simultaneously). In other words,
disabling fair sharing results in up to 35% higher IOPS for small reads
when two logical units are active simultaneously. I think that's
a very significant performance difference.
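For reference, the percentage follows directly from the two IOPS numbers above. A small Python sketch of the arithmetic (the function name is hypothetical, and the inputs are the rounded 144 K and 107 K figures, not raw measurements):

```python
def relative_gain(iops_high: float, iops_low: float) -> float:
    """Percentage by which iops_high exceeds iops_low."""
    return (iops_high - iops_low) / iops_low * 100.0


# 144 K IOPS at queue depth 31 versus 107 K IOPS at queue depth 15.
gain = relative_gain(144_000, 107_000)
print(f"{gain:.1f}% higher IOPS")  # prints "34.6% higher IOPS"
```

That is where the "up to 35%" figure comes from.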
Thanks,
Bart.