On Tue, Oct 24, 2023 at 09:41:50AM -0700, Bart Van Assche wrote:
> On 10/23/23 19:28, Ming Lei wrote:
> > On Mon, Oct 23, 2023 at 01:36:32PM -0700, Bart Van Assche wrote:
> > > Performance of UFS devices is reduced significantly by the fair tag
> > > sharing algorithm. This is because UFS devices have multiple logical
> > > units and a limited queue depth (32 for UFS 3.1 devices), and also
> > > because it takes time to give tags back after activity on a request
> > > queue has stopped. This patch series addresses this issue by
> > > introducing a flag that allows block drivers to disable fair sharing.
> > >
> > > Please consider this patch series for the next merge window.
> >
> > In the previous post [1] you mentioned that the issue is caused by the
> > non-I/O queue of the WLUN, but that explanation seems to be gone from
> > this version.
> >
> > IMO, it isn't reasonable to account a non-I/O LUN for tag fairness, so
> > a solution could be to leave non-I/O queues out of fair tag sharing.
> > Disabling fair tag sharing for the whole tagset could be overkill.
> >
> > And if you mean normal I/O LUNs, can you share more details about the
> > performance drop, such as the test case, how many I/O LUNs are
> > involved, and how the drop is observed? It isn't a simple matter any
> > more, since the performance of multiple LUNs has to be considered.
> >
> > [1] https://lore.kernel.org/linux-block/20231018180056.2151711-1-bvanassche@xxxxxxx/
>
> Hi Ming,
>
> Submitting I/O to a WLUN is only one example of a use case that
> activates the fair sharing algorithm for UFS devices. Another use case
> is simultaneous activity on multiple data LUNs. Conventional UFS
> devices typically have four data LUNs and zoned UFS devices typically
> have five. From an Android device with a zoned UFS device:
>
> $ adb shell ls /sys/class/scsi_device
> 0:0:0:0
> 0:0:0:1
> 0:0:0:2
> 0:0:0:3
> 0:0:0:4
> 0:0:0:49456
> 0:0:0:49476
> 0:0:0:49488
>
> The first five are data logical units. The last three are WLUNs.
>
> For a block size of 4 KiB, I see 144 K IOPS at queue depth 31 and
> 107 K IOPS at queue depth 15 (the queue depth is reduced from 31 to 15
> when I/O is submitted to two LUNs simultaneously). In other words,
> disabling fair sharing results in up to 35% higher IOPS for small
> reads when two logical units are active simultaneously. I think that
> is a very significant performance difference.

Yeah, performance does drop when the queue depth is cut in half, if the
queue depth was low to begin with (a quick model of the sharing
arithmetic is at the end of this mail). However, it isn't enough to
measure performance on a single LUN only: what is the effect when
running I/O over the two or five data LUNs concurrently?

SATA should have a similar issue too, so a more generic improvement may
be to bypass fair tag sharing whenever the queue depth is low (such as
< 32), if it turns out that fair tag sharing doesn't work well at low
queue depth (second sketch below).

Also, the 'fairness' could be enhanced dynamically via the SCSI LUN's
queue depth, which can be adjusted at runtime (see the
scsi_change_queue_depth() sketch at the end).

Thanks,
Ming
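
P.S. For reference, here is a small userspace model of the fair tag
sharing arithmetic (a sketch of what hctx_may_queue() in block/blk-mq.h
does, not the kernel code itself; the exact rounding varies between
kernel versions):

/*
 * Each queue marked active gets roughly total_depth / active_users
 * tags. The kernel also applies a floor of four tags per queue.
 */
#include <stdio.h>

static unsigned int shared_depth(unsigned int total_depth, unsigned int users)
{
	unsigned int depth;

	if (users <= 1)
		return total_depth;	/* no sharing: full depth */
	depth = total_depth / users;	/* fair share per active queue */
	return depth > 4 ? depth : 4;	/* floor of four tags */
}

int main(void)
{
	unsigned int users;

	/* 31 tags, matching the UFS depth measured above. */
	for (users = 1; users <= 5; users++)
		printf("%u active LUN(s) -> per-LUN depth %u\n",
		       users, shared_depth(31, users));
	return 0;
}

This prints per-LUN depths of 31, 15, 10, 7 and 6 for one to five
active LUNs, which matches the 31 -> 15 drop measured above when two
LUNs are active.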
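
The low-queue-depth bypass suggested above could then be modeled as a
variant of shared_depth(); the cutoff of 32 is illustrative only, not
an existing interface:

/*
 * Hypothetical: skip fair sharing on shallow tag sets entirely and
 * let active LUNs compete for the full depth.
 */
static unsigned int shared_depth_low_qd(unsigned int total_depth,
					unsigned int users)
{
	if (total_depth < 32)		/* illustrative cutoff */
		return total_depth;
	return shared_depth(total_depth, users);
}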
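
As for adjusting fairness through the LUN queue depth: the SCSI core
already provides scsi_change_queue_depth() for runtime changes, so a
driver could in principle rebalance per-LUN depths as LUNs become busy
or idle. A kernel-side sketch; the helper and its policy are
hypothetical:

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/*
 * Hypothetical helper: give each busy LUN an equal share of the host
 * tags. scsi_change_queue_depth() is the existing SCSI core API; the
 * rebalancing policy here is made up for illustration.
 */
static void rebalance_lun_depth(struct scsi_device *sdev, int busy_luns)
{
	int depth = sdev->host->can_queue / (busy_luns > 0 ? busy_luns : 1);

	scsi_change_queue_depth(sdev, depth);
}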