Hi, Greg

> > NAND flash memory-based storage devices use a Flash Translation Layer (FTL)
> > to translate the logical addresses of I/O requests to the corresponding
> > flash memory addresses. Mobile storage devices typically have RAM of
> > constrained size and thus lack the memory to keep the whole mapping table.
> > Therefore, mapping tables are partially retrieved from NAND flash on
> > demand, causing random-read performance degradation.
> >
> > To improve random read performance, JESD220-3 (HPB v1.0) proposes HPB
> > (Host Performance Booster), which uses host system memory as a cache for
> > the FTL mapping table. With HPB, FTL data can be read from host memory
> > faster than from NAND flash memory.
> >
> > The current version only supports the DCM (device control mode).
> > This patch consists of three parts to support the HPB feature:
> >
> > 1) HPB probe and initialization process
> > 2) READ -> HPB READ using cached map information
> > 3) L2P (logical to physical) map management
> >
> > In the HPB probe and init process, the device information of the UFS is
> > queried. After checking the supported features, the data structures for
> > HPB are initialized according to the device information.
> >
> > A read I/O in an active sub-region, where the map is cached, is changed
> > to an HPB READ by the HPB.
> >
> > The HPB manages the L2P map using information received from the device.
> > For an active sub-region, the HPB caches the L2P map through a ufshpb_map
> > request. For an inactive region, the HPB discards the L2P map. When a
> > write I/O occurs in an active sub-region, the associated bit in the dirty
> > bitmap is set to prevent stale reads.
> >
> > HPB is shown to give a performance improvement of 58 - 67% for random
> > read workloads. [1]
> >
> > We measured the total start-up time of popular applications and observed
> > the difference made by enabling HPB. The applications are 12 game apps
> > and 24 non-game apps. Each target application was launched in order; one
> > cycle consists of running the 36 applications in sequence. We repeated
> > the cycle to observe the performance improvement from L2P mapping cache
> > hits in HPB.
> >
> > The following is the experiment environment:
> >  - kernel version: 4.4.0
> >  - UFS 2.1 (64GB)
> >
> > Result:
> > +-------+----------+----------+-------+
> > | cycle | baseline | with HPB | diff  |
> > +-------+----------+----------+-------+
> > | 1     | 272.4    | 264.9    | -7.5  |
> > | 2     | 250.4    | 248.2    | -2.2  |
> > | 3     | 226.2    | 215.6    | -10.6 |
> > | 4     | 230.6    | 214.8    | -15.8 |
> > | 5     | 232.0    | 218.1    | -13.9 |
> > | 6     | 231.9    | 212.6    | -19.3 |
> > +-------+----------+----------+-------+
>
> I feel this was buried in the 00 email, shouldn't it go into the 01
> commit changelog so that you can see this?

Sure, I will move this result to the 01 commit log.

> But why does the "cycle" matter here?

I think iterating minimizes other factors that affect the start-up time
of the applications.

> Can you run a normal benchmark, like fio, on here so we can get some
> numbers we know how to compare to other systems with, and possibly
> reproduce it ourselves?  I'm sure fio will easily show random read
> performance increases, right?
Here is my iozone script:

iozone -r 4k -+n -i2 -ecI -t 16 -l 16 -u 16 -s $IO_RANGE/16 -F \
	mnt/tmp_1 mnt/tmp_2 mnt/tmp_3 mnt/tmp_4 mnt/tmp_5 mnt/tmp_6 \
	mnt/tmp_7 mnt/tmp_8 mnt/tmp_9 mnt/tmp_10 mnt/tmp_11 mnt/tmp_12 \
	mnt/tmp_13 mnt/tmp_14 mnt/tmp_15 mnt/tmp_16

Result:
+----------+--------+---------+
| IO range | HPB on | HPB off |
+----------+--------+---------+
| 1 GB     | 294.8  | 300.87  |
| 4 GB     | 293.51 | 179.35  |
| 8 GB     | 294.85 | 162.52  |
| 16 GB    | 293.45 | 156.26  |
| 32 GB    | 277.4  | 153.25  |
+----------+--------+---------+

Thanks,
Daejun
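
For readers who want the fio equivalent Greg asked about, a comparable
16-job 4 KiB random-read invocation might look like the sketch below. It
was not used for the numbers above; the job name, directory, and per-job
size are placeholders (size would correspond to $IO_RANGE/16):

fio --name=hpb_randread --directory=mnt --numjobs=16 --size=1g \
	--rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=1 \
	--group_reporting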