Hi, Greg

> > NAND flash memory-based storage devices use a Flash Translation Layer (FTL)
> > to translate the logical addresses of I/O requests to the corresponding
> > flash memory addresses. Mobile storage devices typically have RAM of
> > constrained size and thus lack the memory to keep the whole mapping table.
> > Therefore, mapping tables are partially retrieved from NAND flash on
> > demand, causing random-read performance degradation.
> >
> > To improve random read performance, JESD220-3 (HPB v1.0) proposes HPB
> > (Host Performance Booster), which uses host system memory as a cache for
> > the FTL mapping table. With HPB, FTL data can be read from host memory
> > faster than from NAND flash memory.
> >
> > The current version only supports the DCM (device control mode).
> > This patch consists of three parts to support the HPB feature:
> >
> > 1) HPB probe and initialization process
> > 2) READ -> HPB READ using cached map information
> > 3) L2P (logical to physical) map management
> >
> > In the HPB probe and init process, the device information of the UFS is
> > queried. After checking the supported features, the data structures for
> > HPB are initialized according to the device information.
> >
> > A read I/O in an active sub-region, where the map is cached, is changed
> > to an HPB READ by the HPB.
> >
> > The HPB manages the L2P map using information received from the device.
> > For an active sub-region, the HPB caches the L2P map through a ufshpb_map
> > request. For an inactive region, the HPB discards the L2P map. When a
> > write I/O occurs in an active sub-region, the associated bit in the dirty
> > bitmap is set to prevent stale reads.
> >
> > HPB is shown to give a performance improvement of 58 - 67% for random
> > read workloads. [1]
> >
> > We measured the total start-up time of popular applications and observed
> > the difference made by enabling HPB. The applications are 12 game apps
> > and 24 non-game apps. Each target application was launched in order; one
> > cycle consists of running the 36 applications in sequence. We repeated
> > the cycle to observe the performance improvement from L2P mapping cache
> > hits in HPB.
> >
> > The following is the experiment environment:
> >  - kernel version: 4.4.0
> >  - UFS 2.1 (64GB)
> >
> > Result:
> > +-------+----------+----------+-------+
> > | cycle | baseline | with HPB | diff  |
> > +-------+----------+----------+-------+
> > | 1     | 272.4    | 264.9    | -7.5  |
> > | 2     | 250.4    | 248.2    | -2.2  |
> > | 3     | 226.2    | 215.6    | -10.6 |
> > | 4     | 230.6    | 214.8    | -15.8 |
> > | 5     | 232.0    | 218.1    | -13.9 |
> > | 6     | 231.9    | 212.6    | -19.3 |
> > +-------+----------+----------+-------+
>
> I feel this was buried in the 00 email, shouldn't it go into the 01
> commit changelog so that you can see this?

Sure, I will move this result to the 01 commit log.

> But why does the "cycle" matter here?

I think iterating minimizes other factors that affect the start-up time
of the applications.

> Can you run a normal benchmark, like fio, on here so we can get some
> numbers we know how to compare to other systems with, and possibly
> reproduce it ourselves?  I'm sure fio will easily show random read
> performance increases, right?
Here is my iozone script:

iozone -r 4k -+n -i2 -ecI -t 16 -l 16 -u 16 -s $IO_RANGE/16 -F \
	mnt/tmp_1 mnt/tmp_2 mnt/tmp_3 mnt/tmp_4 mnt/tmp_5 mnt/tmp_6 \
	mnt/tmp_7 mnt/tmp_8 mnt/tmp_9 mnt/tmp_10 mnt/tmp_11 mnt/tmp_12 \
	mnt/tmp_13 mnt/tmp_14 mnt/tmp_15 mnt/tmp_16

Result:
+----------+--------+---------+
| IO range | HPB on | HPB off |
+----------+--------+---------+
| 1 GB     | 294.8  | 300.87  |
| 4 GB     | 293.51 | 179.35  |
| 8 GB     | 294.85 | 162.52  |
| 16 GB    | 293.45 | 156.26  |
| 32 GB    | 277.4  | 153.25  |
+----------+--------+---------+

Thanks,
Daejun
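
For readers who want the fio equivalent Greg asked about, a comparable
16-job 4 KiB random-read invocation might look like the sketch below. It
was not used for the numbers above; the job name, directory, and per-job
size are placeholders (size would correspond to $IO_RANGE/16):

fio --name=hpb_randread --directory=mnt --numjobs=16 --size=1g \
	--rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=1 \
	--group_reporting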