On 11/18/2017 03:53 AM, Jaegeuk Kim wrote:
> ...
>>>>>>>>>>>>>>>>> From: Hyunchul Lee <cheol.lee@xxxxxxx>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Using write hints[1], applications can indicate the lifetime of the data
>>>>>>>>>>>>>>>>> written to devices, and [2] reported that the write hints patch
>>>>>>>>>>>>>>>>> decreased writes in NAND by 25%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These hints help F2FS determine the following:
>>>>>>>>>>>>>>>>> 1) the segment types where the data will be written.
>>>>>>>>>>>>>>>>> 2) the hints that will be passed down to devices with the data of segments.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This patch set implements the first mapping, from write hints to segment types,
>>>>>>>>>>>>>>>>> as shown below.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> hints                 segment type
>>>>>>>>>>>>>>>>> -----                 ------------
>>>>>>>>>>>>>>>>> WRITE_LIFE_SHORT      CURSEG_COLD_DATA
>>>>>>>>>>>>>>>>> WRITE_LIFE_EXTREME    CURSEG_HOT_DATA
>>>>>>>>>>>>>>>>> others                CURSEG_WARM_DATA
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The F2FS policy for hot/cold separation has precedence over these hints, and
>>>>>>>>>>>>>>>>> hints are not applied in in-place update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could we change to disable IPU if a file/inode write hint exists?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am afraid that this has side effects. For example, this could cause
>>>>>>>>>>>>>>> out-of-place updates even when there are not enough free segments.
>>>>>>>>>>>>>>> I can write the patch that handles these situations. But I wonder
>>>>>>>>>>>>>>> whether this is required, and I am not sure which IPU policies can be disabled.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Oh, as I replied in another thread, I think IPU just affects filesystem
>>>>>>>>>>>>>> hot/cold separation, rather than this feature. So I think it will be okay
>>>>>>>>>>>>>> to not consider it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Before the second mapping is implemented, write hints are not passed down
>>>>>>>>>>>>>>>>> to devices, because it is better that the data of a segment have the same
>>>>>>>>>>>>>>>>> hint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]: c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
>>>>>>>>>>>>>>>>> [2]: https://lwn.net/Articles/726477/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you write a patch to support passing write hints to the block layer for
>>>>>>>>>>>>>>>> buffered writes, as in the commit below:
>>>>>>>>>>>>>>>> 0127251c45ae ("ext4: add support for passing in write hints for buffered writes")
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sure I will. I wrote it already ;)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cool, ;)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that data from the same segment should be passed down with the same
>>>>>>>>>>>>>>> hint, and the following mapping is reasonable. I wonder what your opinion
>>>>>>>>>>>>>>> about it is.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> segment type          hints
>>>>>>>>>>>>>>> ------------          -----
>>>>>>>>>>>>>>> CURSEG_COLD_DATA      WRITE_LIFE_EXTREME
>>>>>>>>>>>>>>> CURSEG_HOT_DATA       WRITE_LIFE_SHORT
>>>>>>>>>>>>>>> CURSEG_COLD_NODE      WRITE_LIFE_NORMAL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have WRITE_LIFE_LONG defined rather than WRITE_LIFE_NORMAL in fs.h?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> CURSEG_HOT_NODE       WRITE_LIFE_MEDIUM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As far as I know, in the cell phone scenario, data of meta_inode is hottest, then hot
>>>>>>>>>>>>>> data, warm node, and cold node should be coldest. So I suggest we define the
>>>>>>>>>>>>>> mapping as below:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> META_DATA                 WRITE_LIFE_SHORT
>>>>>>>>>>>>>> HOT_DATA & WARM_NODE      WRITE_LIFE_MEDIUM
>>>>>>>>>>>>>> HOT_NODE & WARM_DATA      WRITE_LIFE_LONG
>>>>>>>>>>>>>> COLD_NODE & COLD_DATA     WRITE_LIFE_EXTREME
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree, but I am not sure that assigning the same hint to a node and data
>>>>>>>>>>>>> segment is good.
>>>>>>>>>>>>> Because NVMe is likely to write them in the same erase
>>>>>>>>>>>>> block if they have the same hint.
>>>>>>>>>>>>
>>>>>>>>>>>> If we do not give the hint, they can still be written to the same erase block,
>>>>>>>>>>
>>>>>>>>>> I mean it's possible to write them to the same erase block. :)
>>>>>>>>>>
>>>>>>>>>>>> right? It will not be worse?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If the hint is not given, I think that they could be written to
>>>>>>>>>>> the same erase block, or not. But if we give the same hint, they are written
>>>>>>>>>>> to the same block.
>>>>>>>>>>
>>>>>>>>>> IMO, only if the underlying device can support more hint types or opened channels,
>>>>>>>>>> and the actual temperatures of data segments and node segments are quite different,
>>>>>>>>>> can we separate them.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Okay, if Jaegeuk Kim agrees with this, I will submit the patch that
>>>>>>>>> implements your proposed mapping.
>>>>>>>>
>>>>>>>> How about this? We'd better split data and node blocks as much as possible.
>>>>>>>>
>>>>>>>> segment type              hints
>>>>>>>> ------------              -----
>>>>>>>> COLD_NODE & COLD_DATA     WRITE_LIFE_NONE
>>>>>>>
>>>>>>> WRITE_LIFE_NONE means there are no hints about write lifetime.
>>>>>>>
>>>>>>> Shouldn't we define COLD_NODE & COLD_DATA as WRITE_LIFE_EXTREME?
>>>>>>
>>>>>> The assumption would be to split different types of blocks by flash firmware,
>>>>>> so I think we can use WRITE_LIFE_NONE as a type as well.
>>>>>>
>>>>>
>>>>> WRITE_LIFE_NONE means that no stream id is specified. It equals WRITE_LIFE_NOT_SET.
>>>>
>>>> Right, I just saw the nvme implementation:
>>>>
>>>> nvme_assign_write_stream
>>>>
>>>>     enum rw_hint streamid = req->write_hint;
>>>>
>>>>     if (streamid == WRITE_LIFE_NOT_SET || streamid == WRITE_LIFE_NONE)
>>>>         streamid = 0;
>>>>     else {
>>>>         streamid--;
>>>>         ...
>>>>
>>>>> So I think that we can define WARM_DATA as WRITE_LIFE_NONE, and
>>>>> COLD_NODE & COLD_DATA as WRITE_LIFE_EXTREME.
>>>
>>> What's the point?
>>>
>>> segment type              hints                 streamid
>>> -------------             -----                 -------
>>> COLD_NODE & COLD_DATA     WRITE_LIFE_NONE       0
>>> WARM_DATA                 WRITE_LIFE_EXTREME    4
>>> HOT_NODE & WARM_NODE      WRITE_LIFE_LONG       3
>>> HOT_DATA                  WRITE_LIFE_MEDIUM     2
>>> META_DATA                 WRITE_LIFE_SHORT      1
>>>
>>> So, I don't think anything is wrong. Again, I don't care about the hotness
>>> implied by the naming, but do care how to split different types of blocks with
>>> different stream ids. Exceptions would be giving _SHORT or _MEDIUM, which are
>>> likely to be latency-critical, since I guess firmware may be able to store them
>>> in the SLC buffer.
>>>
>>> Am I missing that _NONE has another meaning?
>>>
>>
>> What I am worried about is that data with no hint has WRITE_LIFE_NOT_SET (id 0).
>> If block devices have swap partitions and other file systems, cold data could
>> be mixed with data from those. Does this seem way too much?
>
> That seems like how to distinguish write_hints across multiple partitions?
>

What I intend is that, because there could be other partitions and the default
stream ID is 0, WRITE_LIFE_EXTREME could be better than WRITE_LIFE_NONE for cold data.

Thanks.

>> And I think that stream id 0 means disabling stream directives,
>> because NVME_RW_DTYPE_STREAMS is cleared.
>
> Then, I guess SSD FW will just handle 5 stream IDs including disabled 0.
>
> Thanks,
>
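
For reference, the streamid column in the table quoted above can be reproduced with a small userspace sketch. This is only an illustration under stated assumptions: the enum below is a local stand-in for the kernel's enum rw_hint (values assumed to match fs.h), the blk_class names and the helper functions are hypothetical and are not F2FS or NVMe code, and the id derivation mirrors only the part of nvme_assign_write_stream quoted in this thread.

```c
#include <assert.h>

/* Local stand-in for the kernel's enum rw_hint (values assumed from fs.h). */
enum rw_hint {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE    = 1,
	WRITE_LIFE_SHORT   = 2,
	WRITE_LIFE_MEDIUM  = 3,
	WRITE_LIFE_LONG    = 4,
	WRITE_LIFE_EXTREME = 5,
};

/* Hypothetical block classes for illustration; not the F2FS CURSEG_* enum. */
enum blk_class {
	BLK_META_DATA, BLK_HOT_DATA, BLK_WARM_DATA, BLK_COLD_DATA,
	BLK_HOT_NODE, BLK_WARM_NODE, BLK_COLD_NODE, BLK_CLASS_COUNT,
};

/* The mapping from the quoted table (COLD_* mapped to WRITE_LIFE_NONE). */
static const enum rw_hint class_to_hint[BLK_CLASS_COUNT] = {
	[BLK_META_DATA] = WRITE_LIFE_SHORT,
	[BLK_HOT_DATA]  = WRITE_LIFE_MEDIUM,
	[BLK_WARM_DATA] = WRITE_LIFE_EXTREME,
	[BLK_COLD_DATA] = WRITE_LIFE_NONE,
	[BLK_HOT_NODE]  = WRITE_LIFE_LONG,
	[BLK_WARM_NODE] = WRITE_LIFE_LONG,
	[BLK_COLD_NODE] = WRITE_LIFE_NONE,
};

/* Mirrors the quoted nvme logic: NOT_SET and NONE collapse to stream 0. */
static unsigned int hint_to_streamid(enum rw_hint hint)
{
	if (hint == WRITE_LIFE_NOT_SET || hint == WRITE_LIFE_NONE)
		return 0;
	return (unsigned int)hint - 1;
}

static unsigned int class_to_streamid(enum blk_class c)
{
	return hint_to_streamid(class_to_hint[c]);
}

/* Checks the computed ids against the streamid column of the table. */
static void check_streamid_column(void)
{
	assert(class_to_streamid(BLK_COLD_NODE) == 0);
	assert(class_to_streamid(BLK_COLD_DATA) == 0);
	assert(class_to_streamid(BLK_WARM_DATA) == 4);
	assert(class_to_streamid(BLK_HOT_NODE)  == 3);
	assert(class_to_streamid(BLK_WARM_NODE) == 3);
	assert(class_to_streamid(BLK_HOT_DATA)  == 2);
	assert(class_to_streamid(BLK_META_DATA) == 1);
}
```

In this sketch, hint_to_streamid(WRITE_LIFE_NOT_SET) and class_to_streamid(BLK_COLD_DATA) both give 0, which is the overlap between unhinted writes and cold data that the concern about WRITE_LIFE_NONE refers to.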