RE: [RFC PATCH] Introduce generalized data temperature estimation framework

On Tue, 2025-01-28 at 09:45 +0100, Hans Holmberg wrote:
> On Mon, Jan 27, 2025 at 9:59 PM Viacheslav Dubeyko
> <Slava.Dubeyko@xxxxxxx> wrote:
> > 
> > On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> > > On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> > > <Slava.Dubeyko@xxxxxxx> wrote:
> > > > 
> > > > 

<skipped>

> > > > > > 
> > > > > > [HOW TO USE THE APPROACH]
> > > > > > The lifetime of the data "temperature" value for a file
> > > > > > can be explained in steps: (1) the iget() method sets up
> > > > > > the data "temperature" object; (2) folio_account_dirtied()
> > > > > > accounts the number of dirty memory pages and
> > > > > > tries to estimate the current temperature of the file;
> > > > > > (3) folio_clear_dirty_for_io() decreases the number of dirty
> > > > > > memory pages and increases the number of updated pages;
> > > > > > (4) folio_account_dirtied() also decreases the file's
> > > > > > "temperature" if no updates have happened for some time;
> > > > > > (5) the file system can get the file's temperature and
> > > > > > share the hint with the block layer; (6) the inode
> > > > > > eviction method removes and frees the data "temperature"
> > > > > > object.
> > > > > 
> > > > > I don't want to pour gasoline on old flame wars, but what is the
> > > > > advantage of this auto-magic data temperature framework vs the existing
> > > > > framework?
> > > > > 
> > > > 
> > > > There is no magic in this framework. :) It's a simple and compact framework.
> > > > 
> > > > >  'enum rw_hint' has temperature in the range of none, short,
> > > > > medium, long and extreme (whatever that means), can be set by an
> > > > > application via an fcntl() and is plumbed down all the way to the bio
> > > > > level by most FSes that care.
> > > > 
> > > > I see your point. But the 'enum rw_hint' defines qualitative grades again:
> > > > 
> > > > enum rw_hint {
> > > >         WRITE_LIFE_NOT_SET      = RWH_WRITE_LIFE_NOT_SET,
> > > >         WRITE_LIFE_NONE         = RWH_WRITE_LIFE_NONE,
> > > >         WRITE_LIFE_SHORT        = RWH_WRITE_LIFE_SHORT,  <-- HOT data
> > > >         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM, <-- WARM data
> > > >         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,   <-- COLD data
> > > >         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
> > > > } __packed;
> > > > 
> > > > First of all, again, it's hard to compare the hotness of different files
> > > > on such a qualitative basis. Secondly, who decides what the hotness of
> > > > particular data is? People can only guess or assume the nature of data
> > > > based on past experience. But workloads change and evolve
> > > > continuously and in real time. Technically speaking, an application can
> > > > try to estimate the hotness of its data, but, again, the file system can
> > > > receive requests from multiple threads and multiple applications, so an
> > > > application can only guess at the real nature of the data too. Moreover,
> > > > nobody wants to implement dedicated data-hotness-estimation logic in
> > > > every application.
> > > > 
> > > > This framework is inode based and it tries to estimate a file's
> > > > "temperature" on a quantitative basis. Advantages of this framework:
> > > > (1) we don't need to guess about data hotness, the temperature is
> > > > calculated quantitatively; (2) a quantitative basis gives the opportunity
> > > > for fair comparison of different files' temperatures; (3) a file's
> > > > temperature changes as the workload(s) change in real time; (4) a file's
> > > > temperature is correctly accounted under load from multiple
> > > > applications. I believe these are the advantages of the suggested framework.
> > > > 
> > > 
> > > While I think the general idea (using file-overwrite rates as a
> > > parameter when doing data placement) could be useful, it could not
> > > replace the user space hinting we already have.
> > > 
> > > Applications (e.g. RocksDB) doing sequential writes to files that are
> > > immutable until deleted (no overwrites) would not benefit. We need user
> > > space help to estimate data lifetime for those workloads, and the
> > > relative write lifetime hints are useful for that.
> > > 
> > 
> > I don't see any competition or conflict here. The suggested approach and
> > user-space hinting could be complementary techniques. If user-space logic
> > would like to use a special data placement policy, then it can share hints in
> > its own way. But, potentially, the suggested approach to temperature
> > calculation can be used to check the effectiveness of the user-space hinting,
> > and maybe to correct it. So, I don't see any conflict here.
> 
> I don't see a conflict here either, my point is just that this
> framework cannot replace the user hints.
> 

I have no intentions to replace any existing techniques. :)

> > 
> > > So what I am asking myself is if this framework is added, who would
> > > benefit? Without any benchmark results it's a bit hard to tell :)
> > > 
> > 
> > Which benefits would you like to see? I assume we would like to: (1) prolong
> > device lifetime, (2) improve performance, (3) decrease the GC burden. Do you
> > mean these benefits?
> 
> Yep, decreased write amplification essentially.
> 

The important point here is that the suggested framework offers only a means to
estimate temperature. Only a file system technique can decrease or increase
write amplification. So we need to compare apples with apples. As far as I
know, F2FS has an algorithm for estimating and employing temperature. Do you
have F2FS in mind, or how do you see the write amplification decrease being
estimated? Every file system will have its own way to employ temperature.

> > 
> > As far as I can see, different file systems can use temperature in different
> > ways. And this slightly complicates the benchmarking. So, how can we define
> > the effectiveness here and how can we measure it? Do you have a vision here? I
> > am happy to do more benchmarking.
> > 
> > My point is that the calculated file temperature gives a quantitative way to
> > distribute user data evenly among several temperature groups ("baskets"). And
> > these baskets/segments/anything-else give a way to properly group data. File
> > systems can employ the temperature in various ways, but it can definitely help
> > to elaborate a proper data placement policy. As a result, the GC burden can be
> > decreased, performance can be improved, and device lifetime can be prolonged.
> > So, how can we benchmark these points? And which approaches make sense to compare?
> > 
> 
> To start off, it would be nice to demonstrate that write amplification
> decreases for some workload when the temperature is taken into
> account. It would be great if the workload would be an actual
> application workload or a synthetic one mimicking some real-world-like
> use case.
> Run the same workload twice, measure write amplification and compare results.
> 

Another difficulty here: what is the way to measure write amplification, from
your point of view? Which benchmarking tool or framework do you suggest for
write amplification estimation?

> What user workloads do you see benefiting from this framework? Which would not?
> 

We first need to talk about the file system mechanism that employs data
temperature in an efficient way, because there is no universal way to employ
data temperature and different file systems can implement completely different
techniques. Only then will it be possible to estimate which file systems can
provide benefits for a particular workload. The suggested framework only
estimates the temperature.

> > > Also, is there a good reason for only supporting buffered io? Direct
> > > IO could benefit in the same way, right?
> > > 
> > 
> > I think that Direct IO could benefit too. The question here is how to account
> > for dirty memory pages and updated memory pages. Currently, I am using
> > folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> > temperature calculation. As far as I can see, Direct IO requires other
> > methods for doing this. The rest of the logic can be the same.
> 
> It's probably a good idea to cover direct IO as well then as this is
> intended to be a generalized framework.

Covering Direct IO is a good point. But even the page cache based approach makes
sense, because LFS and GC based file systems need to manage data in an efficient
way. By the way, do you have a vision of which methods could be used in the
Direct IO case to account for dirty and updated memory pages?

Thanks,
Slava.




