RE: [RFC PATCH] Introduce generalized data temperature estimation framework

Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> · Mon, 27 Jan 2025 20:12:45 +0000

On Sat, 2025-01-25 at 07:25 -0500, Jeff Layton wrote:
> On Thu, 2025-01-23 at 12:24 -0800, Viacheslav Dubeyko wrote:
> > [PROBLEM DECLARATION]
> > Efficient data placement policy is a Holy Grail for data
> > storage and file system engineers. Achieving this goal is
> > equally important and really hard. Multiple data storage
> > and file system technologies have been invented to manage
> > the data placement policy (for example, COW, ZNS, FDP, etc).
> > But these technologies still require the hints related to
> > nature of data from application side.
> > 
> > [DATA "TEMPERATURE" CONCEPT]
> > One of the widely used and intuitively clear idea of data
> > nature definition is data "temperature" (cold, warm,
> > hot data). However, data "temperature" is as intuitively
> > sound as illusive definition of data nature. Generally
> > speaking, thermodynamics defines temperature as a way
> > to estimate the average kinetic energy of vibrating
> > atoms in a substance. But we cannot see a direct analogy
> > between data "temperature" and temperature in physics
> > because data is not something that has kinetic energy.
> > 
> > [WHAT IS GENERALIZED DATA "TEMPERATURE" ESTIMATION]
> > We usually imply that if some data is updated more
> > frequently, then such data is more hot than other one.
> > But, it is possible to see several problems here:
> > (1) How can we estimate the data "hotness" in
> > quantitative way? (2) We can state that data is "hot"
> > after some number of updates. It means that this
> > definition implies state of the data in the past.
> > Will this data continue to be "hot" in the future?
> > Generally speaking, the crucial problem is how to define
> > the data nature or data "temperature" in the future.
> > Because, this knowledge is the fundamental basis for
> > elaboration an efficient data placement policy.
> > Generalized data "temperature" estimation framework
> > suggests the way to define a future state of the data
> > and the basis for quantitative measurement of data
> > "temperature".
> > 
> > [ARCHITECTURE OF FRAMEWORK]
> > Usually, file system has a page cache for every inode. And
> > initially memory pages become dirty in page cache. Finally,
> > dirty pages will be sent to storage device. Technically
> > speaking, the number of dirty pages in a particular page
> > cache is the quantitative measurement of current "hotness"
> > of a file. But number of dirty pages is still not stable
> > basis for quantitative measurement of data "temperature".
> > It is possible to suggest of using the total number of
> > logical blocks in a file as a unit of one degree of data
> > "temperature". As a result, if the whole file was updated
> > several times, then "temperature" of the file has been
> > increased for several degrees. And if the file is under
> > continuous updates, then the file "temperature" is growing.
> > 
> > We need to keep not only current number of dirty pages,
> > but also the number of updated pages in the near past
> > for accumulating the total "temperature" of a file.
> > Generally speaking, total number of updated pages in the
> > nearest past defines the aggregated "temperature" of file.
> > And number of dirty pages defines the delta of
> > "temperature" growth for current update operation.
> > This approach defines the mechanism of "temperature" growth.
> > 
> > But if we have no more updates for the file, then
> > "temperature" needs to decrease. Starting and ending
> > timestamps of update operation can work as a basis for
> > decreasing "temperature" of a file. If we know the number
> > of updated logical blocks of the file, then we can divide
> > the duration of update operation on number of updated
> > logical blocks. As a result, this is the way to define
> > a time duration per one logical block. By means of
> > multiplying this value (time duration per one logical
> > block) on total number of logical blocks in file, we
> > can calculate the time duration of "temperature"
> > decreasing for one degree. Finally, the operation of
> > division the time range (between end of last update
> > operation and begin of new update operation) on
> > the time duration of "temperature" decreasing for
> > one degree provides the way to define how many
> > degrees should be subtracted from current "temperature"
> > of the file.
> > 
> > [HOW TO USE THE APPROACH]
> > The lifetime of data "temperature" value for a file
> > can be explained by steps: (1) iget() method sets
> > the data "temperature" object; (2) folio_account_dirtied()
> > method accounts the number of dirty memory pages and
> > tries to estimate the current temperature of the file;
> > (3) folio_clear_dirty_for_io() decreases number of dirty
> > memory pages and increases number of updated pages;
> > (4) folio_account_dirtied() also decreases file's
> > "temperature" if updates haven't happened for some time;
> > (5) file system can get file's temperature and
> > share the hint with block layer; (6) inode
> > eviction method removes and frees the data "temperature"
> > object.
> > 
> > Signed-off-by: Viacheslav Dubeyko <slava@xxxxxxxxxxx>
> > ---
> >  fs/Kconfig                             |   2 +
> >  fs/Makefile                            |   1 +
> >  fs/data-temperature/Kconfig            |  11 +
> >  fs/data-temperature/Makefile           |   3 +
> >  fs/data-temperature/data_temperature.c | 347 +++++++++++++++++++++++++
> >  include/linux/data_temperature.h       | 124 +++++++++
> >  include/linux/fs.h                     |   4 +
> >  mm/page-writeback.c                    |   9 +
> >  8 files changed, 501 insertions(+)
> >  create mode 100644 fs/data-temperature/Kconfig
> >  create mode 100644 fs/data-temperature/Makefile
> >  create mode 100644 fs/data-temperature/data_temperature.c
> >  create mode 100644 include/linux/data_temperature.h
> > 
> 
> 
> This seems like an interesting idea, but how do you intend to use the
> temperature?
> 

Yes, this is not a complete implementation. A complete one requires modifying
particular file system(s). For now, I am simply sharing the initial vision.

Potentially, different file systems can use the temperature in different ways.
The simplest approach is to pass the temperature as a hint to the block layer,
where an FDP SSD, for example, can use it. But a file system itself can also use
the temperature value to build its data placement policy. If a file system uses
the segment concept, then different types of segments can store data with
different temperatures. Usually, it is easy to store different types of metadata
in different segments, although even different types of metadata could be
grouped by temperature. A proper placement policy for user data, however, is
always a hard point for a file system. So, temperature provides a basis for
introducing a set of segments that receive user data with different
temperatures.
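
To make the segment idea a bit more concrete, here is a minimal user-space
sketch (it is not code from the patch). It assumes three temperature classes
and one current segment per class. The names temp_class, temp_thresholds,
classify_temperature() and select_current_segment(), as well as any threshold
values, are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical temperature classes; the patch does not define them. */
enum temp_class {
	TEMP_COLD,
	TEMP_WARM,
	TEMP_HOT,
	TEMP_NR_CLASSES
};

/* Hypothetical per-file-system thresholds, in degrees. */
struct temp_thresholds {
	uint64_t warm_min;	/* first temperature treated as warm */
	uint64_t hot_min;	/* first temperature treated as hot */
};

/* Map a file's temperature (in degrees) to a class. */
static enum temp_class classify_temperature(uint64_t degrees,
					    const struct temp_thresholds *th)
{
	/* Sketch-level precondition: thresholds must be ordered. */
	assert(th->warm_min <= th->hot_min);

	if (degrees >= th->hot_min)
		return TEMP_HOT;
	if (degrees >= th->warm_min)
		return TEMP_WARM;
	return TEMP_COLD;
}

/*
 * Pick the current segment that receives new user data of a file.
 * current_segs[] holds one (hypothetical) segment ID per class.
 */
static uint64_t select_current_segment(uint64_t degrees,
				       const struct temp_thresholds *th,
				       const uint64_t current_segs[TEMP_NR_CLASSES])
{
	return current_segs[classify_temperature(degrees, th)];
}
```

The same class could be translated into a hint for the block layer instead of
a segment ID. How many classes make sense and where the thresholds should be
is an open question.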

Even if a file system doesn't use the segment concept, multiple file systems use
the Allocation Groups concept. And, potentially, files with different
temperatures can be stored or grouped into different Allocation Groups.

I believe that, potentially, the GC subsystem of LFS file systems can use the
temperature to build a more efficient policy, because it is clear that file
content with high temperature doesn't need to be processed by GC. I don't have a
clear algorithm for this policy in mind, but hot segments can be cleaned without
GC intervention, for example, since frequent updates should invalidate their
blocks anyway.

Another interesting point is that this approach decreases the temperature when
the number of updates is decreasing. It means that a COW policy can store a
file's content in segments of different temperature on every update, following
the temperature change over time. As a result, different portions of a big file
can end up distributed among multiple segments. But a big file is always
distributed among multiple segments anyway.
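
The cooling rule from the patch description can be sketched in user-space C as
follows. This is only an illustration of the arithmetic, not the code from the
patch: struct file_temp, its field names, degrees_to_cool() and cool_file() are
hypothetical, timestamps are assumed to be in nanoseconds, and multiplication
overflow is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-file state; field names are not taken from the patch. */
struct file_temp {
	uint64_t total_blocks;		/* logical blocks in file (one degree) */
	uint64_t updated_blocks;	/* blocks updated by the last update */
	uint64_t update_start_ns;	/* start of the last update operation */
	uint64_t update_end_ns;		/* end of the last update operation */
	uint64_t degrees;		/* current temperature */
};

/* Degrees to subtract when a new update begins at now_ns. */
static uint64_t degrees_to_cool(const struct file_temp *t, uint64_t now_ns)
{
	uint64_t duration_ns, ns_per_degree;

	/* Sketch-level precondition: the update ended after it started. */
	assert(t->update_end_ns >= t->update_start_ns);

	if (t->total_blocks == 0 || t->updated_blocks == 0)
		return 0;	/* nothing was updated, no cooling rate known */
	if (now_ns <= t->update_end_ns)
		return 0;	/* no idle time yet */

	duration_ns = t->update_end_ns - t->update_start_ns;
	/*
	 * (duration / updated_blocks) * total_blocks, multiplied first to
	 * keep precision; overflow is ignored in this sketch.
	 */
	ns_per_degree = duration_ns * t->total_blocks / t->updated_blocks;
	if (ns_per_degree == 0)
		ns_per_degree = 1;	/* assumption: avoid division by zero */

	return (now_ns - t->update_end_ns) / ns_per_degree;
}

/* Apply the cooling, never going below zero degrees. */
static void cool_file(struct file_temp *t, uint64_t now_ns)
{
	uint64_t drop = degrees_to_cool(t, now_ns);

	t->degrees = drop < t->degrees ? t->degrees - drop : 0;
}
```

For example, if 50 of 100 logical blocks were updated in 1000 ns, then one
degree corresponds to 2000 ns, and 6000 ns without updates cools the file by
3 degrees.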

> With this patch, it looks like you're just calculating it, but there is
> nothing that uses it and there is no way to access the temperature from
> userland. It would be nice to see this value used by an existing
> subsystem to drive data placement so we can see how it will help
> things.
> 
> > 

I did the benchmarking with the SSDFS file system (but any other file system
can be used for benchmarking too). I am going to introduce several current
segments for user data with the goal of distributing user data by temperature.
Also, as I mentioned, the data of these current segments can be stored by
providing hints to an FDP SSD through the block layer logic. And I have shared
above the potential ways in which various file systems can employ the
calculated temperature.

Regarding userland... I didn't consider sharing the temperature with user-space
subsystems, but it is a great point. Potentially, it is easy to introduce an
ioctl that retrieves the temperature of a particular file. Or maybe sysfs can be
used to expose the distribution of data among temperature groups/ranges. An
application could then use this data to build its own data placement policy.
Let me think about it more.
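
As a first rough idea for the sysfs variant, the distribution could be a simple
histogram of files per temperature range. The user-space sketch below is only
an assumption of how it might look: the power-of-two ranges, the number of
buckets, and the names temp_histogram, temp_bucket(), temp_histogram_add() and
temp_histogram_show() are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TEMP_NR_BUCKETS 8	/* hypothetical number of temperature ranges */

/* Number of files per temperature range. */
struct temp_histogram {
	uint64_t files[TEMP_NR_BUCKETS];
};

/*
 * Bucket 0 holds 0 degrees, bucket 1 holds 1, bucket 2 holds 2-3,
 * bucket 3 holds 4-7, and so on; the last bucket takes everything else.
 */
static unsigned int temp_bucket(uint64_t degrees)
{
	unsigned int b = 0;

	while (degrees != 0 && b < TEMP_NR_BUCKETS - 1) {
		degrees >>= 1;
		b++;
	}
	return b;
}

/* Account one file with the given temperature. */
static void temp_histogram_add(struct temp_histogram *h, uint64_t degrees)
{
	h->files[temp_bucket(degrees)]++;
}

/*
 * Format the histogram as one line of space-separated counters.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int temp_histogram_show(const struct temp_histogram *h,
			       char *buf, size_t len)
{
	size_t off = 0;
	unsigned int i;

	assert(buf != NULL && len > 0);

	for (i = 0; i < TEMP_NR_BUCKETS; i++) {
		int n = snprintf(buf + off, len - off, "%llu%c",
				 (unsigned long long)h->files[i],
				 i == TEMP_NR_BUCKETS - 1 ? '\n' : ' ');

		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

An application could read such a line and decide, for example, whether most of
its files are cold before choosing where to place new data.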

Thanks,
Slava.