On Wed, Nov 07, 2012 at 11:25:33AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 07, 2012 at 02:36:42PM +0800, Zheng Liu wrote:
> > On Tue, Nov 06, 2012 at 03:10:11PM -0800, Darrick J. Wong wrote:
> > > On Tue, Nov 06, 2012 at 05:36:38PM +0800, Ram Pai wrote:
> > > > On Fri, Nov 02, 2012 at 04:41:09PM +0800, Zheng Liu wrote:
> > > > > On Fri, Nov 02, 2012 at 02:38:29PM +0800, Zhi Yong Wu wrote:
> > > > > > There is also another question here.
> > > > > >
> > > > > > How do we save the file temperature across an umount, so that
> > > > > > it is preserved after a reboot?
> > > > > >
> > > > > > The above is a requirement from a DB product. I thought we
> > > > > > could save the file temperature in its inode, that is, add a
> > > > > > new field to struct inode, so that this info is written to
> > > > > > disk along with the inode.
> > > > > >
> > > > > > Any comments or ideas are appreciated, thanks.
> > > > >
> > > > > Hi Zhiyong,
> > > > >
> > > > > I think we could define a callback function. If a filesystem
> > > > > wants to save these data, it can implement a function to do so;
> > > > > each filesystem can decide for itself whether to add one.
> > > > >
> > > > > BTW, I don't actually care much about how these data are saved,
> > > > > because I only want to observe which files are accessed in real
> > > > > time, which is very useful for tracking down problems in our
> > > > > production system.
> > > >
> > > > To me, umounting a filesystem is a way of explicitly telling the
> > > > VFS that the filesystem's data is not hot anymore. So it probably
> > > > does not make sense to store temperatures across mount boundaries.
> > >
> > > I'd prefer that file heat data be retained across mounts -- we
> > > shouldn't throw away all of our observations just because of a
> > > system crash / power outage / scheduled reboot.
> > >
> > > Or, imagine you're a defragging tool.
> > > If you're clever enough to try consolidating all the hot blocks in
> > > one place on disk so that you could aggressively read them all in
> > > at once (e.g. ureadahead), I think you'd want access to as big an
> > > observation pool as possible.
> > >
> > > This just occurred to me -- are you saving all of the file's heat
> > > data, like the per-range read/write counters and the averages? Or
> > > just a single compiled heat rating for the whole file? I suggested
> > > a big hidden file a few days ago because I thought you were trying
> > > to save all the range/heat data, which would probably be painful to
> > > shoehorn into an xattr. If you're only storing a single number,
> > > then the xattr approach is probably ok.
> >
> > Hi Darrick,
> >
> > Maybe the best way is to provide a new mount option or a sysfs
> > switch to turn it on and off, so that the user can decide whether it
> > is enabled. After all, it will bring some extra overhead. At the
> > very least, leaving it on in our production system is unacceptable
> > to me when there is no problem that I need to track.
>
> Hmm... who are the intended in-kernel users of the hot tracking
> feature? I'm starting to wonder whether it's possible (or desirable)
> to implement some of this in userspace and have the kernel ask for
> the hot data as needed, or simply to write a driver program that
> handles the strategy and only needs a kernel interface that moves
> extents around. I feel like we could just write a regular program
> that uses ftrace to record I/O activity and manage all the
> observations that we pick up, and then the db, defrag, dedupe, etc.
> programs could just call into that.
>
> On the other hand, writing a daemon has its own problems with
> distribution, starting it up, and killing it off at shutdown. But it
> would make Zheng's (non)use case easier -- if you don't want it,
> don't run it.

In fact, I often need to help application developers track down I/O
problems.
So let me describe my own workflow. I usually use blktrace to capture
some I/O activity and run a script to filter out the read/write
requests, which contain the sector numbers on disk. Then I use the 'ex'
command in debugfs (for an ext4 filesystem) to get the layout of all
the extents. Finally, I run a script to work out which file is being
accessed. So far I haven't found a better way to do this. That is why
the hot tracking feature is very useful to me, why I am concerned about
the overhead it brings, and why I hope it can be enabled and disabled
dynamically.

As I said, we could write a userspace program to do these things, but
my method has some drawbacks. On one hand, we can't keep an exported
layout for all of our servers because it would consume a huge amount of
disk space. OTOH, when I need to track down a problem with my method, I
have to export the layout first, and that takes a long time -- the
problem may well have disappeared by the time the export finishes. :-(

Thus, writing a daemon might be another solution, but for me it is not
the best one. At the very least we need to do something in the kernel
to record I/O activity so that the user can easily retrieve it.

Regards,
Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
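The sector-to-file mapping step in Zheng's workflow can be sketched
roughly as follows. The extent table, file paths, and sector numbers
below are invented for illustration; in practice the input would come
from parsing blkparse output and debugfs 'ex' output, and the
512-byte-sector / 4 KiB-block assumption has to match the actual
filesystem geometry.

```python
SECTOR_SIZE = 512          # blktrace reports 512-byte sectors
BLOCK_SIZE = 4096          # common ext4 block size; adjust to the fs

# path -> list of (logical_block, physical_block, length_in_blocks),
# i.e. the columns one would scrape from debugfs 'ex' output.
# These entries are made-up example data.
extent_layout = {
    "/var/lib/db/table.ibd": [(0, 34048, 256), (256, 91136, 128)],
    "/var/log/app.log":      [(0, 67584, 64)],
}

def sector_to_file(sector, layout):
    """Return (path, logical_block) owning this sector, or None."""
    fs_block = sector * SECTOR_SIZE // BLOCK_SIZE
    for path, extents in layout.items():
        for logical, physical, length in extents:
            if physical <= fs_block < physical + length:
                return path, logical + (fs_block - physical)
    return None

# Sector numbers filtered out of blktrace output (hypothetical values)
for sector in (272384, 540672, 1000):
    print(sector, "->", sector_to_file(sector, extent_layout))
```

The loop is linear in the number of extents; a real tool would sort the
extents by physical block and binary-search, but the principle is the
same as what the scripts above do by hand.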