RE: [Patch 0/4] RFC : Support for data gradation of a single file.

"Bhattacharya, Suparna" <suparna.bhattacharya@xxxxxxx> · Wed, 11 Apr 2018 09:20:50 +0000

Hi Andreas,

> -----Original Message-----
> From: Andreas Dilger [mailto:adilger@xxxxxxxxx]
> Sent: Wednesday, April 11, 2018 12:10 AM
> To: Sayan Ghosh <sgdgp.2014@xxxxxxxxx>
> Cc: Theodore Y. Ts'o <tytso@xxxxxxx>; Ext4 Developers List <linux-
> ext4@xxxxxxxxxxxxxxx>; Linux FS Devel <linux-fsdevel@xxxxxxxxxxxxxxx>;
> Bhattacharya, Suparna <suparna.bhattacharya@xxxxxxx>; niloy ganguly
> <ganguly.niloy@xxxxxxxxx>; Madhumita Mallick
> <madhu.cse.ju@xxxxxxxxx>; Bharde, Madhumita
> <madhumita.bharde@xxxxxxx>
> Subject: Re: [Patch 0/4] RFC : Support for data gradation of a single file.
> 
> On Apr 10, 2018, at 3:46 AM, Sayan Ghosh <sgdgp.2014@xxxxxxxxx>
> wrote:
> >
> > Hello,
> >
> > Thank you Andreas and Theodore for taking time in reviewing the
> > patchset and also for providing comments and suggestions.
> > I am describing the problem statement in this mail.
> >
> >
> > The goal of our project is broadly to support data gradation of a
> > single file. If the contents of the file are graded in terms of their
> > importance then a corresponding application might need to view/analyse
> > only the important portions. It also helps if the important portions
> > can be accessed quickly without having to go through the entire file.
> > For example, we can think of a learning video with
> > indexing/annotations, in which the annotations contain the important
> > parts of the video. A learner can just be interested in those parts,
> > and it will help him if he can be provided with a reduced view with
> > just the parts he’s interested in. An example of such videos is ACM
> > Webinar videos where a user can navigate using table-of-contents or
> > phrase cloud.
> >
> > The link below is one such video -
> > https://videoken.com/video-detail?videoID=IpGxLWOIZy4&videoDuration=1853&videoName=A%20Friendly%20Introduction%20to%20Machine%20Learning&keyword=A%20Friendly%20Introduction%20to%20Machine%20Learning
> >
> >
> > There’s a word-cluster associated with the video, and upon clicking on
> > a word the red-black arrowheads (down) point to the portions where the
> > word had been used. A more sophisticated version of the same would be
> > to provide the user a complete reduced clipping with the annotated
> > portions of the word cluster, rather than the user having to manually
> > click on the portions he’s interested in.
> >
> > This kind of video file can serve as an input to our system where we
> > know which parts of the file have been marked. Our goal then is to
> > properly place respective important blocks and provide a reduced view
> > of just the important parts of the file. Placing the important blocks
> > in a faster tier (SSD, PM, etc.) greatly enhances the performance of
> > reading and writing of the file.
> >
> > As stated above, we are interested in providing a reduced view of a
> > single file where important and unimportant portions are interspersed
> > - hence splitting it in two filesystems with important and unimportant
> > parts would not serve our objective. Let’s say in the example, a user
> > wants the full view of the video. In this case splitting the video in
> > two filesystems would not be ideal, as the user needs to be provided
> > with both important and unimportant blocks. Creating a sparse layout
> > to overlay two files would be unnecessarily complicated. It would hence
> > be ideal if a file carried that grading information as metadata
> > (extended attributes in our case), and used that information to
> > properly place and fetch blocks when necessary.
> 
> To my thinking, you're always going to have more complex metadata for
> the file stored in some kind of external database or a separate index
> file.  You're not going to get the filesystem and all filesystem tools
> to understand the full "importance of this extent" metrics, as that is
> going to be different for each application, so storing a single bit of
> "importance" for every block in the filesystem is not very helpful and
> you may as well just rely on the external database/index file for this.
> 

You have a point there. We wouldn't want to clutter the fs with all kinds of complex application-specific metadata interpretation.
However, the simplicity of accessing a reduced view of the file with existing interfaces is rather appealing. It also provides a natural way to drive hints that optimize not just layout but other things such as readahead decisions, since it is a good clue to what data apps would access or need. It is even a way to shape what they access, instead of having them pull in data that won't be useful (while still preserving the ability to see and retain the full view).

As Sayan observed, layout hints can't guarantee where data will be placed, so we can't reverse-map the view just from the layout. The grade attributes are one way to specify this kind of control-plane information from an application's point of view, and they also make it easy to change the view (without having to force a reorganization on disk). Are there other ways to convey such context (persistently) that would be more broadly useful?

Another possibility is a snapshot-like approach where a second inode holds the reduced (high-grade) view, but that gets more complex, and it is trickier to preserve across copies, backups, etc.
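
As a rough illustration of the reduced view discussed above (this is not the patchset's interface): the sketch below assumes the application already has the graded ranges as a list of (offset, length) pairs, for example decoded from an extended attribute whose name and encoding are left open here. The helper names merge_ranges and read_reduced_view are hypothetical.

```python
import os


def merge_ranges(ranges):
    """Sort (offset, length) ranges and coalesce overlapping or adjacent ones."""
    merged = []
    for off, length in sorted(ranges):
        if length <= 0:
            continue  # ignore empty or malformed ranges
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]
            end = max(prev_off + prev_len, off + length)
            merged[-1] = (prev_off, end - prev_off)
        else:
            merged.append((off, length))
    return merged


def read_reduced_view(path, graded_ranges):
    """Return only the bytes covered by the graded ranges, in file order.

    graded_ranges is assumed to come from application metadata; how it is
    stored (xattr, index file, database) is outside this sketch.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread() may return fewer bytes than requested at end of file.
        return b"".join(
            os.pread(fd, length, off)
            for off, length in merge_ranges(graded_ranges)
        )
    finally:
        os.close(fd)
```

The full view stays available through an ordinary read of the same file; only the list of ranges changes when the application regrades the content.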

> 
> What you are really interested in is having the ability to provide hints
> for the filesystem block allocator to store in different storage classes
> within the same file, and (potentially) some way to retrieve the current
> storage class upon request.
> 
> That said, the first part (requesting specific storage classes during
> write) could be achieved by enhancing the StreamID/Lifetime patches to
> allow specifying different hints for each write.  I think this had been
> proposed at one time, but there wasn't any proposed use case for having
> different storage classes within the same file, but now there is.
> 
> As for the interface for determining how the file is currently laid out,
> I think that the FIEMAP ioctl could potentially be used for this.  It
> will tell you the block number for each extent of the file, which could
> be mapped to a different storage class if you are doing the mapping game
> with LVM.  It is also possible to have FIEMAP return the device to
> the caller (as Lustre does) if the filesystem can manage multiple devices.
> I think that would be useful for XFS (realtime volume), Btrfs (can use
> multiple devices directly), and potentially ext4 if someone added the
> ability to use multiple devices directly.
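
For reference, a minimal sketch of the FIEMAP query described above, written in Python for brevity. The ioctl number and struct layouts follow the kernel's uapi headers (<linux/fs.h>, <linux/fiemap.h>). The names parse_extents, fiemap and storage_class are hypothetical, and the tier table passed to storage_class is an assumption standing in for whatever the actual LVM mapping is.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B          # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1              # sync the file before mapping
FIEMAP_EXTENT_LAST = 0x1            # flag set on the file's final extent
HDR = struct.Struct("=QQIIII")      # struct fiemap header (32 bytes)
EXT = struct.Struct("=QQQQQIIII")   # struct fiemap_extent (56 bytes)


def parse_extents(buf):
    """Decode a FIEMAP result buffer into (logical, physical, length, flags)."""
    _, _, _, mapped, _, _ = HDR.unpack_from(buf, 0)
    extents = []
    for i in range(mapped):
        fields = EXT.unpack_from(buf, HDR.size + i * EXT.size)
        logical, physical, length = fields[0], fields[1], fields[2]
        flags = fields[5]
        extents.append((logical, physical, length, flags))
    return extents


def fiemap(fd, max_extents=128):
    """Ask the filesystem for up to max_extents extents of the whole file."""
    buf = bytearray(HDR.size + max_extents * EXT.size)
    HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC,
                  0, max_extents, 0)
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)  # kernel fills buf in place
    return parse_extents(buf)


def storage_class(physical, tiers):
    """Map a physical byte offset to a tier name.

    tiers is a hypothetical list of (start, end, name) byte ranges, e.g.
    derived from how the LVM volume concatenates fast and slow devices.
    """
    for start, end, name in tiers:
        if start <= physical < end:
            return name
    return "unknown"
```

A caller would open the file, call fiemap(fd), and pass each extent's physical offset to storage_class. Whether the ioctl succeeds depends on the filesystem; some do not implement FIEMAP.
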
> 
> Cheers, Andreas
> 
> 
> > On Mon, Apr 9, 2018 at 9:33 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> >> On Apr 6, 2018, at 4:27 PM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
> >>> The other thing to consider is whether it makes any sense at all to
> >>> solve this problem by having a single file system where part of the
> >>> storage is DAX, and part is not.  Why not just have two file systems,
> >>> one which is 100% DAX, and another which is 100% HDD/SSD, and store
> >>> the data in two files in two different file systems?
> >>
> >> I think there definitely *are* benefits to having both flash and HDDs
> >> (and/or other different storage classes such as RAID-10 and RAID-6) in
> >> the same filesystem namespace.  This is the premise behind bcache,
> >> XFS realtime volumes, Btrfs, etc.
> >>
> >> That said, having a hard-coded separation of flash vs. disks does not
> >> make sense, even from an intermediate development point of view. There
> >> definitely should be a block-device interface for querying what the
> >> actual layout is, perhaps something like the SMR zones?
> >>
> >> Alternately, ext4 could add something akin to the realtime volume in
> >> XFS, where it can directly address multiple storage devices to handle
> >> different storage classes, but that would need at least some amount of
> >> development.  It was actually one of the options on the table for the
> >> early ext2resize development, to split the ext4 block groups across
> >> devices and then concatenate them logically at runtime.  That would
> >> allow e.g. some number of DAX block groups, NVMe block groups, and HDD
> >> RAID-6 block groups all in the same filesystem.  Even then, there would
> >> need to be some way for ext4 to query the storage type of the underlying
> >> devices, so that these could be mapped to the lifetime hints.
> >>
> >> Cheers, Andreas
> >>
> >>
> >>
> >>
> >>
> >
> 
> 
> Cheers, Andreas
> 

Regards
Suparna
