Re: [Patch 0/4] RFC : Support for data gradation of a single file.

Sayan Ghosh <sgdgp.2014@xxxxxxxxx> · Tue, 10 Apr 2018 15:16:57 +0530

Hello,

Thank you, Andreas and Theodore, for taking the time to review the
patchset and for your comments and suggestions.
I describe the problem statement in this mail.

The goal of our project is, broadly, to support data gradation of a
single file. If the contents of a file are graded by importance, an
application may need to view or analyse only the important portions.
It also helps if the important portions can be accessed quickly
without going through the entire file. For example, consider a
learning video with indexing/annotations, in which the annotations
mark the important parts of the video. A learner may be interested in
only those parts, and it helps if he can be given a reduced view
containing just the parts he’s interested in. ACM Webinar videos,
where a user can navigate using a table of contents or a phrase
cloud, are an example of such videos.

The link below is one such video:
https://videoken.com/video-detail?videoID=IpGxLWOIZy4&videoDuration=1853&videoName=A%20Friendly%20Introduction%20to%20Machine%20Learning&keyword=A%20Friendly%20Introduction%20to%20Machine%20Learning

There’s a word cluster associated with the video, and upon clicking on
a word the red-black arrowheads (pointing down) mark the portions
where the word is used. A more sophisticated version of the same would
be to give the user a complete reduced clip containing the annotated
portions for the word cluster, rather than the user having to manually
click on each portion he’s interested in.

This kind of video file can serve as an input to our system, where we
know which parts of the file have been marked. Our goal then is to
place the respective important blocks properly and provide a reduced
view of just the important parts of the file. Placing the important
blocks in a faster tier (SSD, PM, etc.) greatly improves the
performance of reading and writing the file.
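
To make the placement idea concrete, here is a minimal sketch (not the
patchset code) of how marked byte ranges could be translated into block
numbers and a per-block tier choice. The 4096-byte block size, the
"fast"/"slow" tier names and both function names are assumptions made
only for this illustration.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size, for illustration only


def important_blocks(ranges, block_size=BLOCK_SIZE):
    """Return the sorted block numbers touched by the marked byte ranges.

    ranges: iterable of (offset, length) pairs in bytes.
    """
    blocks = set()
    for offset, length in ranges:
        if length <= 0:
            continue  # nothing marked in an empty range
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return sorted(blocks)


def assign_tiers(file_size, ranges, block_size=BLOCK_SIZE):
    """Label every block of the file "fast" (important) or "slow"."""
    total = (file_size + block_size - 1) // block_size
    hot = set(important_blocks(ranges, block_size))
    return ["fast" if b in hot else "slow" for b in range(total)]
```

A block that is only partly covered by a marked range is treated as
important here; a real allocator might make a different choice.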

As stated above, we are interested in providing a reduced view of a
single file in which important and unimportant portions are
interspersed, so splitting it across two filesystems, one holding the
important parts and one the unimportant parts, would not serve our
objective. Say that, in the example, a user wants the full view of
the video. Splitting the video across two filesystems would not be
ideal here, as the user needs both the important and the unimportant
blocks. Creating a sparse layout to overlay two files would be
unnecessarily complicated. It would hence be ideal if the file
carried the grading information as metadata (extended attributes in
our case), and that information were used to place and fetch blocks
as necessary.
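
As a rough userspace illustration of this idea (a sketch under stated
assumptions, not what the patches do in the kernel), the grading
information can be packed into a byte string suitable for an extended
attribute value and later used to read only the important extents. The
attribute name "user.grade.important", the (offset, length) encoding
and all function names below are hypothetical.

```python
import struct

# Hypothetical attribute name; the patchset may use a different one.
XATTR_NAME = "user.grade.important"

# Each extent is stored as two little-endian unsigned 64-bit integers.
_EXTENT = struct.Struct("<QQ")


def encode_extents(extents):
    """Pack (offset, length) pairs into bytes for an xattr value."""
    return b"".join(_EXTENT.pack(off, length) for off, length in extents)


def decode_extents(blob):
    """Unpack an xattr value back into a list of (offset, length) pairs."""
    return [_EXTENT.unpack_from(blob, i)
            for i in range(0, len(blob), _EXTENT.size)]


def reduced_view(path, extents):
    """Read only the important extents and return them concatenated."""
    chunks = []
    with open(path, "rb") as f:
        for off, length in extents:
            f.seek(off)
            chunks.append(f.read(length))
    return b"".join(chunks)
```

On Linux, the packed value could be attached to the file with
os.setxattr(path, XATTR_NAME, blob) and read back with os.getxattr,
provided the filesystem supports user extended attributes. The full
view remains an ordinary read of the same file.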

Regards,
Sayan Ghosh

On Mon, Apr 9, 2018 at 9:33 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Apr 6, 2018, at 4:27 PM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
>> The other thing to consider is whether it makes any sense at all to
>> solve this problem by having a single file system where part of the
>> storage is DAX, and part is not.  Why not just have two file systems,
>> one which is 100% DAX, and another which is 100% HDD/SSD, and store
>> the data in two files in two different file systems?
>
> I think there definitely *are* benefits to having both flash and HDDs
> (and/or other different storage classes such as RAID-10 and RAID-6) in
> the same filesystem namespace.  This is the premise behind bcache,
> XFS realtime volumes, Btrfs, etc.
>
> That said, having a hard-coded separation of flash vs. disks does not
> make sense, even from an intermediate development point of view.  There
> definitely should be a block-device interface for querying what the
> actual layout is, perhaps something like the SMR zones?
>
> Alternately, ext4 could add something akin to the realtime volume in
> XFS, where it can directly address multiple storage devices to handle
> different storage classes, but that would need at least some amount of
> development.  It was actually one of the options on the table for the
> early ext2resize development, to split the ext4 block groups across
> devices and then concatenate them logically at runtime.  That would
> allow e.g. some number of DAX block groups, NVMe block groups, and HDD
> RAID-6 block groups all in the same filesystem.  Even then, there would
> need to be some way for ext4 to query the storage type of the underlying
> devices, so that these could be mapped to the lifetime hints.
>
> Cheers, Andreas