Re: Policy based object tiering in RGW

On Tue, Apr 3, 2018 at 3:58 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Mon, 2 Apr 2018, Robin H. Johnson wrote:
>> On Mon, Apr 02, 2018 at 10:58:03AM -0700, Gregory Farnum wrote:
>> > Hmm, it sounds like you're interested in extending the RADOS
>> > cache-tier functionality for this. That is definitely a mistake; we
>> > have been backing off support for that over the past several releases.
>> > Sage has a plan for some "tiering v2" infrastructure (that integrates
>> > with SK Telecom's dedupe work) which might fit with this but I don't
>> > think it has any kind of timeline for completion.
>> Varada & others (Danny Al-Gaaf, Dan @ CERN):
>> I was thinking of the intersection of these Flipkart ideas and tiering
>> v2 for RADOS for some of what we discussed at various parts of
>> Cephalocon:
>>
>> At the core of it, it's a bottom-most rung for tiering that puts just
>> the data of a RADOS object somewhere EXTERNAL to Ceph. The metadata,
>> especially OMAP, would remain in Ceph.
>>
>> Various thoughts:
>> - External tier might be an extra copy of the data or just the ONLY copy
>>   of the data.
>> - Possible external tiers: exports to disk, export to tape, export to
>>   another cluster or cloud
>> - RGW: Exposed via S3-Glacier API functions to trigger a copy being
>>   brought back from external tier to disk tier.
>> - CephFS: revitalizing Hierarchical Storage Management calls, like IRIX
>>   XFS had, to push the data of single files out to the external tier
>>   (w/ ioctls to trigger tier transitions).
>> - Both the RGW & CephFS piece need a means to queue & process
>>   transitions asynchronously in both directions:
>>   - RGW lifecycle says 'this object IS old enough to transition to a
>>       lower storage class now'
>>   - RGW lifecycle doesn't say whether the transition has actually
>>       happened yet.
>> - RGW needs to be aware of the v2-tiering, and able to do explicit
>>   transitions of objects between tiers.
>
> The issue I come back to is that we have a layer of metadata and
> indirection above RADOS that we can use for this: CephFS inode could point
> to the external tier, and RGW's head object or bucket index could do the
> same.
>
Yeah, I am also thinking along the same lines of using the omap
metadata to indicate where the object is.
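
To make that concrete, here is a rough sketch (the omap key name, the value
format and the head object name below are made up for illustration, not an
existing RGW schema) of keeping a tier-location pointer in the head object's
omap via python-rados:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('default.rgw.buckets.data')

head_obj = 'bucket-marker__object-head'  # hypothetical head object name

# Record where the object's data has gone; key and value format are
# invented here purely to illustrate the idea.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('tier.location',),
                   (b'cloud://archive-zone/videos/match.mp4',))
    ioctx.operate_write_op(op, head_obj)

# On read, RGW would consult the pointer before touching the data.
with rados.ReadOpCtx() as op:
    it, ret = ioctx.get_omap_vals_by_keys(op, ('tier.location',))
    ioctx.operate_read_op(op, head_obj)
    for key, val in it:
        print(key, val)

ioctx.close()
cluster.shutdown()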

> Doing tiering at this level means that RGW and CephFS can be fully aware
> of the tiering without having to ask rados about the state of the world.
>
I am looking at two kinds of tiering here.

One is within the cluster, where I can have a policy on the bucket/object
saying "I want to move this data to a colder tier after a month (I am okay
with being more latent once the object is moved to a different tier, which
is inherent here)". Say I am using one of the placement targets to specify
that the initial write should place the object in an SSD tier; my policy
then tells me to move this object to a colder tier after a month. Here I
want to use tiering (v1 or v2) and move data around between pools or
different CRUSH device classes. Not all buckets can be archived, but they
are still not accessed frequently, so I want some internal tiering for
them. If the objects written to the bucket are big enough, like videos
etc., we can use an EC pool with a better k+m.
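
Just to make that first policy concrete, this is roughly the kind of S3
lifecycle rule it maps to, assuming RGW honors Transition rules to a named
storage class (the endpoint, credentials and the 'COLD_EC' class name below
are made up; the class would be mapped to the EC/HDD pool on the RGW side):

import boto3

s3 = boto3.client('s3',
                  endpoint_url='http://rgw.example.com:8080',
                  aws_access_key_id='ACCESS',
                  aws_secret_access_key='SECRET')

# "Move this data to a colder tier after a month": transition objects to
# a (hypothetical) colder storage class 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket='videos',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'cold-tier-after-a-month',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'Transitions': [{
                'Days': 30,
                'StorageClass': 'COLD_EC',  # hypothetical storage class
            }],
        }],
    })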

Similarly, I can have one more policy saying "I want to archive this data
after 3 months, maybe to a different cluster, but when needed I am okay
with latency on reads".
This archival can be within the cluster or to a different cluster. Here I
want to use cloud sync and other techniques to move the data across
clusters. In this scenario I will not be overloading RADOS, and will
depend completely on the RGW functionality.
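
And for the read-back side of that archival policy, the S3 Glacier-style
restore call Robin mentioned is the natural client-facing shape; whether and
how RGW would honor it for data pushed to another cluster or cloud is exactly
the open question. A sketch of the client view (same made-up endpoint as
above):

import boto3

s3 = boto3.client('s3',
                  endpoint_url='http://rgw.example.com:8080',
                  aws_access_key_id='ACCESS',
                  aws_secret_access_key='SECRET')

# Ask for the archived object to be staged back to the disk tier.
s3.restore_object(
    Bucket='videos',
    Key='match-2018.mp4',
    RestoreRequest={'Days': 7})

# The Restore header on a HEAD response indicates whether the staged copy
# is ready yet, so reads can tolerate the bring-back latency.
head = s3.head_object(Bucket='videos', Key='match-2018.mp4')
print(head.get('Restore'))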


> More importantly, perhaps, it means that low-level rados ops aren't
> expected to block for minutes at a time while some slow external tiering
> machinery does its thing.
>
Sure. For an external sync, that is not a valid scenario. But internal
tiering (to different pools) can overload the existing infrastructure,
whether it is tiering v1 or v2.

> In order for such tiering to work well, I would expect that RGW and CephFS
> want to drive when and how data is migrated.  Which means they can do
> explicit copying and migration.  Is there any value to having RADOS do it
> independently?  Between rados pools, I think yes; but to tape?  Glacier?
> Would you *want* to put individual 4MB objects in glacier, or wouldn't you
> prefer to copy the entire 1GB RGW video object there instead?
>
I am actually not looking at backing up to tape or another cluster using
RADOS, but it would be an interesting plugin to RGW if we want to
write/archive the data to a tape. For a Glacier kind of implementation, I
want to use cloud sync.

Varada
> There is some value in not reimplementing the same thing at multiple
> layers, but I question whether we want this external-tiering thing at the
> rados layer at all...
>
> sage