Re: Policy based object tiering in RGW

On Mon, Apr 2, 2018 at 3:28 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Mon, 2 Apr 2018, Robin H. Johnson wrote:
>> On Mon, Apr 02, 2018 at 10:58:03AM -0700, Gregory Farnum wrote:
>> > Hmm, it sounds like you're interested in extending the RADOS
>> > cache-tier functionality for this. That is definitely a mistake; we
>> > have been backing off support for that over the past several releases.
>> > Sage has a plan for some "tiering v2" infrastructure (that integrates
>> > with SK Telecom's dedupe work) which might fit with this but I don't
>> > think it has any kind of timeline for completion.
>> Varada & others (Danny al-Graf, Dan @ CERN):
>> I was thinking about the intersection of these FlipCart ideas and
>> tiering v2 for RADOS, covering some of what we discussed at various
>> points during Cephalocon:
>>
>> At its core, it's a bottom-most rung for tiering that puts just the
>> data of a RADOS object somewhere EXTERNAL to Ceph. The metadata,
>> especially OMAP, would remain in Ceph.
>>
>> Various thoughts:
>> - External tier might be an extra copy of the data or just the ONLY copy
>>   of the data.
>> - Possible external tiers: export to disk, export to tape, export to
>>   another cluster or cloud
>> - RGW: exposed via S3-Glacier API functions to trigger a copy being
>>   brought back from the external tier to the disk tier (see the sketch
>>   after this list).
>> - CephFS: revitalizing Hierarchical Storage Management calls, like IRIX
>>   XFS had, to push the data of single files out to the external tier
>>   (w/ ioctls to trigger tier transitions).
>> - Both the RGW & CephFS pieces need a means to queue & process
>>   transitions asynchronously in both directions:
>>   - RGW lifecycle says 'this object IS old enough to transition to a
>>     lower storage class now'.
>>   - RGW lifecycle doesn't say whether the transition has actually
>>     happened yet.
>> - RGW needs to be aware of the v2-tiering, and able to do explicit
>>   transitions of objects between tiers.
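For concreteness, here is a rough sketch of the S3-side shape such an
interface could take, using boto3 against a hypothetical RGW endpoint. The
endpoint, bucket name, and the use of GLACIER as a stand-in storage class
for the external tier are assumptions for illustration, not implemented RGW
behavior:

    # Sketch only: the lifecycle rule expresses "old enough to transition",
    # and restore_object is the Glacier-style "bring a copy back" request.
    # Whether and when either actually happens is asynchronous, which is
    # exactly the queueing problem described in the list above.
    import boto3

    s3 = boto3.client('s3', endpoint_url='http://rgw.example.com:8080')

    # Lifecycle: transition objects older than 90 days to a lower tier.
    s3.put_bucket_lifecycle_configuration(
        Bucket='videos',
        LifecycleConfiguration={'Rules': [{
            'ID': 'push-cold-data-out',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            # GLACIER here stands in for whatever external tier RGW exposes.
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
        }]},
    )

    # Restore: ask for a temporary copy back on the disk tier for 7 days.
    s3.restore_object(
        Bucket='videos',
        Key='big-video.mp4',
        RestoreRequest={'Days': 7},
    )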
>
> The issue I come back to is that we have a layer of metadata and
> indirection above RADOS that we can use for this: the CephFS inode could
> point to the external tier, and RGW's head object or bucket index could
> do the same.
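As a purely hypothetical illustration of that indirection (the field names
below are invented for this sketch and are not the actual RGW head-object or
bucket-index layout):

    # The head object / index entry records where the tail data lives, so
    # RGW can answer "what tier is this object in?" from its own metadata
    # without asking RADOS about the state of the world.
    head_object = {
        'key': 'big-video.mp4',
        'size': 1 * 1024**3,
        'tier': 'external',                    # vs. 'rados'
        'external_location': {                 # assumed fields, illustrative
            'endpoint': 's3://archive-cluster/videos/big-video.mp4',
            'restore_state': 'none',           # none | in-progress | restored
        },
        'omap_and_attrs': 'kept in Ceph',
    }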
>
> Doing tiering at this level means that RGW and CephFS can be fully aware
> of the tiering without having to ask rados about the state of the world.
>
> More importantly, perhaps, it means that low-level rados ops aren't
> expected to block for minutes at a time while some slow external tiering
> machinery does its thing.
>
> In order for such tiering to work well, I would expect that RGW and CephFS
> want to drive when and how data is migrated, which means they can do
> explicit copying and migration.  Is there any value to having RADOS do it
> independently?  Between rados pools, I think yes; but to tape?  Glacier?
> Would you *want* to put individual 4MB objects in glacier, or wouldn't you
> prefer to copy the entire 1GB RGW video object there instead?
>
> There is some value in not reimplementing the same thing at multiple
> layers, but I question whether we want this external-tiering thing at the
> rados layer at all...
>

Yeah. Everything that Sage said. Also, I should note that since an rgw
object is not made out of a single rados object, there's the issue of
atomicity. Moving an object's data means that you need to first copy all
its rados objects to a different pool, modify its manifest, and then
remove all its data from the original pool. Letting RADOS drive this
just won't work, or would be too complicated to coordinate.
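
A minimal sketch of that ordering, with plain dicts standing in for pools;
the helper and its field names are invented for illustration, not actual
RGW code:

    # The point is the ordering: the manifest only flips after every tail
    # object exists in the destination, and the source copies are removed
    # only after the manifest flip, so a crash at any step leaves a readable
    # object (at worst with some garbage to collect later).
    def transition_object(manifest, src_pool, dst_pool):
        # 1. Copy every rados object making up the RGW object to the new pool.
        for rados_key in manifest['rados_objects']:
            dst_pool[rados_key] = src_pool[rados_key]

        # 2. Update the manifest (in real RGW this would be a guarded,
        #    atomic head-object write, not a plain assignment).
        manifest['pool'] = 'cold'

        # 3. Only now delete the originals; before this point the old
        #    copies are merely redundant, never the only copy.
        for rados_key in manifest['rados_objects']:
            del src_pool[rados_key]

    # Toy usage:
    hot, cold = {'obj.1': b'a', 'obj.2': b'b'}, {}
    m = {'pool': 'hot', 'rados_objects': ['obj.1', 'obj.2']}
    transition_object(m, hot, cold)
    assert m['pool'] == 'cold' and not hot and len(cold) == 2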

Yehuda