Re: Policy based object tiering in RGW

On Mon, Apr 02, 2018 at 10:28:20PM +0000, Sage Weil wrote:
> On Mon, 2 Apr 2018, Robin H. Johnson wrote:
> > At the core of it, it's a bottom-most rung for tiering that puts just
> > the data of a RADOS object somewhere EXTERNAL to Ceph. The metadata,
> > especially OMAP, would remain in Ceph.
...
> The issue I come back to is that we have a layer of metadata and 
> indirection above RADOS that we can use for this: CephFS inode could point 
> to the external tier, and RGW's head object or bucket index could do the 
> same.
CephFS & RGW have that indirection today, but other consumers might
not: DT's email-on-RADOS librmb project, for example.

> Doing tiering at this level means that RGW and CephFS can be fully aware 
> of the tiering without having to ask rados about the state of the world.
> 
> More importantly, perhaps, it means that low-level rados ops aren't 
> expected to block for minutes at a time while some slow external tiering 
> machinery does its thing.
Any implicit access to the data would be told that the data is only
available by explicit request (e.g. a Glacier-style restore or an HSM
ioctl). The explicit request itself is asynchronous: it queues the
restore action, and the consumer MUST poll for completion. This is the
only way to prevent accidents like tarring up your entire CephFS
triggering an implicit restore of everything.
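
This is the same shape as the existing S3/Glacier restore flow. As a
rough consumer-side sketch only (boto3 against an RGW endpoint; the
endpoint, bucket and key are placeholders, and whether RGW exposes
exactly this call is part of what is being discussed here):

    import time
    import boto3

    s3 = boto3.client('s3', endpoint_url='http://rgw.example.com:7480')

    # Explicitly queue the restore; an implicit GET would instead be
    # rejected (InvalidObjectState) until the restore has completed.
    s3.restore_object(
        Bucket='videos',
        Key='big-video.mp4',
        RestoreRequest={'Days': 7,
                        'GlacierJobParameters': {'Tier': 'Bulk'}},
    )

    # The request only queues the action; the consumer MUST poll.
    while True:
        head = s3.head_object(Bucket='videos', Key='big-video.mp4')
        if 'ongoing-request="false"' in head.get('Restore', ''):
            break
        time.sleep(60)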

> In order for such tiering to work well, I would expect that RGW and CephFS 
> want to drive when and how data is migrated.  Which means they can do 
> explicit copying and migration.  Is there any value to having RADOS do it 
> independently?  Between rados pools, I think yes; but to tape?  Glacier?  
> Would you *want* to put individual 4MB objects in glacier, or wouldn't you 
> prefer to copy the entire 1GB RGW video object there instead?
Say you want to write out a 1TB RGW object to the external tier
quickly: in most cases that means doing it in parallel, which in turn
means breaking the 1TB object back up into smaller units. The 4MB RADOS
object, however, is reasonable to store individually, and it already
carries checksum data at that granularity. The external tier should
make the decision: "I have 250k 4MB objects that belong to a 1TB
parent; let's group them onto a single tape, or stripe them linearly
over multiple tapes."

RGW/CephFS would have to send some sort of batching request for this,
but shouldn't otherwise care how the external tier does the work.
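
As a sketch only (none of these names exist in Ceph today; they are
assumptions about what a batching interface would need to carry), the
request hands over the parent identity plus the per-4MB-object list,
and the tier's own agent makes the grouping call:

    from collections import namedtuple

    Piece = namedtuple('Piece', 'name size')   # one 4MB RADOS object
    Tape = namedtuple('Tape', 'id free')       # external-tier medium

    # Hypothetical batch descriptor sent by RGW/CephFS.
    batch = {
        'parent': 'videos/big-video.mp4',      # logical 1TB RGW object
        'pool': 'default.rgw.buckets.data',
        'pieces': [Piece('big-video.mp4_%08d' % i, 4 * 1024 * 1024)
                   for i in range(250000)],    # ~250k x 4MB = ~1TB
    }

    def place_batch(batch, tapes):
        """External-tier policy: keep one parent's pieces together on
        a single tape if they fit, otherwise stripe them linearly
        across several tapes. The consumer never sees this choice."""
        total = sum(p.size for p in batch['pieces'])
        if total <= tapes[0].free:
            return {tapes[0].id: batch['pieces']}
        return {t.id: batch['pieces'][i::len(tapes)]
                for i, t in enumerate(tapes)}

    layout = place_batch(batch, [Tape('tape-A', 2 * 2**40),
                                 Tape('tape-B', 2 * 2**40)])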

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
