Hi Matthias, Marcel,

Thanks for the discussion yesterday on a cold storage tier [1]. I think the idea of minimizing/preventing migration of data for a particular RADOS pool while still keeping placement metadata small (i.e., no MDS) is very interesting and has lots of possible applications. I thought I'd summarize what I was suggesting yesterday in case it didn't come across well verbally.

The basic tradeoff is around placement metadata. In RADOS we have a compact OSDMap structure on the client that lets you calculate where any object is based on a simple policy (pool properties, CRUSH map) and OSD state (up/down/in/out, current IP address). If placement of an object is purely a function of the name and the cluster state, then generally data will move when the cluster changes.

To avoid that, my suggestion is to incorporate a timestamp into part of the name (say, a prefix). Placement then becomes a function of an arbitrary string, the time written (which together form the object name), and cluster state. This would normally require a metadata layer so that you can tell that 'foo' was written at time X and is actually 'X_foo'. But if we combine it with the proposed RADOS redirect mechanism, then the active storage tier would hold a zillion pointers (stored as 'foo') that point off into some cold tier with the correct name ('X_foo'). Basically, another RADOS pool becomes that metadata layer. At that point it needn't even be 'X_foo'.. it could be X-anything, as long as it is unique and has the timestamp X in there to inform placement.

For the placement piece, my suggestion is to look at the basic idea behind the original RUSH-L algorithm (reimplemented as CRUSH list buckets), originally described in this paper:

  http://pdf.aminer.org/000/409/291/replication_under_scalable_hashing_a_family_of_algorithms_for_scalable.pdf

The core idea is that at a point in time, data is distributed in a particular way. In the base case, we just hash/stripe over a set of identical nodes. Each time we deploy new gear, we "patch" the previous distribution so that some percentage of objects are instead placed on the new gear. This approach has various flaws, mainly when it comes to removing old gear, but I think the idea of patching the previous distribution can be applied here.

Currently, we do:

  object_name
  hash(object_name) % pg_num -> ps (placement seed)
  (ps, poolid) -> pgid
  crush(pgid) -> [set of osds]

Here, we could define a series of time intervals and create a new set of PGs for each interval. More like:

  (name, timestamp)
  hash(name) % interval_pg_num -> ps
  (ps, poolid, interval #) -> tpgid
  crush(tpgid) -> [set of osds]

The trick would be that for each time interval, CRUSH defines how the objects distribute. When that interval's devices fill or new hardware is deployed, we'd close out the current interval and start a new one that maps to new PGs, which CRUSH in turn maps to the new hardware.

Hmm, you could actually do this by simply creating a new RADOS pool for every interval and not changing anything in the existing code at all. As hardware in old pools fails, you'd have to include some new hardware in the mix to offload some content. There are probably some changes we could make there to avoid writing anything new to the surviving full nodes (that case is awkward to handle currently). Or there may be benefits to pulling this functionality into a new approach within CRUSH.. I'm not sure. Would need to think about it a bit more ...
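To make the naming/redirect idea a bit more concrete, here is a toy Python sketch. The dict-backed "pools", helper names, and redirect tuple format are all made up for illustration; this is not the librados API or the actual redirect design. The point is just that the active pool stores a small pointer under the user-visible name 'foo', while the data itself lives in the cold pool under a timestamp-prefixed name like 'X_foo':

  import time

  # Toy stand-ins for two RADOS pools; in reality these would be librados
  # ioctx handles and the redirect would be a RADOS-level mechanism, not a tuple.
  active_pool = {}   # hot tier: small redirect records keyed by the user-visible name
  cold_pool = {}     # cold tier: the actual data, keyed by timestamp-prefixed names

  def cold_write(name, data, now=None):
      ts = int(now if now is not None else time.time())
      cold_name = "%d_%s" % (ts, name)                     # e.g. 'X_foo'
      cold_pool[cold_name] = data                          # placement driven by (timestamp, name)
      active_pool[name] = ("redirect", "cold", cold_name)  # pointer stored as 'foo'
      return cold_name

  def cold_read(name):
      kind, pool, cold_name = active_pool[name]            # resolve 'foo' -> 'X_foo'
      assert kind == "redirect" and pool == "cold"
      return cold_pool[cold_name]

  cold_write("foo", b"some archived data", now=1400000000)
  assert cold_read("foo") == b"some archived data"

As noted above, the cold-side name needn't literally be 'X_foo'; anything unique works, as long as the timestamp can be recovered from it to inform placement.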
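And here is a rough sketch of the interval-based mapping itself, again just illustrative Python: the interval table, pg counts, OSD lists, and the hash are invented placeholders (Ceph actually uses rjenkins, and a real implementation would use per-interval CRUSH rules to pick hardware). The point is only that the write timestamp selects an interval, and (poolid, interval #, ps) plays the role of tpgid:

  import hashlib

  def stable_hash(s):
      # stable string hash; sha1 here is just for illustration
      return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "little")

  # One entry per interval: (start_time, interval_pg_num, devices deployed then).
  intervals = [
      (0,          64,  ["osd.0", "osd.1", "osd.2"]),   # original gear
      (1400000000, 128, ["osd.3", "osd.4", "osd.5"]),   # expansion, interval closed out
      (1500000000, 128, ["osd.6", "osd.7", "osd.8"]),   # current interval
  ]

  def map_object(poolid, name, timestamp, replicas=2):
      # the write timestamp picks the interval that was open at the time
      interval = max(i for i, (start, _, _) in enumerate(intervals) if timestamp >= start)
      _, interval_pg_num, osds = intervals[interval]
      ps = stable_hash(name) % interval_pg_num          # placement seed
      tpgid = (poolid, interval, ps)                    # (poolid, interval #, ps)
      # stand-in for crush(tpgid): pick replicas from that interval's hardware only
      first = stable_hash("%s.%d.%d" % (poolid, interval, ps)) % len(osds)
      return tpgid, [osds[(first + r) % len(osds)] for r in range(replicas)]

  print(map_object("cold", "foo", 1450000000))   # lands on the expansion-interval hardware

In this toy model, once an interval is closed its boundaries and device set never change, so adding new gear never moves old data; only failures within an old interval's hardware force any remapping, which is the same situation as the "new pool per interval" variant above.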
sage

[1] http://pad.ceph.com/p/hammer-cold_storage
    https://wiki.ceph.com/Planning/Blueprints/Hammer/Towards_Ceph_Cold_Storage
    http://youtu.be/FARNRvYMQJ4