Re: ceph block pointer abstractions

Thanks Greg. It sounds like you get fine granularity, and can even provide the partitioning scheme yourself, as is the case with RADOS.

I'll take your advice and experiment without any clustering optimizations (at least not at the application level) for my use case, and dig deeper if I need to.

It's nice to know that you can have this kind of control when you need it.

(Your answer would make a good wiki post: very clear and concise.)

Thanks.

On Wed, Apr 27, 2011 at 5:02 PM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx> wrote:
> On Wednesday, April 27, 2011 at 12:42 PM, Fabio Kaminski wrote:
>> Ok, that's the way it should be. :)
>>
>> But to make the question a little more specific: what is the data
>> partitioning scheme between nodes, and how can a user control it? At the
>> block level? At the file level?
>> Suppose I have aggregations that I want to keep together on nodes, even
>> if they are replicated across several nodes, but always together, to cut
>> down on network round trips.
>
> The POSIX-compliant Ceph filesystem is built on top of the RADOS object store. RADOS places objects into "placement groups" (PGs), and puts those PGs on OSD storage nodes using a pseudo-random placement algorithm called CRUSH. Placement is generally based on the object's name (in the Ceph filesystem the names are derived from the file's inode number and which block of the file the object holds), although if you are using the RADOS libraries to place data you can specify an alternative string to hash on (look at the object_locator_t bits). This exists so that you can place different pieces of data on the same node, but if you use this interface you'll need to provide the object_locator_t every time you access the object.
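
For illustration, a minimal sketch of driving that mechanism through the librados C API: rados_ioctx_locator_set_key() sets the locator string that objects are hashed on instead of their own names. The config path, pool name, locator key, and object names below are made up, and the C call itself may be newer than this thread, so treat it as a sketch rather than a recipe:

/* build with: cc locator.c -lrados (assumes librados and its headers are installed) */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        rados_t cluster;
        rados_ioctx_t io;

        /* connect using the usual config file; the path is just an example */
        if (rados_create(&cluster, NULL) < 0 ||
            rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
            rados_connect(cluster) < 0) {
                fprintf(stderr, "could not connect to the cluster\n");
                return 1;
        }
        if (rados_ioctx_create(cluster, "data", &io) < 0) {   /* example pool name */
                fprintf(stderr, "could not open pool\n");
                rados_shutdown(cluster);
                return 1;
        }

        /* every object touched while this key is set is hashed on the key
         * rather than on its own name, so both writes land in the same PG */
        rados_ioctx_locator_set_key(io, "customer-42");

        const char *hdr = "order header";
        const char *lines = "order lines";
        if (rados_write(io, "customer-42.header", hdr, strlen(hdr), 0) < 0 ||
            rados_write(io, "customer-42.lines", lines, strlen(lines), 0) < 0)
                fprintf(stderr, "write failed\n");

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
}

The flip side, as noted above, is that the locator key becomes part of the object's identity: you have to set the same key again for every later read or delete.
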
> When using the Ceph filesystem you don't have access to the object_locator_t method of placing data, but you do have some options. Using either the libraries or the cephfs tool, you can specify the preferred PG and/or the pool a file should be placed in. (You can also use the cephfs tool on a directory, so that its placement settings apply to that directory's subtree.)
> Setting the preferred PG will keep data together: each OSD node has some PGs that are kept local to that OSD whenever it's up, which is useful if you want local reads and writes (these PGs are generally unused, but they can be valuable for things like simulating HDFS behavior). Setting just the pool lets the system distribute data properly, but you can set up the CRUSH map so that the pool always roots at a specific node, or do any number of other things to specify how you want the data laid out.
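
Similarly, a hypothetical sketch of that file-level knob from a Ceph kernel-client mount; as far as I understand this is the same layout ioctl the cephfs tool drives. The struct and ioctl numbers are copied from the kernel client's fs/ceph/ioctl.h and may differ on your kernel, and the mount path, pool id, and OSD number are made up:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define CEPH_IOCTL_MAGIC 0x97

/* mirrors struct ceph_ioctl_layout in the kernel client's fs/ceph/ioctl.h */
struct ceph_ioctl_layout {
        __u64 stripe_unit, stripe_count, object_size;
        __u64 data_pool;      /* pool the file's objects are placed in */
        __s64 preferred_osd;  /* maps the file to that OSD's local PGs */
};

#define CEPH_IOC_GET_LAYOUT _IOR(CEPH_IOCTL_MAGIC, 1, struct ceph_ioctl_layout)
#define CEPH_IOC_SET_LAYOUT _IOW(CEPH_IOCTL_MAGIC, 2, struct ceph_ioctl_layout)

int main(void)
{
        struct ceph_ioctl_layout l;

        /* the file must live on a ceph kernel mount and still be empty;
         * the path is just an example */
        int fd = open("/mnt/ceph/mydir/myfile", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* start from the inherited layout so unset fields keep their defaults */
        if (ioctl(fd, CEPH_IOC_GET_LAYOUT, &l) < 0) { perror("get_layout"); return 1; }

        l.data_pool = 3;      /* example pool id, e.g. one whose CRUSH rule roots at a specific host */
        l.preferred_osd = 0;  /* example OSD whose local PGs should hold this file */

        if (ioctl(fd, CEPH_IOC_SET_LAYOUT, &l) < 0) { perror("set_layout"); return 1; }
        return 0;
}

Changing only data_pool (and leaving preferred_osd as returned) is the milder option described above: the pool's CRUSH rule then decides where the data roots.
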
> Which interface you want to use for keeping data together depends a lot on your exact use case and your skills.
>
> Lastly, I will say this: it is often the case that trying to keep specific data co-located is not actually worth the trouble. You may have such a use case, and it's always good to have tools that support such things, but the Hadoop people, for example, have discovered that guaranteeing local writes is not actually helpful to performance in most situations. Before going through the hassle of setting up such a thing I'd be very sure that it matters to you and that the default placement is unacceptable.
> -Greg