Re: ceph and efficient access of distributed resources

I was in the middle of writing a response to this when Mark's email
came in, so I'll just add a few things:

On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>
>> As I understand it, in Ceph one can cluster storage nodes, but otherwise
>> every node is essentially identical, so if three storage nodes have a
>> file,
>> ceph randomly uses one of them.
>
>
> Ceph clusters have the concept of pools, where each pool has a certain
> number of placement groups.  Placement groups are just collections of
> mappings to OSDs.  Each PG has a primary OSD and a number of secondary ones,
> based on the replication level you set when you make the pool. When an
> object gets written to the cluster, CRUSH will determine which PG the data
> should be sent to.  The data will first hit the primary OSD and then be
> replicated out to the other OSDs in the same placement group.
>
> Currently reads always come from the primary OSD in the placement group
> rather than a secondary even if the secondary is closer to the client. I'm
> guessing there are probably some tricks that could be played here to best
> determine which machines should service which clients, but it's not exactly
> an easy problem.  In many cases spreading reads out over all of the OSDs in
> the cluster is better than trying to optimize reads to only hit local OSDs.
> Ideally you probably want to prefer local OSDs first, but not exclusively.

Beyond just determining locality (which we've started on via external
interfaces), reading from non-primaries raises a number of consistency
challenges. The infrastructure we have for non-primary reads tends to
involve clients that have different consistency expectations, and it
isn't fully explored yet, nor is it set up so that clients can choose
to read from a specific non-primary. The options currently are
"local if available and we can tell", "random", and "primary".

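To make those three options concrete, here is a rough Python sketch of
how a reader might pick an OSD from a placement group's replica list.
This is an illustration only, not Ceph code: the `Osd` type, the
`choose_read_osd` function, the `rack` label, and the fallback to the
primary when no local replica is known are all assumptions made for
the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Osd:
    """Hypothetical stand-in for an OSD: an id plus a locality label."""
    osd_id: int
    rack: str


def choose_read_osd(acting, policy, client_rack=None, rng=random):
    """Pick the OSD to read from. acting[0] is treated as the primary."""
    if not acting:
        raise ValueError("placement group has no OSDs")
    if policy == "primary":
        return acting[0]
    if policy == "random":
        return rng.choice(acting)
    if policy == "local":
        # "local if available and we can tell": only usable when the
        # client's location is known and some replica shares it.
        if client_rack is not None:
            for osd in acting:
                if osd.rack == client_rack:
                    return osd
        # Assumed fallback for this sketch: read from the primary.
        return acting[0]
    raise ValueError(f"unknown policy: {policy!r}")


pg = [Osd(0, "A"), Osd(1, "B"), Osd(2, "C")]
print(choose_read_osd(pg, "primary").osd_id)
print(choose_read_osd(pg, "local", client_rack="B").osd_id)
```

Note that the sketch says nothing about consistency, which is the hard
part described above; it only shows the selection step.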

>> This is not efficient use of network resources in a distributed data
>> center.
>> Or even in a multi-rack situation.
>>
>> I want to prefer accessing nodes which are "local".
>> The client in rack A should prefer to read from the storage nodes that are
>> also in rack A.
>> Ditto for rack B.
>> Ditto for s/rack/data center/.

I do want to ask if you're sure this is as useful as you think it is.
There are use cases where it would be, but since writes also have to
traverse these links (at a multiple of the actual write count, because
of replication), you should be very certain. :)
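
A back-of-the-envelope model can help with that check. The Python
sketch below is a simplification under stated assumptions, not a
measurement of Ceph: it assumes one replica per rack, a client whose
rack holds one of the replicas, and a primary that is equally likely
to be any replica. The function name and parameters are made up for
the example.

```python
def cross_rack_bytes(read_bytes, write_bytes, replicas, local_reads):
    """Expected bytes crossing rack boundaries under the assumptions above."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Chance that the primary sits in a different rack than the client.
    p_remote_primary = (replicas - 1) / replicas
    # Writes: client -> primary (remote with that probability), then
    # primary -> each secondary, all of which live in other racks.
    write_traffic = write_bytes * (p_remote_primary + (replicas - 1))
    # Reads: free if served by the local replica, otherwise they cross
    # racks whenever the primary is remote.
    read_traffic = 0.0 if local_reads else read_bytes * p_remote_primary
    return read_traffic + write_traffic


# Compare primary-only reads with local reads for a sample workload.
for local in (False, True):
    print(local, cross_rack_bytes(900, 100, replicas=3, local_reads=local))
```

In this model local reads remove the read share of the cross-rack
traffic but leave the write share untouched, so the benefit depends
heavily on the read/write mix.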

>> As far as I understand, the Ceph clients can't do that.
>> (Nor can Ceph nodes among each other, but I care less about that, as most
>> traffic is reading data.)
>>
>> I think this is an important feature for many high-reliability situations.
>>
>> What would be the next steps to get this feature, assuming I don't have
>> time
>> to implement it myself? Persistently annoy this mailing list that people
>> need it? Offer to pay for implementing it? Shut up and look for some other
>> solution -- which I already did, but I didn't find any that's as good as
>> Ceph, otherwise?
>
>
> I don't really have that much insight into the product roadmap, but I assume
> that if you spoke to some of our business folks about paying for development
> work you'd at least get a response.

Yeah. Right now we don't see enough demand for this feature to justify
bumping it up over other things, but I don't think anybody's opposed
to it existing. Like Mark, I have no idea whether you're best off
asking us or others to do this for money, but it would certainly have
to go through business channels. (If somebody outside Inktank did want
to implement this feature, I'd love to talk to them about it on an
informal but ongoing basis during development.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com