We are also discussing this internally and have come up with an idea to
work around it (only for the RBD case; we haven't thought about the object
store yet), but it is not yet tested. If Mark and Greg can provide some
feedback, that would be great.

We are trying to write a script to generate some pools: for rack A there
is a pool A, whose CRUSH ruleset chooses an OSD in rack A as the primary.
So if we have 10 racks, we will have 10 pools and 10 rules.

When a VM is migrated to another rack, or a volume is detached and
attached to another VM hosted in another rack, a data migration is needed.
We are thinking about how to smooth such a migration.

Sent from my iPhone

On 2013-4-13, at 0:20, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

> I was in the middle of writing a response to this when Mark's email
> came in, so I'll just add a few things:
>
> On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
>> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>>
>>> As I understand it, in Ceph one can cluster storage nodes, but otherwise
>>> every node is essentially identical, so if three storage nodes have a file,
>>> ceph randomly uses one of them.
>>
>>
>> Ceph clusters have the concept of pools, where each pool has a certain
>> number of placement groups. Placement groups are just collections of
>> mappings to OSDs. Each PG has a primary OSD and a number of secondary ones,
>> based on the replication level you set when you make the pool. When an
>> object gets written to the cluster, CRUSH will determine which PG the data
>> should be sent to. The data will first hit the primary OSD and then be
>> replicated out to the other OSDs in the same placement group.
>>
>> Currently reads always come from the primary OSD in the placement group
>> rather than a secondary even if the secondary is closer to the client. I'm
>> guessing there are probably some tricks that could be played here to best
>> determine which machines should service which clients, but it's not exactly
>> an easy problem.
>> In many cases spreading reads out over all of the OSDs in
>> the cluster is better than trying to optimize reads to only hit local OSDs.
>> Ideally you probably want to prefer local OSDs first, but not exclusively.
>
> In addition to just determining the locality (which we've started on
> via external interfaces), this has a number of consistency challenges
> associated with it. The infrastructure we have to allow reading from
> non-primaries tends to involve clients having different consistency
> expectations, and it's not fully explored yet or set up so that
> clients can choose to read from a specific non-primary -- the options
> currently are "local if available and we can tell", "random", and
> "primary".
>
>
>>> This is not efficient use of network resources in a distributed data center.
>>> Or even in a multi-rack situation.
>>>
>>> I want to prefer accessing nodes which are "local".
>>> The client in rack A should prefer to read from the storage nodes that are
>>> also in rack A.
>>> Ditto for rack B.
>>> Ditto for s/rack/data center/.
>
> I do want to ask if you're sure this is as useful as you think it is.
> There are use cases where it would be, but since writes have to
> traverse these links (at a multiple of the actual write count) as well
> you should be very certain. :)
>
>>> As far as I understand, the Ceph clients can't do that.
>>> (Nor can Ceph nodes among each other, but I care less about that, as most
>>> traffic is reading data.)
>>>
>>> I think this is an important feature for many high-reliability situations.
>>>
>>> What would be the next steps to get this feature, assuming I don't have time
>>> to implement it myself? Persistently annoy this mailing list that people
>>> need it? Offer to pay for implementing it? Shut up and look for some other
>>> solution -- which I already did, but I didn't find any that's as good as
>>> Ceph, otherwise?
>>
>>
>> I don't really have that much insight into the product roadmap, but I assume
>> that if you spoke to some of our business folks about paying for development
>> work you'd at least get a response.
>
> Yeah. It's not a feature in large enough demand right now that we can
> see to be worth bumping up over other things, but I don't think
> anybody's opposed to it existing. As with Mark I have no idea if
> you're best off asking us or others to do things for money, but it
> would certainly have to go through business channels. (If somebody
> outside Inktank did want to implement this feature, I'd love to talk
> to them about it on an informal but ongoing basis during development.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
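
For illustration, the per-rack rule idea at the top of this message could
be sketched roughly as below. This is only a sketch under assumptions, not
a tested script: the bucket names (rackA, rackB, "default"), the rule
names, the ruleset numbering, and the helper functions are all
hypothetical, and the rule body simply follows the usual "take one leaf
from a preferred subtree, emit, then take the rest from elsewhere" pattern
for decompiled CRUSH map text. Nothing in it stops the second pass from
landing in the same rack (or host) as the primary, so a real map would
need to deal with that.

```python
def rack_primary_rule(rack, ruleset_id, root="default"):
    """Build CRUSH rule text whose first (primary) OSD comes from `rack`.

    `rack` and `root` are assumed to be bucket names that already exist
    in the CRUSH map; the names used here are hypothetical.
    """
    lines = [
        f"rule {rack}_primary {{",
        f"    ruleset {ruleset_id}",
        "    type replicated",
        "    min_size 1",
        "    max_size 10",
        # First pass: one leaf from the preferred rack becomes the primary.
        f"    step take {rack}",
        "    step chooseleaf firstn 1 type host",
        "    step emit",
        # Second pass: remaining replicas from the whole tree (assumption:
        # this may overlap with the primary's rack).
        f"    step take {root}",
        "    step chooseleaf firstn -1 type host",
        "    step emit",
        "}",
    ]
    return "\n".join(lines)


def rules_for_racks(racks, first_ruleset_id=1):
    """Return {rack name: rule text}, one rule per rack."""
    return {
        rack: rack_primary_rule(rack, first_ruleset_id + i)
        for i, rack in enumerate(racks)
    }


if __name__ == "__main__":
    # Print one rule per rack; each would be paired with its own pool.
    for rule in rules_for_racks(["rackA", "rackB"]).values():
        print(rule)
        print()
```

The generated text would still have to be merged into the decompiled CRUSH
map by hand, and one pool per rack created and pointed at the matching
ruleset; neither step is shown here.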