On Sat, Jan 22, 2011 at 3:06 AM, Matthias Urlichs <matthias@xxxxxxxxxx> wrote:
> Hi,
>> * I don't think it's possible to control where clients go for reads -
>> Ceph is pretty much optimized to the case where all nodes are in a
>> single datacenter, over a more or less homogeneous network.
>
> So, what piece(s) of code decides which replica is read from?

Clients always read from the primary OSD housing the data, for consistency purposes. We're working on implementing read-from-replicas at the librados level, but anybody doing that is probably going to have to manage their own consistency guarantees. We haven't discussed implementing it in Ceph at all yet.

>> For writes,
>> though, you'd be stuck with WAN speeds no matter what, because the data
>> has to go out to all replicas before the writes complete.
>
> Hmm. Too bad; I'd be more than happy with writing to one or two replicas,
> and trusting CEPH to manage the rest of the copying in the background.

Unfortunately there's no provision for asynchronous replication in Ceph's protocols -- it just doesn't fit within Ceph's overall design. After all, with asynchronous replication you don't have the right number of data copies at all times. This is something that's unlikely to change.

> My situation is this: my Ceph cluster is distributed over multiple
> sites. The links between sites are rather slow. :-/
>
> Storing one copy of a file at each site should not be a problem with
> a reasonable crushmap, but ..:
>
> * how can I verify on which devices a file is stored?
>
> * is it possible to teach clients to read/write from "their", i.e.
> the local site's, copy of a file, instead of pulling stuff from
> a remote site? Or does ceph notice the speed difference by itself?

From what I've read, the best (only?) solution designed for this situation is XtreemFS, and it's my standard recommendation. I haven't heard back from the people who ask about it whether it actually works, though.

That said, depending on your exact needs there are one or two possible solutions with Ceph. If most of your data spends most of its life in one data center, you could set up an OSD pool that lives in each data center and set the appropriate parts of the filesystem to use the appropriate pool (you can specify default layouts, which include the pool, on directories; the layout then applies to the whole subtree rooted at that directory). You'd have to manage off-site backups yourself in this case, perhaps via something nasty like rsyncing across the FS at night? Then the current copy of the data would always be available (albeit at slow speed) from anywhere, and you'd have local backups and off-site nightlies. I'm not sure how the metadata cluster would handle this in terms of dividing authority intelligently, but hopefully it's smart enough, or could be adjusted to do so reasonably easily.
-Greg
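
P.S. For illustration only, here's a rough, untested sketch of the per-directory pool idea. It assumes a CephFS mount that exposes directory layouts as virtual extended attributes; the mount point, directory names, and pool names are all made-up placeholders, and each pool would need a CRUSH rule confining it to one site and to be usable by the filesystem.

import os

# Sketch only: pin subtrees of a CephFS mount to site-local pools by
# setting the directory layout xattr. /mnt/ceph, the directories, and
# the pool names below are placeholders; the pools must already exist.
SITES = {
    "/mnt/ceph/site-a": "site-a-pool",  # pool whose CRUSH rule keeps data in site A
    "/mnt/ceph/site-b": "site-b-pool",  # pool whose CRUSH rule keeps data in site B
}

for directory, pool in SITES.items():
    # New files created anywhere under `directory` inherit this layout,
    # so their objects land in the site-local pool. Files that already
    # exist keep whatever layout they were created with.
    os.setxattr(directory, "ceph.dir.layout.pool", pool.encode())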