Re: Ceph distributed over slow link: possible?


 



On Fri, Jan 21, 2011 at 11:55 AM, Matthias Urlichs <matthias@xxxxxxxxxx> wrote:
> Hello ceph people,
>
> My situation is this: my Ceph cluster is distributed over multiple
> sites. The links between sites are rather slow. :-/
>
> Storing one copy of a file at each site should not be a problem with
> a reasonable crushmap, but ..:
>
> * how can I verify on which devices a file is stored?
>
> * is it possible to teach clients to read/write from "their", i.e.
> the local site's, copy of a file, instead of pulling stuff from
> a remote site? Or does ceph notice the speed difference by itself?
>
> * My crushmap looks like this:
> type 0  device
> type 1  host
> type 2  site
> type 3  root
> ... (root => 2 sites => 2 hosts each => 3 devices each)
> rule data {
>         ruleset 0
>         type replicated
>         min_size 2
>         max_size 2
>         step take root
>         step chooseleaf firstn 2 type site
>         step emit
> }
>
> but when only one site is reachable, will there be one or two
> copies of a file? If the former, how do I fix that? If the latter,
> will the copy be redistributed when (the link to) the second site
> comes back?
>

Not an expert by any stretch, but here goes:

* I don't know of a solid way to verify where your file data is going,
but if you just want to test your replication strategy, you can write
a large file to the cluster and see which OSDs grow. (OSDs store data
in a file-based structure, so 'df' on your storage nodes will actually
give you an accurate account of where the space is used.)
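The reason this kind of verification is even feasible: CRUSH computes an object's OSDs as a pure function of the object name and the cluster map, so the same question always gets the same answer, with no central lookup table involved. A toy Python sketch of that property (hash-rank placement standing in for the real hierarchy walk; all names here are made up):

```python
import hashlib

def place(obj_name, osds, replicas=2):
    """Toy stand-in for CRUSH: deterministically rank OSDs for an object
    by hashing (object, osd) pairs, then take the top `replicas`.
    Real CRUSH walks the bucket hierarchy, but the key property is the
    same: placement is a pure function of the object name and the map."""
    def score(osd):
        h = hashlib.sha256(f"{obj_name}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
print(place("myfile", osds))  # same object name -> same OSDs, every time
```

Because placement is deterministic, any client (or admin tool) can recompute where an object lives without asking the OSDs themselves.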

* I don't think it's possible to control where clients go for reads -
Ceph is pretty much optimized for the case where all nodes are in a
single datacenter, on a more or less homogeneous network. For writes,
though, you'd be stuck with WAN speeds no matter what, because the
data has to reach all replicas before a write completes. So unless
your workload is very read-heavy, your performance would still suffer
even if Ceph could read from the closest replica.
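To make the write-latency point concrete: replication fans out in parallel, but the commit is only acknowledged once every replica has the data, so the slowest link sets the floor. A minimal sketch with made-up RTT numbers:

```python
# Why writes are stuck at WAN speed with one replica per site:
# a write completes only after ALL replicas have it, so commit
# latency is governed by the slowest link. RTTs are hypothetical.
def write_latency_ms(replica_rtts):
    # replication happens in parallel, but completion waits for every ack
    return max(replica_rtts)

local_rtt, wan_rtt = 0.5, 80.0  # hypothetical: LAN hop vs. inter-site link
print(write_latency_ms([local_rtt, wan_rtt]))  # -> 80.0; the WAN dominates
```

Reads, by contrast, only need one replica, which is why a read-heavy workload is the only case where site-local reads would buy you much.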

* With that crushmap, there'll be only one copy of your data at each
site. If you want higher replication, I think you have to put another
layer in there, something like:

min_size 4
max_size 8
step take root
step choose firstn 0 type site
step chooseleaf firstn 2 type host
step emit

'chooseleaf' descends straight to the device level, so we only want it
in the last step. 'firstn 0' selects as many items as are available,
so this should place two copies of your data at each of your sites,
even if you add more sites later on (with min_size and max_size
updated accordingly). CRUSH is a pretty neat system; you could
probably get fancier with the data placement rules if you want.
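The two-step rule above can be sketched in Python to show how the selections compose: the first step takes every site, the second descends to one device under each of two distinct hosts per site, giving 2 copies per site and 4 in total with two sites. The hierarchy and the hash-ranking here are made up for illustration, not the real CRUSH algorithm:

```python
import hashlib

# Toy cluster tree: 2 sites -> 2 hosts each -> 3 devices each,
# mirroring the layout described in the original question.
tree = {
    "site-a": {"host-1": ["osd.0", "osd.1", "osd.2"],
               "host-2": ["osd.3", "osd.4", "osd.5"]},
    "site-b": {"host-3": ["osd.6", "osd.7", "osd.8"],
               "host-4": ["osd.9", "osd.10", "osd.11"]},
}

def pick(obj, items, n):
    """Deterministically rank items by hash; keep the first n (n=0 -> all)."""
    ranked = sorted(items,
                    key=lambda i: hashlib.sha256(f"{obj}:{i}".encode()).hexdigest())
    return ranked if n == 0 else ranked[:n]

def place(obj):
    devices = []
    for site in pick(obj, tree, 0):            # step choose firstn 0 type site
        for host in pick(obj, tree[site], 2):  # step chooseleaf firstn 2 type host
            devices.append(pick(obj, tree[site][host], 1)[0])  # ...down to a device
    return devices

print(place("myfile"))  # 4 devices: 2 per site, on distinct hosts
```

The design point this illustrates: because 'firstn 0' expands to however many sites exist, adding a third site later changes the replica count without rewriting the rule, which is why min_size/max_size are given as a range.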

Again: I'm hardly an expert on this, so I'm hoping that people with
more experience will come along and correct whatever glaring errors
I've made. :)

--Ravi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

