Re: Basic questions

John,

Thanks for the really insightful responses!

It would be nice to know what the dominant deployment scenario is for the native case (my question (c)).
Do people usually end up with something like OCFS2 on top of RBD, or do they go with CephFS?
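
For concreteness, the RBD option I'm imagining would look roughly like the
sketch below (host, pool, image, and device names are placeholders I made up,
and I'm glossing over the usual OCFS2 cluster-stack setup):

    # create one RBD image and map it on each client node
    rbd create shared-img --size 102400 --pool rbd
    rbd map shared-img --pool rbd          # shows up as e.g. /dev/rbd0

    # format once with a cluster-aware filesystem, then mount on every client
    mkfs.ocfs2 -N 4 /dev/rbd0
    mount /dev/rbd0 /mnt/shared

versus simply mounting CephFS on each client:

    mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secret=<key>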

Thanks,
Hari


On Fri, Jul 26, 2013 at 12:59 PM, John Wilkins <john.wilkins@xxxxxxxxxxx> wrote:
(a) This is true when using ceph-deploy for a cluster. It's one Ceph
Monitor for the cluster on one node. You can have many Ceph monitors,
but the typical high availability cluster has 3-5 monitor nodes. With
a manual install, you could conceivably install multiple monitors onto
a single node for the same cluster, but this isn't a best practice
since the node is a failure domain. The monitor is part of the
cluster, not the node. So you can have thousands of nodes running Ceph
daemons that are members of the cluster "ceph." A node that has a
monitor for cluster "ceph" will monitor all Ceph OSD daemons and MDS
daemons across those thousands of nodes. That same node could also
have a monitor for cluster "deep-storage" or whatever cluster name you
choose.
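
To make that concrete, here is a rough sketch of two named clusters sharing a
node (the cluster and host names are just examples):

    # the default cluster is "ceph"; create a second cluster named "deep-storage"
    ceph-deploy new mon-node1
    ceph-deploy --cluster deep-storage new mon-node1

    # each cluster keeps its own config and keyrings, e.g.
    #   /etc/ceph/ceph.conf  and  /etc/ceph/deep-storage.conf

    # address a specific cluster with the --cluster option
    ceph --cluster ceph -s
    ceph --cluster deep-storage -s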

(b) I'm actually working on a reference architecture for Calxeda that
addresses exactly that question. My personal feeling is that having a
machine/host/chassis optimized for a particular purpose (e.g., running
Ceph OSDs) is the ideal scenario, since you can just add hardware to
the cluster to expand it. You don't need to add monitors or MDSs to
add OSDs. So my personal opinion is that it's an ideal approach. The
upcoming Calxeda offerings provide excellent value in the
cost/performance tradeoff. You get a lot of storage density and good
performance. High performance clusters--e.g., using SSDs for journals,
having more RAM and CPU power--cost more, but you still have some of
the same issues. I still don't have a firm opinion on this, but my gut
tells me that OSDs should be separate from the other daemons--build
OSD hosts with dense storage. The kernel fsync issues you run into when
monitors and OSDs share a host generally lead to performance problems. See
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#osds-are-slow-unresponsive
for examples of why making different types of processes co-resident on the
same host can hurt performance. Processes like monitors shouldn't be
co-resident with OSDs. So that you don't end up with hosts wasted on
lightweight processes like Ceph monitors, it may be ideal to place your MDS
daemons, Apache/RGW daemons, OpenStack/CloudStack services, and/or VMs on
those monitor nodes. You need to consider the CPU, RAM, disk I/O, and network
implications of co-resident applications.
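
A quick way to spot that kind of contention (just a sketch; exact commands and
output vary a bit by release):

    ceph health detail    # lists any "slow requests" and the OSDs involved
    iostat -x 2           # see whether co-resident processes saturate the disks
    top                   # check CPU/RAM pressure from neighbouring daemons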

(d) If you have three monitors, Paxos will still work. 2 out of 3
monitors is a majority. A failure of a monitor means it's down, but
not out. If it were out of the cluster, then the cluster would assume
only two monitors, which wouldn't work with Paxos. That's why 3
monitors is the minimum for high availability. 4 works too, because 3 out of
4 is also a majority. Some people prefer an odd number of monitors, since you
can never end up with exactly half of them up and half down; however, an odd
number isn't a requirement for Paxos. 3 out of 4 and 3 out of 5 both
constitute a majority.
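
Put as arithmetic, a quorum needs floor(N/2) + 1 monitors up:

    N = 3  ->  quorum 2  ->  tolerates 1 monitor down
    N = 4  ->  quorum 3  ->  tolerates 1 monitor down
    N = 5  ->  quorum 3  ->  tolerates 2 monitors down

so going from 3 to 4 monitors buys no extra failure tolerance, while going to
5 does.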





On Fri, Jul 26, 2013 at 11:29 AM, Hariharan Thantry <thantry@xxxxxxxxx> wrote:
> Hi John,
>
> Thanks for the responses.
>
> For (a), I remember reading somewhere that one can only run a max of 1
> monitor/node, I assume that that implies the single monitor process will be
> responsible for ALL ceph clusters on that node, correct?
>
> So (b) isn't really a Ceph issue, that's nice to know. Any recommendations
> on the minimum kernel/glibc version and min RAM size requirements where Ceph
> can be run on a single client in native mode? Reason I ask this is in a few
> deployment scenarios (especially non-standard like telco platforms),
> hardware gets added gradually, so it's more important to be able to scale the
> cluster out gracefully. I actually see Ceph as an alternative to SAN, using
> JBODs from machines to create a larg(ish) storage cluster. Plus, usually,
> the clients would probably be running on the same hardware as the OSD/MON,
> because space on the chassis is at a premium.
>
> (d) I was thinking about single-node failure scenarios: with 3 nodes,
> wouldn't a failure of 1 node cause Paxos to stop working?
>
>
>
> Thanks,
> Hari
>
>
>
>
>
> On Fri, Jul 26, 2013 at 10:00 AM, John Wilkins <john.wilkins@xxxxxxxxxxx>
> wrote:
>>
>> (a) Yes. See
>> http://ceph.com/docs/master/rados/configuration/ceph-conf/#running-multiple-clusters
>> and
>> http://ceph.com/docs/master/rados/deployment/ceph-deploy-new/#naming-a-cluster
>> (b) Yes. See
>> http://wiki.ceph.com/03FAQs/01General_FAQ#How_Can_I_Give_Ceph_a_Try.3F
>> Mounting Ceph via kernel modules on the same node that runs Ceph daemons
>> can cause older kernels to deadlock.
>> (c) Someone else can probably answer that better than me.
>> (d) At least three. Paxos requires a simple majority, so 2 out of 3 is
>> sufficient. See
>> http://ceph.com/docs/master/rados/configuration/mon-config-ref/#background
>> particularly the monitor quorum section.
>>
>> On Wed, Jul 24, 2013 at 4:03 PM, Hariharan Thantry <thantry@xxxxxxxxx>
>> wrote:
>> > Hi folks,
>> >
>> > Some very basic questions.
>> >
>> > (a) Can I be running more than 1 ceph cluster on the same node (assume
>> > that
>> > I have no more than 1 monitor/node, but storage is contributed by one
>> > node
>> > into more than 1 cluster)
>> > (b) Are there any issues with running Ceph clients on the same node as
>> > the
>> > other Ceph storage cluster entities (OSD/MON?)
>> > (c) Is the best way to access Ceph storage cluster in native mode by
>> > multiple clients through hosting a shared-disk filesystem on top of the
>> > RBD
>> > (like OCFS2?). What if these clients were running inside VMs? Could one
>> > then
>> > create independent partitions on top of rbd and give a partition to each
>> > of
>> > the VMs?
>> > (d) Isn't the realistic minimum for # of monitors in a cluster at least
>> > 4
>> > (to guard against one failure?)
>> >
>> >
>> > Thanks,
>> > Hari
>> >
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> John Wilkins
>> Senior Technical Writer
>> Inktank
>> john.wilkins@xxxxxxxxxxx
>> (415) 425-9599
>> http://inktank.com
>
>



--
John Wilkins
Senior Technical Writer
Inktank
john.wilkins@xxxxxxxxxxx
(415) 425-9599
http://inktank.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
