Re: Deployment with Xen

David Turner <drakonstein@xxxxxxxxx> · Wed, 14 Feb 2018 15:47:58 +0000

First off, to answer your questions about mons: you need to understand that they work in a Paxos quorum.  That means a majority of the mons must agree that they are in charge.  This is why an even number of mons is a bad idea: the mons can split into two equal halves, neither of which holds a majority.  For this case, let's say you have 3 mons.  2 of them need to be up and communicating for them to agree that they can respond to clients.  If the third mon is online, but networking troubles keep it from communicating with the other 2 mons, it will realize that it isn't part of the quorum and will refuse to answer anyone that asks it questions.  I think there might be some logic for allowing 1 mon to manage the cluster, but I think that works best if the other mons shut down cleanly and tell the remaining mon that they are going offline, so it isn't up to a vote who is in charge.
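To put numbers on the majority rule, here is a small sketch.  It only does the arithmetic of a strict majority; the function names are mine and nothing here talks to Ceph.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest number of mons that forms a strict majority."""
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many mons can be down while a majority can still form."""
    return num_mons - quorum_size(num_mons)


# Print the trade-off for small monitor counts.
for n in range(1, 7):
    print(f"{n} mons: quorum needs {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} down")
```

Going from 3 to 4 mons raises the quorum size from 2 to 3 but still only survives 1 mon being down, so the extra mon buys no additional failure tolerance.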

Lifecycle of a client and a mon.  When a client first communicates with a Ceph cluster, it uses the mon_host setting in its ceph.conf file to know who the mons are.  It goes through the list until it finds one that will authoritatively respond for the cluster and give it the osd map.  Now that it has an osd map, it can start communicating with all of the osds in the cluster: reading, writing, mounting, etc.  This is usually where a client stops talking to mons.

As a client talks with osds, the osds will respond with updated maps if there are any.  This change was made in the Hammer release of Ceph.  Before that, all map updates were handled by the mons, and the burden was large enough to keep clusters from growing beyond about 1,000 osds because the mons couldn't handle managing the maps for any more osds.  In Hammer, and still today, osds update each other's osd maps as they communicate with each other.  If anything is confused as to which map to use, it still asks the mon and the mon tells it the right one.

If a mon goes down, the remaining entries in mon_host are used to know who to contact.  The client might fail on a down mon, but it will retry and reach one that is online.  Mons are the keepers of cephx auth keys and map versions, but other than that, they really don't impact performance much.  Everything else is handled by the algorithms in the osd map that tell a client where all objects and osds are in the cluster, and the majority of map updates come from communication with the osds.
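As a toy model of that fallback (not how the real client libraries pick a mon; the addresses and the `can_answer` callback are made up for illustration), walking the mon_host list looks roughly like this:

```python
from typing import Callable, Optional, Sequence


def first_responding_mon(mon_hosts: Sequence[str],
                         can_answer: Callable[[str], bool]) -> Optional[str]:
    """Walk the mon list in order and return the first mon that answers."""
    for host in mon_hosts:
        if can_answer(host):
            return host
    return None  # no mon in the list would answer for the cluster


# Hypothetical mon_host list; the first mon is down or out of quorum.
mon_hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
in_quorum = {"10.0.0.2", "10.0.0.3"}

chosen = first_responding_mon(mon_hosts, lambda host: host in in_quorum)
print(chosen)
```

The client only needs one mon that is in quorum to hand it the osd map; after that, reads and writes go to the osds directly.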

Back to VMs and librbd vs krbd (which is the /dev/rbd* devices).  The kernel driver does not have feature parity with Ceph.  Even the latest kernel does not support all Ceph RBD features, and you will have to disable the unsupported ones in your cluster.  That means losing things like object map, which is how Ceph keeps track of which objects do and don't exist in an RBD.  Without object map, Ceph has to assume that every object that can exist in an RBD does exist.  With object map, if you delete an RBD, Ceph issues a delete only to the objects that exist; without it, Ceph has to attempt to delete every object regardless of whether it exists.  Checking the used space of an RBD with object map is instant; checking it without object map can take several minutes on RBDs that are only 100 GB in size (this is even worse if you are using snapshots, as it has to check every object that can possibly exist on the RBD itself as well as on the snapshots).
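To give a feel for the scale, here is a back-of-the-envelope sketch.  It assumes the default 4 MiB RBD object size and treats 100 GB as 100 GiB; the function names and the figure of 3,000 existing objects are mine, purely for illustration.

```python
OBJECT_SIZE = 4 * 1024**2  # assumption: default 4 MiB RBD object size


def max_objects(image_bytes: int, object_size: int = OBJECT_SIZE) -> int:
    """Number of objects an image of this size can possibly contain."""
    return -(-image_bytes // object_size)  # ceiling division


def delete_requests(image_bytes: int, existing_objects: int,
                    has_object_map: bool) -> int:
    """Deletes issued when removing the image, per the description above."""
    if has_object_map:
        return existing_objects       # only objects known to exist
    return max_objects(image_bytes)   # every object that could exist


image = 100 * 1024**3  # 100 GiB image
print(max_objects(image))                    # 25600 possible objects
print(delete_requests(image, 3000, True))    # with object map
print(delete_requests(image, 3000, False))   # without object map
```

A mostly empty 100 GiB image still costs 25,600 operations without object map, and the usage check repeats that walk for each snapshot.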

librbd has feature parity with Ceph because it is updated with every release and carries the same version as Ceph.  krbd is still trying to implement RBD features released over a year ago.  I prefer to use the Ceph libraries as often as possible, then the fuse drivers (except rbd-fuse, because it is slower than dirt), and only if I have no other choice will I use the kernel drivers.  When it comes to choosing a hypervisor for hosting VMs on RBDs, there is no question in my mind that I would only look at options that use librbd.

On Tue, Feb 13, 2018 at 6:13 PM Egoitz Aurrekoetxea <egoitz@xxxxxxxxxx> wrote:
Hi David!!
Thanks a lot for your answer. But what happens when you have, say, two or more monitors and one of them becomes unresponsive? Is another one used after a timeout? What happens when a client wants to access some data, needs to query a monitor to find out where that data is, and the monitor does not answer? Is a monitor that becomes unresponsive skipped for subsequent queries about where data lives in the cluster?

So, put another way: in terms of performance, you wouldn't use any solution that doesn't go through librbd? Is performance poor when using mounted /dev/rbdX devices? Or do you mean in terms of data integrity?

I was planning to use Xen with Ceph, but after your advice... 😀 Would you definitely go with KVM?

Thanks a lot again 😉
Cheers,

Egoitz,

El 13 feb 2018, a las 20:19, David Turner <drakonstein@xxxxxxxxx> escribió:

Monitors are not required for accessing data from the Ceph cluster.  Clients will ask a monitor for a current OSD map and then use that OSD map to communicate with the OSDs directly for all reads and writes.  The map includes the crush map, which has all of the information a client needs to know where every object is in the cluster.  Having 3 mons is a good number for small deployments.  5 mons give better redundancy in the monitor quorum.  Always avoid an even number of mons.
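The idea that a client can compute object locations from a map, instead of asking a server, can be sketched with a toy.  This is NOT CRUSH: it uses plain rendezvous (highest-random-weight) hashing as a stand-in, and the osd names and object name are made up.

```python
import hashlib
from typing import List, Sequence


def place(object_name: str, osds: Sequence[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` osds for an object using rendezvous hashing.

    Every client holding the same osd list computes the same answer,
    so no lookup service sits in the data path.
    """
    def score(osd: str) -> str:
        return hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()

    return sorted(osds, key=score, reverse=True)[:replicas]


# Hypothetical osd list standing in for what a client learns from the map.
osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
print(place("example-object", osds))
```

Two clients with the same list get the same placement regardless of the order they hold it in, which is the property that lets reads and writes bypass the mons.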

librbd is definitely the way to go for accessing RBDs from a hypervisor, as opposed to fuse or krbd.  For a quick and easy hypervisor using Ceph, I like Proxmox.  It natively has the ability to use KVM with Ceph without you having to configure it yourself.  It comes with a nice GUI as well, where you can see the console screen for your VMs.  It also has a fairly simple guide for clustering hypervisors together to provide HA support for your VMs.  For larger scale VM deployments, OpenStack is probably the way I would go.

On Tue, Feb 13, 2018 at 2:11 PM Egoitz Aurrekoetxea <egoitz@xxxxxxxxxx> wrote:
Good afternoon,

As I'm new to Ceph, I was wondering what the most proper way would be to use it with the Xen hypervisor (with a plain Linux installation, CentOS for instance). I have read that the least proper one is to just mount the /dev/rbdX device at a mount point and expose that space to the hypervisor, but it looks pretty easy and seems stable. It doesn't seem to perform badly... Is it better to use, for instance, librbd with KVM? Does it perform better?

By the way, it seems to use the monitor node in order to access the space in the OSD cluster. I have also read that Ceph was designed with no single points of failure in mind, but is it possible to configure several monitor nodes, so that after a very short timeout or similar the file system is accessed through the other nodes? What would be the most proper way of configuring this to keep a machine from losing its storage if the monitor fails? Could you please point me in the right direction? Perhaps with several monitors or....

By the way, if you think it would be better to use another hypervisor or config (with librados or whatever) with Ceph, could you please suggest that too? Help for the newbie :p :) :)

Best regards,

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
