Re: Importance of Stable Mon and OSD IPs

Thanks Gregory and Burkhard

In Kubernetes we use the rbd create and rbd map/unmap commands. From this perspective, are you referring to rbd itself as the client, or, once the image is created and mapped, is there a different client running inside the kernel that can receive OSD and mon updates?

My question is mainly: after we have run the rbd create and rbd map commands, does a client still exist, or is it gone? If the rbd image is mapped on a host and the OSD or mon IPs then change, what happens in that case?
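(As a point of reference, a minimal sketch of how to check this on a host; the pool and image names below are just placeholders:)

  # create and map an image
  rbd create --size 1024 rbd/test-img
  rbd map rbd/test-img

  # after mapping, the kernel rbd client stays active for as long as the
  # device exists; mapped devices are listed with:
  rbd showmapped

  # and show up in sysfs:
  ls /sys/bus/rbd/devices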

-Mayank

On Mon, Jan 29, 2018 at 10:25 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Ceph assumes monitor IP addresses are stable, as they're the identity for the monitor and clients need to know them to connect.

Clients maintain a TCP connection to the monitors while they're running, and monitors publish monitor maps containing all the known monitors in the cluster. These are pushed out to running clients over those stable connections whenever the map changes. When a client isn't connected to the cluster, it relies on the monitor IP address(es) in its ceph.conf (or supplied on the command line) to connect. I'm not sure about Kubernetes, but in OpenStack the monitor IPs need to remain stable once an RBD image is configured because they're permanently stored in the config. (Or you can update the OpenStack config data, but it takes some pretty serious doing and doesn't have good tooling.)
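(As a concrete illustration: the addresses a client uses at connect time typically come from the mon_host setting in ceph.conf; the addresses below are placeholders:)

  # /etc/ceph/ceph.conf -- initial monitor addresses for clients
  [global]
  mon_host = 10.0.0.1,10.0.0.2,10.0.0.3

  # current monitor map as the cluster sees it
  ceph mon dump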

There's more to the monitor IPs than just the clients though. Like I said, the IP is considered the monitor's identity. I'm not sure offhand what happens if you change it and then boot up an existing store; it may automatically connect or you may need to do some manual commands. Either way, the prior IP will certainly remain in the monitor map (unless you or kubernetes does something to remove it) and that means you've added a "monitor" that nobody will ever be able to connect to. Do that to all of the monitors, and they won't be able to do any paxos consensus and things will grind to a halt.
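(For reference, the manual route alluded to here usually means editing the monmap directly, roughly along these lines; this is only a sketch with placeholder names and addresses, and the monitor has to be stopped before injecting the map:)

  # extract the current monmap
  ceph mon getmap -o /tmp/monmap

  # drop the monitor's old identity and re-add it at the new address
  monmaptool --rm mon-a /tmp/monmap
  monmaptool --add mon-a 10.0.0.10:6789 /tmp/monmap

  # inject the edited map into the stopped monitor
  ceph-mon -i mon-a --inject-monmap /tmp/monmap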

In contrast, the OSD IPs don't matter at all on their own. I'd just be worried about if whatever's changing the IP also changes the hostname or otherwise causes the OSD to move around in the crush map, as that will generate a great deal of data movement.
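(A quick way to check whether an IP change also moved an OSD around in the crush map; these are standard status commands and the output will vary per cluster:)

  # crush hierarchy: the host bucket position is what triggers data movement
  ceph osd tree

  # location and current address of a single OSD (id 0 as an example)
  ceph osd find 0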
-Greg

On Fri, Jan 26, 2018 at 11:50 AM Mayank Kumar <krmayankk@xxxxxxxxx> wrote:
Resending in case this email was lost 

On Tue, Jan 23, 2018 at 10:50 PM Mayank Kumar <krmayankk@xxxxxxxxx> wrote:
Thanks Burkhard for the detailed explanation. Regarding the following:-

>>>The ceph client (librbd accessing a volume in this case) gets asynchronous notification from the ceph mons in case of relevant changes, e.g. updates to the osd map reflecting the failure of an OSD.
i have some more questions:-
1: Does the asynchronous notification for both the osdmap and the monmap come from the mons?
2: Can these asynchronous notifications be retried?
3: Is it possible for these asynchronous notifications to be lost?
4: Do the monmap and osdmap reside in kernel space or user space? The reason I am asking is: for an rbd volume that is already mounted on a host, will it continue to receive those asynchronous notifications for changes to both OSD and mon IPs? If all the mon IPs change but the mon configuration file is updated to reflect the new IPs, will an already mounted rbd volume still be able to contact the OSDs and mons, or is there some form of caching in the kernel for an already mounted rbd volume?


Some more context for why I am having all these doubts:-
We internally had a Ceph cluster with rbd volumes being provisioned by Kubernetes. With existing rbd volumes still mapped, we wiped out the old Ceph cluster and created a brand new one, but the existing rbd volumes from the old cluster remained. Any Kubernetes pods that landed on the same host as an old rbd volume could not be created because the volume failed to attach and mount. Looking at the kernel messages, we saw the following:-

-- Logs begin at Fri 2018-01-19 02:05:38 GMT, end at Fri 2018-01-19 19:23:14 GMT. --

Jan 19 19:20:39 host1.com kernel: libceph: osd2 10.231.171.131:6808 socket closed (con state CONNECTING)

Jan 19 19:18:30 host1.com kernel: libceph: osd28 10.231.171.52:6808 socket closed (con state CONNECTING)

Jan 19 19:18:30 host1.com kernel: libceph: osd0 10.231.171.131:6800 socket closed (con state CONNECTING)

Jan 19 19:15:40 host1.com kernel: libceph: osd21 10.231.171.99:6808 wrong peer at address

Jan 19 19:15:40 host1.com kernel: libceph: wrong peer, want 10.231.171.99:6808/42661, got 10.231.171.99:6808/73168

Jan 19 19:15:34 host1.com kernel: libceph: osd11 10.231.171.114:6816 wrong peer at address

Jan 19 19:15:34 host1.com kernel: libceph: wrong peer, want 10.231.171.114:6816/130908, got 10.231.171.114:6816/85562


The new Ceph cluster had new OSD and mon IPs.

So my questions: since these messages are coming from the kernel module, why can't the kernel module figure out that the mon and OSD IPs have changed? Is there some caching in the kernel? When rbd create/attach is called on that host, it is passed the new mon IPs, so doesn't that update the old, already mounted rbd volumes?
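(For what it's worth, the kernel client keeps its own per-mapping copy of the maps. If debugfs is mounted, they can be inspected roughly like this; the directory names are the cluster fsid plus a client id, so a mapping from the old cluster and a mapping from the new cluster are separate client instances:)

  # one directory per kernel client instance
  ls /sys/kernel/debug/ceph/

  # the maps cached by each instance
  cat /sys/kernel/debug/ceph/*/monmap
  cat /sys/kernel/debug/ceph/*/osdmap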

Hope I made my doubts clear, and yes, I am a beginner in Ceph with very limited knowledge.

Thanks for your help again
Mayank


On Tue, Jan 23, 2018 at 1:24 AM, Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de> wrote:
Hi,


On 01/23/2018 09:53 AM, Mayank Kumar wrote:
Hi Ceph Experts

I am a new user of Ceph and am currently using Kubernetes to deploy Ceph RBD volumes. We are doing some initial work rolling it out to internal customers, and in doing that we are using the IP of the host as the IP of the OSDs and mons. This means that if a host goes down, we lose that IP. While we are still experimenting with these behaviors, I wanted to see what the community thinks about the following scenario:-

1: An rbd volume is already attached and mounted on host A.
2: The OSD on which this rbd volume resides dies and never comes back up.
3: Another OSD is put in its place. I don't know the intricacies here, but I am assuming the data for this rbd volume either moves to different OSDs or goes back to the newly installed OSD.
4: The new OSD has a completely new IP.
5: Will the rbd volume attached to host A learn the new OSD IP on which its data resides, so that everything just continues to work?

What if all the mons have also changed IPs?
A volume does not reside "on an OSD". The volume is striped, and each stripe is stored in a placement group; the placement group in turn is distributed to several OSDs depending on the crush rules and the number of replicas.

If an OSD dies, ceph will backfill the now missing replicas to another OSD, provided another OSD satisfying the crush rules is available. The same process is also triggered when an OSD is added.

This process is somewhat transparent to the ceph client, as long as enough replicas are present. The ceph client (librbd accessing a volume in this case) gets asynchronous notification from the ceph mons in case of relevant changes, e.g. updates to the osd map reflecting the failure of an OSD. Traffic to the OSDs will be rerouted automatically depending on the crush rules, as explained above. The OSD map also contains the IP addresses of all OSDs, so a changed IP address is just another update to the map.
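(As a quick illustration of both points; the pool and object names below are only examples, and the exact output depends on the cluster:)

  # which placement group and which OSDs a given object maps to
  ceph osd map rbd rbd_data.1234.0000000000000000

  # the osd map records each OSD's current address
  ceph osd dump | grep '^osd'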

The only problem you might run into is changing the IP addresses of the mons. There's also a mon map listing all active mons; if the mon a ceph client is using dies or is removed, the client will switch to another active mon from the map. This works fine in a running system; you can change the mons' IP addresses one by one without any interruption to the clients (theoretically...).
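(In practice, "one by one" usually means replacing monitors rather than renumbering them in place; a rough sketch with placeholder names and addresses, assuming the new mon daemon has already been provisioned and started:)

  # add a monitor at the new address and wait for it to join the quorum
  ceph mon add mon-d 10.0.0.14:6789

  # then remove the monitor at the old address
  ceph mon remove mon-a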

The problem is starting the ceph client. In this case the client uses the list of mons from the ceph configuration file to contact one mon and receive the initial mon map. If you change the hostnames/IP addresses of the mons, you also need to update the ceph configuration file.

The above outline is how it should work, given a valid ceph and network setup. YMMV.

Regards,
Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
