Found some more info, but it's getting weird... All three OSD nodes show the same "unknown cluster" message on all the OSD disks. I don't know where it came from; all the nodes were configured using ceph-deploy on the admin node. In any case, the OSDs seem to be up and running and the health is OK. No ceph-disk@ services are running on any of the OSD nodes, which I hadn't noticed before, and each node was set up exactly the same, yet different services are listed under systemctl:

OSD NODE 1:
Output in earlier email

OSD NODE 2:
● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
● ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4

OSD NODE 3:
● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
● ceph-disk@dev-sdb3.service loaded failed failed Ceph disk activation: /dev/sdb3
● ceph-disk@dev-sdb4.service loaded failed failed Ceph disk activation: /dev/sdb4
● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc3.service loaded failed failed Ceph disk activation: /dev/sdc3
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4
From my understanding, the disks have already been activated... Should these services even be running or enabled?
Mike
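For reference, a sketch of how the state of those units can be checked on one node (plain systemctl queries, nothing Ceph-specific; ceph-disk@dev-sdb1.service is just one of the failed units listed above):

# systemctl list-units 'ceph-disk@*' --all          # loaded/active/failed state of the activation units
# systemctl is-enabled ceph-disk@dev-sdb1.service
# systemctl is-active ceph-osd@0.service            # the daemons that actually need to stay up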
On Tue, Nov 29, 2016 at 6:33 PM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
Sorry about that... Here is the output of ceph-disk list:

ceph-disk list
/dev/dm-0 other, xfs, mounted on /
/dev/dm-1 swap, swap
/dev/dm-2 other, xfs, mounted on /home
/dev/sda :
 /dev/sda2 other, LVM2_member
 /dev/sda1 other, xfs, mounted on /boot
/dev/sdb :
 /dev/sdb1 ceph journal
 /dev/sdb2 ceph journal
 /dev/sdb3 ceph journal
 /dev/sdb4 ceph journal
 /dev/sdb5 ceph journal
/dev/sdc :
 /dev/sdc1 ceph journal
 /dev/sdc2 ceph journal
 /dev/sdc3 ceph journal
 /dev/sdc4 ceph journal
 /dev/sdc5 ceph journal
/dev/sdd :
 /dev/sdd1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.0
/dev/sde :
 /dev/sde1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.1
/dev/sdf :
 /dev/sdf1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.2
/dev/sdg :
 /dev/sdg1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.3
/dev/sdh :
 /dev/sdh1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.4
/dev/sdi :
 /dev/sdi1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.5
/dev/sdj :
 /dev/sdj1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.6
/dev/sdk :
 /dev/sdk1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.7
/dev/sdl :
 /dev/sdl1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.8
/dev/sdm :
 /dev/sdm1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.9

On Tue, Nov 29, 2016 at 6:32 PM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
I forgot to add:

On Tue, Nov 29, 2016 at 6:28 PM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
So it looks like the journal partition is mounted:

ls -lah /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx. 1 ceph ceph 9 Oct 10 16:11 /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1

Here is the output of journalctl -xe when I try to start the ceph-disk@dev-sdb1 service:

sh[17481]: mount_activate: Failed to activate
sh[17481]: unmount: Unmounting /var/lib/ceph/tmp/mnt.m9ek7W
sh[17481]: command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.m9ek7W
sh[17481]: Traceback (most recent call last):
sh[17481]: File "/usr/sbin/ceph-disk", line 9, in <module>
sh[17481]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5011, in run
sh[17481]: main(sys.argv[1:])
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4962, in main
sh[17481]: args.func(args)
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4720, in <lambda>
sh[17481]: func=lambda args: main_activate_space(name, args),
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3739, in main_activate_space
sh[17481]: reactivate=args.reactivate,
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3073, in mount_activate
sh[17481]: (osd_id, cluster) = activate(path, activate_key_template, init)
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3220, in activate
sh[17481]: ' with fsid %s' % ceph_fsid)
sh[17481]: ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
sh[17481]: Traceback (most recent call last):
sh[17481]: File "/usr/sbin/ceph-disk", line 9, in <module>
sh[17481]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5011, in run
sh[17481]: main(sys.argv[1:])
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4962, in main
sh[17481]: args.func(args)
sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4399, in main_trigger
sh[17481]: raise Error('return code ' + str(ret))
sh[17481]: ceph_disk.main.Error: Error: return code 1
systemd[1]: ceph-disk@dev-sdb1.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Ceph disk activation: /dev/sdb1.

I don't understand this error:

ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9

My fsid in ceph.conf is:

fsid = 75d6dba9-2144-47b1-87ef-1fe21d3c58a8

I don't know why the fsid would change or be different. I thought I had a basic cluster setup; I don't understand what's going wrong.

Mike

On Tue, Nov 29, 2016 at 5:15 PM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
Hi John,

Thanks, I wasn't sure if something had happened to the journal partitions or not. Right now the ceph-osd.0-9 services are back up and the cluster health is good, but none of the ceph-disk@dev-sd* services are running. How can I get the journal partitions mounted again?

Cheers,
Mike

On Tue, Nov 29, 2016 at 4:30 PM, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
Also, don't run sgdisk again; that's just for creating the journal partitions. ceph-disk is a service used for prepping disks; only the OSD services need to be running, as far as I know. Are the ceph-osd@x services running now that you've mounted the disks?
___
John Petrini
NOC Systems Administrator // CoreDial, LLC // coredial.com //
Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422
P: 215.297.4400 x232 // F: 215.297.4401 // E: jpetrini@xxxxxxxxxxxx
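Regarding the "No cluster conf found in /etc/ceph with fsid ..." error earlier in the thread, a sketch of the three places the fsid can be compared (ceph_fsid is the copy written into each OSD data directory when the disk was prepared; paths are the Ceph defaults):

# grep fsid /etc/ceph/ceph.conf                 # fsid the config file declares
# ceph fsid                                     # fsid the monitors report
# cat /var/lib/ceph/osd/ceph-0/ceph_fsid        # fsid stamped on the OSD data directory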
On Tue, Nov 29, 2016 at 7:27 PM, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
What command are you using to start your OSDs?
On Tue, Nov 29, 2016 at 7:19 PM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
I was able to bring the OSDs up by looking at my other OSD node, which has the exact same hardware/disks, and finding out how the disks map. But I still can't start any of the ceph-disk@dev-sd* services... When I first installed the cluster and got the OSDs up, I had to run the following:

# sgdisk -t 1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
# sgdisk -t 2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
# sgdisk -t 3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
# sgdisk -t 4:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
# sgdisk -t 5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
# sgdisk -t 1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
# sgdisk -t 2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
# sgdisk -t 3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
# sgdisk -t 4:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
# sgdisk -t 5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
Do I need to run that again?
Cheers,
Mike
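Before re-running those, here is a sketch of how the current type code on a journal partition could be checked first (sgdisk -i prints a single partition's GUID code; 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 is the "ceph journal" type used in the commands above):

# sgdisk -i 1 /dev/sdb       # "Partition GUID code" should already show 45B0969E-9B03-4F30-B4C6-B4B80CEFF106
# sgdisk -i 1 /dev/sdc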
On Tue, Nov 29, 2016 at 4:13 PM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
Normally they mount based on the GPT label. If that's not working, you can mount the disk under /mnt and then cat the file called whoami to find out the OSD number.
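For example, a sketch against one of the "ceph data" partitions from the ceph-disk list output above (/dev/sdd1 here is just an example; substitute whichever partition is unmounted):

# mount /dev/sdd1 /mnt       # any currently unmounted "ceph data" partition
# cat /mnt/whoami            # prints the OSD id, e.g. 0 -> ceph-osd@0
# cat /mnt/ceph_fsid         # the cluster fsid this OSD was prepared for
# umount /mnt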
On 29 Nov 2016 23:56, "Mike Jacobacci" <mikej@xxxxxxxxxx> wrote:
OK, I am in some trouble now and would love some help! After updating, none of the OSDs on the node will come back up:

● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
● ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
● ceph-disk@dev-sdb3.service loaded failed failed Ceph disk activation: /dev/sdb3
● ceph-disk@dev-sdb4.service loaded failed failed Ceph disk activation: /dev/sdb4
● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
● ceph-disk@dev-sdc1.service loaded failed failed Ceph disk activation: /dev/sdc1
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc3.service loaded failed failed Ceph disk activation: /dev/sdc3
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4
● ceph-disk@dev-sdc5.service loaded failed failed Ceph disk activation: /dev/sdc5
● ceph-disk@dev-sdd1.service loaded failed failed Ceph disk activation: /dev/sdd1
● ceph-disk@dev-sde1.service loaded failed failed Ceph disk activation: /dev/sde1
● ceph-disk@dev-sdf1.service loaded failed failed Ceph disk activation: /dev/sdf1
● ceph-disk@dev-sdg1.service loaded failed failed Ceph disk activation: /dev/sdg1
● ceph-disk@dev-sdh1.service loaded failed failed Ceph disk activation: /dev/sdh1
● ceph-disk@dev-sdi1.service loaded failed failed Ceph disk activation: /dev/sdi1
● ceph-disk@dev-sdj1.service loaded failed failed Ceph disk activation: /dev/sdj1
● ceph-disk@dev-sdk1.service loaded failed failed Ceph disk activation: /dev/sdk1
● ceph-disk@dev-sdl1.service loaded failed failed Ceph disk activation: /dev/sdl1
● ceph-disk@dev-sdm1.service loaded failed failed Ceph disk activation: /dev/sdm1
● ceph-osd@0.service loaded failed failed Ceph object storage daemon
● ceph-osd@1.service loaded failed failed Ceph object storage daemon
● ceph-osd@2.service loaded failed failed Ceph object storage daemon
● ceph-osd@3.service loaded failed failed Ceph object storage daemon
● ceph-osd@4.service loaded failed failed Ceph object storage daemon
● ceph-osd@5.service loaded failed failed Ceph object storage daemon
● ceph-osd@6.service loaded failed failed Ceph object storage daemon
● ceph-osd@7.service loaded failed failed Ceph object storage daemon
● ceph-osd@8.service loaded failed failed Ceph object storage daemon
● ceph-osd@9.service loaded failed failed Ceph object storage daemon

I did some searching and saw that the issue is that the disks aren't mounting... My question is how can I mount them correctly again (note sdb and sdc are SSDs for cache)? I am not sure which disk maps to ceph-osd@0 and so on. Also, can I add them to /etc/fstab as a workaround?

Cheers,
Mike

On Tue, Nov 29, 2016 at 10:41 AM, Mike Jacobacci <mikej@xxxxxxxxxx> wrote:
Hello,

I would like to install OS updates on the Ceph cluster and activate a second 10Gb port on the OSD nodes, so I wanted to verify the correct steps to perform maintenance on the cluster. We are only using RBD to back our XenServer VMs at this point, and our cluster consists of 3 OSD nodes, 3 mon nodes and 1 admin node... Would these be the correct steps?

1. Shut down VMs?
2. Run "ceph osd set noout" on the admin node.
3. Install updates on each monitor node and reboot one at a time.
4. Install updates on the OSD nodes and activate the second 10Gb port, rebooting one OSD node at a time.
5. Once all nodes are back up, run "ceph osd unset noout".
6. Bring VMs back online.

Does this sound correct?

Cheers,
Mike
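As a sketch, the flag handling in steps 2 and 5 would look like this from the admin node (the updates and reboots themselves are done per node, waiting for the cluster to return to HEALTH_OK before moving on):

# ceph osd set noout        # stop CRUSH from marking the rebooting OSDs out
# ceph -s                   # should show "noout flag(s) set" before the first reboot
# ceph osd unset noout      # after the last node is back and healthy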
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com