Re: v11.2.0 Disk activation issue while booting

nokia ceph <nokiacephusers@xxxxxxxxx> · Wed, 14 Jun 2017 13:32:11 +0530

Hello David,
Thanks for the update.

http://tracker.ceph.com/issues/13833#note-7 - According to this tracker entry, the partition type GUID may differ, which leaves udev unable to chown the device to ceph.

We follow the procedure below to create OSDs:

#sgdisk -Z /dev/sdb
#ceph-disk prepare --bluestore --cluster ceph --cluster-uuid <fsid> /dev/vdb
#ceph-disk --verbose activate /dev/vdb1 

Here you can see that all the devices have the same GUID.

#for i in b c d ; do  /usr/sbin/blkid -o udev -p /dev/vd$i\1 | grep ID_PART_ENTRY_TYPE; done
ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
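The check above can be wrapped in a small helper. This is only a minimal sketch: `part_type_guid` and `is_ceph_osd_data` are hypothetical names, not part of ceph-disk, and the only GUID it knows is the one printed in the output above, assumed here to be the type code carried by OSD data partitions.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ceph-disk.

# Partition type GUID seen in the blkid output above (assumed: OSD data).
CEPH_OSD_DATA_GUID="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"

# Read `blkid -o udev -p <partition>` output on stdin and print the
# value of ID_PART_ENTRY_TYPE (prints nothing if the key is absent).
part_type_guid() {
    sed -n 's/^ID_PART_ENTRY_TYPE=//p' | head -n 1
}

# Succeed if the GUID given as $1 matches the OSD data type,
# ignoring upper/lower case.
is_ceph_osd_data() {
    guid=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    [ "$guid" = "$CEPH_OSD_DATA_GUID" ]
}
```

For example, `is_ceph_osd_data "$(blkid -o udev -p /dev/vdb1 | part_type_guid)" && echo ok` would confirm a single data partition. Partitions with other roles carry different type codes, so this check applies to data partitions only.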

Currently we are facing an issue with OSD activation at boot, which leaves the OSD journal device mounted like this:

~~~
/dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL
~~~

At the same time, the OSD logs show that osd.2 is unable to find the mounted journal device, hence it landed in a failed state:

~~~
May 26 15:40:39 cn1 ceph-osd: 2017-05-26 15:40:39.978072 7f1dc3bc2940 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-2: (2) No such file or directory#033[0m
May 26 15:40:39 cn1 systemd: ceph-osd@2.service: main process exited, code=exited, status=1/FAILURE
May 26 15:40:39 cn1 systemd: Unit ceph-osd@2.service entered failed state.
May 26 15:40:39 cn1 systemd: ceph-osd@2.service failed.
~~~

To fix this problem, we use the following workaround:

#umount /var/lib/ceph/tmp/mnt.om4Lbq 

Mount the device at the mount point for the respective OSD number.
#mount /dev/sdb1 /var/lib/ceph/osd/ceph-2

Then start the OSD.
#systemctl start ceph-osd@2.service
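
The OSD number for the final mount point does not have to be guessed. A minimal sketch, assuming the data partition belonged to a working OSD before the reboot and therefore contains a `whoami` file holding the OSD id; `osd_id_from_mount` and `osd_dir_for` are hypothetical helper names, and the default cluster name `ceph` is taken from the prepare command above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the OSD id stored in the `whoami` file of a mounted data partition.
# $1 = directory where the partition is currently mounted.
# Fails (non-zero) if no readable whoami file is present.
osd_id_from_mount() {
    [ -r "$1/whoami" ] || return 1
    tr -d '[:space:]' < "$1/whoami"
}

# Print the directory an OSD's data partition is expected to be mounted at.
# $1 = OSD id, $2 = cluster name (assumed default: ceph).
osd_dir_for() {
    printf '/var/lib/ceph/osd/%s-%s\n' "${2:-ceph}" "$1"
}
```

With the partition still on its temporary mount, `id=$(osd_id_from_mount /var/lib/ceph/tmp/mnt.om4Lbq) && osd_dir_for "$id"` prints the target directory to use in the mount step of the workaround.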

We notice that the services below fail at the same time.

===
systemctl --failed
  UNIT                              LOAD      ACTIVE SUB    DESCRIPTION
● var-lib-ceph-tmp-mnt.UiCYFu.mount not-found failed failed var-lib-ceph-tmp-mnt.UiCYFu.mount
● ceph-disk@dev-sdc1.service        loaded    failed failed Ceph disk activation: /dev/sdc1
● ceph-disk@dev-sdd1.service        loaded    failed failed Ceph disk activation: /dev/sdd1
● ceph-disk@dev-sdd2.service        loaded    failed failed Ceph disk activation: /dev/sdd2
===
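
To act on a listing like the one above without copying unit names by hand, the failed activation units can be filtered out of it. A minimal sketch; `failed_ceph_disk_units` is a hypothetical helper name, and it only parses text, it does not change any unit state.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Read `systemctl --failed` output on stdin and print each failed
# ceph-disk@ unit name once, one per line, sorted.
failed_ceph_disk_units() {
    grep -o 'ceph-disk@[^[:space:]]*\.service' | sort -u
}
```

Typical use would be `systemctl --failed | failed_ceph_disk_units`. Each printed unit could then be passed to `systemctl reset-failed` and `systemctl start` to retry the activation, although whether a retry succeeds depends on why the first attempt failed.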

We need your suggestions on how to proceed further.

Thanks
Jayaram

On Tue, Jun 13, 2017 at 7:30 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
I came across this a few times.  My problem was with journals I set up by myself.  I didn't give them the proper GUID partition type ID so the udev rules didn't know how to make sure the partition looked correct.  What the udev rules were unable to do was chown the journal block device as ceph:ceph so that it could be opened by the Ceph user.  You can test by chowning the journal block device and trying to start the OSD again.
Alternatively if you want to see more information, you can start the daemon manually as opposed to starting it through systemd and see what its output looks like.
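
The ownership test suggested above can be sketched as follows. `owner_matches` is a hypothetical helper name, and `stat -c` assumes GNU coreutils, as found on RHEL.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical; `stat -c` assumes GNU coreutils.

# Succeed if path $1 is owned by $2, given as user:group.
owner_matches() {
    [ "$(stat -c '%U:%G' "$1")" = "$2" ]
}
```

For example, `owner_matches /dev/sdh2 ceph:ceph || chown ceph:ceph /dev/sdh2` fixes the ownership only when needed. Running the daemon in the foreground, for instance with `ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph`, then shows its output directly. The device `/dev/sdh2` and the OSD id `2` are placeholders borrowed from the logs in this thread; substitute the actual journal device and id.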

On Tue, Jun 13, 2017 at 6:32 AM nokia ceph <nokiacephusers@xxxxxxxxx> wrote:
Hello,
Some OSDs are not getting activated after a reboot, which causes those particular OSDs to land in a failed state.

Here you can see that the mount points were not updated to the osd-num directory and the partitions were mounted at an incorrect mount point, so osd.<num> is unable to mount/activate.

Env:- RHEL 7.2 - EC 4+1, v11.2.0 bluestore.

#grep mnt /proc/mounts
/dev/sdh1 /var/lib/ceph/tmp/mnt.om4Lbq xfs rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0
/dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL xfs rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0
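
A slightly stricter version of this check matches on the mount point field rather than on any line containing "mnt". A minimal sketch; `ceph_tmp_mounts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Print "device mountpoint" for every mount whose mount point starts with
# /var/lib/ceph/tmp/mnt. (the temporary directories seen above).
# $1 = mounts table to read (defaults to /proc/mounts).
ceph_tmp_mounts() {
    awk 'index($2, "/var/lib/ceph/tmp/mnt.") == 1 { print $1, $2 }' \
        "${1:-/proc/mounts}"
}
```

Called without arguments it reads the live mount table. Any line it prints after boot points at a partition that was left on its temporary mount instead of being moved to its OSD directory.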

From /var/log/messages:

--
May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh2...
May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh1...

May 26 15:39:58 cn1 systemd: start request repeated too quickly for ceph-disk@dev-sdh2.service   => suspecting this could be root cause. 
May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: /dev/sdh2.
May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh2.service entered failed state.
May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh2.service failed.
May 26 15:39:58 cn1 systemd: start request repeated too quickly for ceph-disk@dev-sdh1.service
May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: /dev/sdh1.
May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh1.service entered failed state.
May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh1.service failed.
--
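
To see which units hit this start limit across a whole log, the unit names can be pulled out of the messages. A minimal sketch; `rate_limited_units` is a hypothetical helper name, and it assumes the message wording shown in the excerpt above.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Read syslog lines on stdin and print each unit that systemd refused to
# start with "start request repeated too quickly", once per unit, sorted.
rate_limited_units() {
    sed -n 's/.*start request repeated too quickly for \([^[:space:]]*\).*/\1/p' \
        | sort -u
}
```

For example, `rate_limited_units < /var/log/messages` lists the affected ceph-disk@ units, which can be compared against the partitions that stayed on temporary mount points.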

However, this issue occurs only intermittently after a reboot.

Note: We did not face this problem in Jewel.

Awaiting your comments.

Thanks
Jayaram
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
