Re: After reboot nothing worked

Hi Karan,

Thanks for your reply. I have spent some time on it and finally found the problems behind this issue:

1) If I reboot any of the nodes, then when it comes back the OSD service does not start, because /var/lib/ceph/osd/ceph-0 is not mounted.
    So I manually edited /etc/fstab and added a mount entry for the ceph OSD storage, e.g.

    UUID=142136cd-8325-44a7-ad67-80fe19ed3873 /var/lib/ceph/osd/ceph-0 xfs defaults,noatime 0 0

   The above fixed the issue. Now the questions: is this a valid approach, and why does ceph not activate the OSD drive on reboot?
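
   For reference, a minimal sketch of checking and fixing this by hand after a boot (the device path /dev/sdb1 below is only a placeholder for the real OSD data partition):

       # is the OSD data directory actually mounted?
       mount | grep /var/lib/ceph/osd/ceph-0

       # if not, mount it (or run "sudo mount -a" after adding the fstab entry above)
       sudo mount -t xfs -o noatime /dev/sdb1 /var/lib/ceph/osd/ceph-0

       # then start the OSD daemon again (assuming osd.0 is known to the init script on this host)
       sudo /etc/init.d/ceph start osd.0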

2) After fixing the above issue I rebooted all my nodes again, and this time there is another warning:

    health HEALTH_WARN clock skew detected on mon.vms2

    Here is the output:

     health HEALTH_WARN clock skew detected on mon.vms2
     monmap e1: 2 mons at {vms1=192.168.1.128:6789/0,vms2=192.168.1.129:6789/0}, election epoch 14, quorum 0,1 vms1,vms2
     mdsmap e11: 1/1/1 up {0=vms1=up:active}
     osdmap e36: 3 osds: 3 up, 3 in
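
    For the clock skew, a rough sketch of the usual NTP check on the monitor hosts (this assumes ntpd is the time daemon; the service may be named "ntp" or "ntpd" depending on the distro, and syncing vms2 against vms1's address 192.168.1.128 is only an example):

        # on vms2, confirm ntpd is actually synced to a peer
        ntpq -p

        # if the clock has drifted, force a one-off resync and restart ntpd
        sudo service ntpd stop
        sudo ntpdate 192.168.1.128
        sudo service ntpd start

        # the warning should clear once the clocks converge; restarting the monitor is optional
        sudo /etc/init.d/ceph restart mon.vms2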


My current setup is 3 OSDs, 2 mons and 1 MDS.

Br.

Umar


On Tue, Dec 17, 2013 at 2:54 PM, Karan Singh <ksingh@xxxxxx> wrote:
Umar

Ceph is stable for production; there are a large number of Ceph clusters deployed and running smoothly in production, and countless more in testing / pre-production.

The fact that you are facing problems in your Ceph testing does not mean Ceph is unstable.

I would suggest putting some time into troubleshooting your problem.

What I see from your logs:

 1) You have 2 mons, and that is a problem (run either 1 or 3 so they can form a quorum). Add 1 more monitor node.
 2) Out of 2 OSDs, only 1 is IN; check where the other one is and try bringing both of them UP. Add a few more OSDs to clear the health warning; 2 is a very small number of OSDs. (See the sketch just below.)
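
A rough sketch of the commands involved (assuming the cluster was set up with ceph-deploy; "mon3" is a placeholder hostname, and the OSD ids come from your "ceph osd tree" output):

    # add a third monitor node (newer ceph-deploy releases use "mon add" for an existing cluster)
    ceph-deploy mon create mon3

    # see which OSDs are down/out, then start the missing one on its host
    ceph osd tree
    sudo /etc/init.d/ceph start osd.1

    # mark it back "in" so it holds data again
    ceph osd in osd.1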

Many Thanks
Karan Singh



From: "Umar Draz" <unix.co@xxxxxxxxx>
To: ceph-users@xxxxxxxx
Sent: Tuesday, 17 December, 2013 8:51:27 AM
Subject: After reboot nothing worked


Hello,

I have a 2-node Ceph cluster. I rebooted both hosts just to test whether the cluster keeps working after a reboot, and the result was that the cluster was unable to start.

Here is the ceph -s output:

     health HEALTH_WARN 704 pgs stale; 704 pgs stuck stale; mds cluster is degraded; 1/1 in osds are down; clock skew detected on mon.kvm2
     monmap e2: 2 mons at {kvm1=192.168.214.10:6789/0,kvm2=192.168.214.11:6789/0}, election epoch 16, quorum 0,1 kvm1,kvm2
     mdsmap e13: 1/1/1 up {0=kvm1=up:replay}
     osdmap e29: 2 osds: 0 up, 1 in
      pgmap v68: 704 pgs, 4 pools, 9603 bytes data, 23 objects
            1062 MB used, 80816 MB / 81879 MB avail
                 704 stale+active+clean
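
(For reference, the detailed version of those warnings, including which OSDs are down and which monitor has the clock skew, comes from a couple of standard commands; a minimal sketch:)

     ceph health detail
     ceph osd stat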

Following this useless documentation, I tried ceph osd tree. The output was:

# id    weight  type name       up/down reweight
-1      0.16    root default
-2      0.07999         host kvm1
0       0.07999                 osd.0   down    1
-3      0.07999         host kvm2
1       0.07999                 osd.1   down    0

Then I tried:

sudo /etc/init.d/ceph -a start osd.0
sudo /etc/init.d/ceph -a start osd.1

to start the OSDs on both hosts. The result was:

/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

/etc/init.d/ceph: osd.1 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
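
(For what it is worth: "not found" here means the init script found no osd sections in /etc/ceph/ceph.conf and nothing usable under /var/lib/ceph, often because the OSD data partitions are not mounted yet. OSDs prepared with ceph-disk / ceph-deploy are normally brought up by activating their data partition, which mounts it and starts the daemon. A rough sketch, with /dev/sdb1 as a placeholder device:)

    # show how ceph-disk sees the local disks and partitions
    sudo ceph-disk list

    # mount the OSD data partition and start its daemon
    sudo ceph-disk activate /dev/sdb1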

Now the question is: what is this? Is Ceph really stable? Can we use it in a production environment?

Both of my hosts have NTP running and the time is up to date.

Br.

Umar

--
Umar Draz
Network Architect
