Re: Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2

Hi, 

For your information, and for everyone in the same situation as me.

I found in the release notes a very good explanation of the case where
an OSD server is down but the monitors don't know about it. It can happen
when Ceph has been upgraded across several releases. For this case
(firefly) there are some instructions; a consolidated sketch follows the
list:



        Upgrade Ceph on monitor hosts

        Restart all ceph-mon daemons

        Set noout::

            ceph osd set noout

        Upgrade Ceph on all OSD hosts

        Stop all ceph-osd daemons

        Mark all OSDs down with something like::

            ceph osd down `seq 0 1000`

        Start all ceph-osd daemons

        Let the cluster settle and then unset noout::

            ceph osd unset noout

        Upgrade and restart any remaining daemons (ceph-mds, radosgw)
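
Put together, the OSD part of that procedure would look roughly like the
following script. This is only a sketch: I'm assuming the OSD ids 0-4 from
my tree further down and systemd-managed daemons, so adjust it to your own
cluster.

    #!/bin/bash
    # Sketch of the release-notes procedure above. Assumes OSD ids 0..4
    # (as in my "ceph osd tree" below) and systemd units.
    ceph osd set noout                  # prevent OSDs from being marked out while they restart
    systemctl stop ceph-osd.target      # stop all ceph-osd daemons on this host
    for id in 0 1 2 3 4; do
        ceph osd down "$id"             # mark each OSD down so the monitors notice
    done
    systemctl start ceph-osd.target     # start the upgraded ceph-osd daemons
    # wait for the cluster to settle (watch "ceph -s"), then:
    ceph osd unset noout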


In my case the relevant point was to mark the hung OSD as down. After
that, everything started to work again.
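
For example, to mark a hung OSD down from the command line (osd.1 here is
just an example, use whichever OSD is hanging):

    ceph osd down 1      # or: ceph osd down osd.1
    ceph -w              # watch it rejoin once the daemon re-registers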

I suppose it was in a stale situation.

On Wed, 2016-05-11 at 09:37 +0200, Gonzalo Aguilar Delgado wrote:
> Hello again, 
> 
> I was looking at the patches sent to the repository and I found a
> patch that makes the OSD check for cluster health before starting
> up. 
> 
> Can this patch be the source of all my problems?
> 
> 
> Best regards,
> 
> On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado <gaguilar.delgado@xxxxxxxxx> wrote:
> > Hello, 
> > 
> > I just upgraded my cluster to version 10.1.2 and it worked well
> > for a while, until I saw that the systemd unit
> > ceph-disk@dev-sdc1.service had failed and I re-ran it.
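> >
> > (For reference, this is roughly how I checked and re-ran the unit:)
> >
> >     systemctl status ceph-disk@dev-sdc1.service
> >     journalctl -u ceph-disk@dev-sdc1.service --no-pager | tail -n 50
> >     systemctl restart ceph-disk@dev-sdc1.service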
> > 
> > From there the OSD stopped working. 
> > 
> > This is Ubuntu 16.04. 
> > 
> > I went on IRC looking for help, where people pointed me to one
> > place or another, but none of the investigations resolved the
> > issue.
> > 
> > My configuration is rather simple:
> > 
> > root@red-compute:~# ceph osd tree
> > ID WEIGHT  TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 1.00000 root default
> > -4 1.00000     rack rack-1
> > -2 1.00000         host blue-compute
> >  0 1.00000             osd.0            down        0          1.00000
> >  2 1.00000             osd.2            down        0          1.00000
> > -3 1.00000         host red-compute
> >  1 1.00000             osd.1            down        0          1.00000
> >  3 0.50000             osd.3              up  1.00000          1.00000
> >  4 1.00000             osd.4            down        0          1.00000
> > 
> > 
> > 
> > This is what I've got so far:
> > 
> > Once upgraded, I discovered that the daemons run as the ceph user.
> > I just ran chown on the ceph directories and it worked.
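> >
> > Something like this (assuming the default directories):
> >
> >     chown -R ceph:ceph /var/lib/ceph /var/log/ceph
> >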
> > The firewall is fully disabled. I checked connectivity with nc and nmap. 
> > The configuration seems to be right; I can post it if you want. 
> > Enabling logging on the OSDs shows that, for example, osd.1 is
> > reconnecting all the time:
> > 2016-05-10 14:35:48.199573 7f53e8f1a700  1 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
> > 2016-05-10 14:35:48.199966 7f53e8f1a700  2 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 c=0x556f993b3a80).fault (0) Success
> > 2016-05-10 14:35:48.200018 7f53fb941700  1 osd.1 2468 ms_handle_reset con 0x556f993b3a80 session 0
> > osd.3 stays ok because it was never marked out, thanks to a ceph restriction.
> > I restarted all the services at once so that all OSDs would be available
> > at the same time and not get marked down. It didn't work.
> > I forced them in from the command line with ceph osd in for osds 1-5.
> > They appear as in for a while, then go out again.
> > We tried ceph-disk activate-all to bring everything up. It didn't work.
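> >
> > For reference, the connectivity and logging checks were roughly the
> > following (addresses, ports and debug levels are just what I tried,
> > adjust as needed):
> >
> >     # connectivity between the hosts (mon port and OSD port range)
> >     nc -zv 172.16.0.100 6789
> >     nmap -p 6800-7300 172.16.0.119
> >     # raise logging on a running OSD (or set "debug osd" in ceph.conf if it is not up)
> >     ceph tell osd.1 injectargs '--debug-osd 10 --debug-ms 1'
> >     # try to (re)activate all prepared OSD disks
> >     ceph-disk activate-all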
> > 
> > The strange thing is that the cluster worked just fine right after the
> > upgrade, but the systemctl command broke both servers. 
> > root@blue-compute:~# ceph -w
> >     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
> >      health HEALTH_ERR
> >             694 pgs are stuck inactive for more than 300 seconds
> >             694 pgs stale
> >             694 pgs stuck stale
> >             too many PGs per OSD (1528 > max 300)
> >             mds cluster is degraded
> >             crush map has straw_calc_version=0
> >      monmap e10: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> >             election epoch 3600, quorum 0,1 red-compute,blue-compute
> >       fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
> >      osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
> >       pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
> >             87641 MB used, 212 GB / 297 GB avail
> >                  694 stale+active+clean
> >                   70 active+clean
> > 
> > 2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has straw_calc_version=
> >
> > cat /etc/ceph/ceph.conf 
> > [global]
> > 
> > fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
> > mon_initial_members = blue-compute, red-compute
> > mon_host = 172.16.0.119, 172.16.0.100
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = 172.16.0.0/24
> > osd_pool_default_pg_num = 100
> > osd_pool_default_pgp_num = 100
> > osd_pool_default_size = 2  # Write an object 2 times.
> > osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
> > 
> > ## Required upgrade
> > osd max object name len = 256
> > osd max object namespace len = 64
> > 
> > [mon.]
> > 
> >     debug mon = 9
> >     caps mon = "allow *"
> > 
> > Any help on this? Any clue of what's going wrong?
> > 
> > Best regards,
> > 
> > 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



