Re: Help Ceph Cluster Down

If you repeatedly added and then deleted OSDs without waiting for replication to finish while the cluster attempted to rebalance across them, it's highly likely that you are permanently missing PGs (especially if the disks were zapped each time).

If those 3 down OSDs can be revived there is a (small) chance that you can right the ship, but 1400 PGs/OSD is pretty extreme. I'm surprised the cluster even let you do that - this sounds like a data loss event.

Bring the 3 OSDs back and see what those 2 inconsistent PGs look like with ceph pg query.
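Roughly, something like this (a sketch assuming systemd-managed OSDs; the OSD ids and PG ids below are placeholders - substitute the ones your cluster actually reports):

```shell
# List the down OSDs and the PGs flagged inconsistent
ceph osd tree down
ceph health detail | grep inconsistent

# On the host that owns them, restart the down OSD daemons
# (replace 10 11 12 with your actual down OSD ids)
systemctl start ceph-osd@10 ceph-osd@11 ceph-osd@12

# Once they are up and in, query each inconsistent PG
# (replace 1.2f / 3.1a with the PG ids from health detail)
ceph pg 1.2f query
ceph pg 3.1a query
```

The pg query output will show which OSDs the PG is peering with and why it is stuck, which tells you whether the data still exists somewhere.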

On January 3, 2019 21:59:38 Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:

Hi, 

Recently I tried adding a new node (OSD) to the Ceph cluster using the ceph-deploy tool. Since I was experimenting with the tool, I ended up deleting the OSDs on the new server a couple of times.

Now the Ceph OSDs are running on the new server, but the cluster's PGs seem to be inactive (10-15%) and they are not recovering or rebalancing. Not sure what to do. I tried shutting down the OSDs on the new server.

Status: 
[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            373907/12391198 objects misplaced (3.018%)
            2 scrub errors
            9677 PGs pending on creation
            Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 2717 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
            52486 slow requests are blocked > 32 sec
            9287 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2968 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 36 up, 36 in; 51 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10941 GB
    usage:   21727 GB used, 45308 GB / 67035 GB avail
    pgs:     13.073% pgs not active
             178350/12391198 objects degraded (1.439%)
             373907/12391198 objects misplaced (3.018%)
             46177 active+clean
             5054  down
             1173  stale+down
             1084  stale+active+undersized
             547   activating
             201   stale+active+undersized+degraded
             158   stale+activating
             96    activating+degraded
             46    stale+active+clean
             42    activating+remapped
             34    stale+activating+degraded
             23    stale+activating+remapped
             6     stale+activating+undersized+degraded+remapped
             6     activating+undersized+degraded+remapped
             2     activating+degraded+remapped
             2     active+clean+inconsistent
             1     stale+activating+degraded+remapped
             1     stale+active+clean+remapped
             1     stale+remapped
             1     down+remapped
             1     remapped+peering

  io:
    client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
