Re: Help Ceph Cluster Down

Are the numbers still decreasing?

This one for instance:

"3883 PGs pending on creation" 

Caspar


On Fri, Jan 4, 2019 at 2:23 PM Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:
Hi Caspar, 

Yes, the cluster was working fine with the "number of PGs per OSD" warning up until now. I am not sure how to recover from stale/down/inactive PGs. If you happen to know how to do this, can you let me know?

Current State: 

[root@fre101 ~]# ceph -s
2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            505714/12392650 objects misplaced (4.081%)
            3883 PGs pending on creation
            Reduced data availability: 6519 pgs inactive, 1870 pgs down, 1 pg peering, 886 pgs stale
            Degraded data redundancy: 42987/12392650 objects degraded (0.347%), 634 pgs degraded, 16 pgs undersized
            125827 slow requests are blocked > 32 sec
            2 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2758 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 76 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6051k objects, 10944 GB
    usage:   21933 GB used, 50688 GB / 72622 GB avail
    pgs:     11.927% pgs not active
             42987/12392650 objects degraded (0.347%)
             505714/12392650 objects misplaced (4.081%)
             48080 active+clean
             3885  activating
             1111  down
             759   stale+down
             614   activating+degraded
             74    activating+remapped
             46    stale+active+clean
             35    stale+activating
             21    stale+activating+remapped
             9     stale+active+undersized
             9     stale+activating+degraded
             5     stale+activating+undersized+degraded+remapped
             3     activating+degraded+remapped
             1     stale+activating+degraded+remapped
             1     stale+active+undersized+degraded
             1     remapped+peering
             1     active+clean+remapped
             1     activating+undersized+degraded+remapped

  io:
    client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr

I will reduce the number of PGs per OSD once these inactive or stale PGs come online. I am not able to access the VMs and images which are using Ceph.

Thanks
Arun

On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
Hi Arun,

How did you end up with a 'working' cluster with so many pgs per OSD?

"too many PGs per OSD (2968 > max 200)"

To (temporarily) allow this many PGs per OSD you could try the following.

Change these values in the [global] section of your ceph.conf:

mon max pg per osd = 200
osd max pg per osd hard ratio = 2

This allows 200*2 = 400 PGs per OSD before the creation of new PGs is disabled.

The above are the defaults (for Luminous, and possibly other versions too).
You can check your current settings with:

ceph daemon mon.ceph-mon01 config show |grep pg_per_osd

Since your current PGs-per-OSD ratio is way higher than the default, you could set them to, for instance:

mon max pg per osd = 1000
osd max pg per osd hard ratio = 5

This allows for 1000*5 = 5000 PGs per OSD before the creation of new PGs is disabled.
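
In ceph.conf that would look roughly like this (just a sketch; keep whatever else is already in your [global] section):

[global]
    mon max pg per osd = 1000
    osd max pg per osd hard ratio = 5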

You'll need to inject the settings into the mons/OSDs and restart the mgrs to make them active:

ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
restart mgrs
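
For the mgr restart, assuming systemd-managed daemons (unit names may differ on your deployment), something like this on each host running a mgr:

systemctl restart ceph-mgr@ceph-mon01
systemctl restart ceph-mgr@ceph-mon02
systemctl restart ceph-mgr@ceph-mon03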

Kind regards,
Caspar


On Fri, Jan 4, 2019 at 4:28 AM Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:
Hi Chris, 

Indeed, that's what happened. I didn't set the noout flag either, and I zapped the disks on the new server every time. In my cluster, fre201 is the only new server.
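
(For reference, that flag is set and cleared with the standard commands below, before and after planned OSD downtime; noting it here for next time:)

ceph osd set noout
ceph osd unset noout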

Current status after enabling the 3 OSDs on host fre201:

[root@fre201 ~]# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       70.92137 root default
 -2        5.45549     host fre101
  0   hdd  1.81850         osd.0       up  1.00000 1.00000
  1   hdd  1.81850         osd.1       up  1.00000 1.00000
  2   hdd  1.81850         osd.2       up  1.00000 1.00000
 -9        5.45549     host fre103
  3   hdd  1.81850         osd.3       up  1.00000 1.00000
  4   hdd  1.81850         osd.4       up  1.00000 1.00000
  5   hdd  1.81850         osd.5       up  1.00000 1.00000
 -3        5.45549     host fre105
  6   hdd  1.81850         osd.6       up  1.00000 1.00000
  7   hdd  1.81850         osd.7       up  1.00000 1.00000
  8   hdd  1.81850         osd.8       up  1.00000 1.00000
 -4        5.45549     host fre107
  9   hdd  1.81850         osd.9       up  1.00000 1.00000
 10   hdd  1.81850         osd.10      up  1.00000 1.00000
 11   hdd  1.81850         osd.11      up  1.00000 1.00000
 -5        5.45549     host fre109
 12   hdd  1.81850         osd.12      up  1.00000 1.00000
 13   hdd  1.81850         osd.13      up  1.00000 1.00000
 14   hdd  1.81850         osd.14      up  1.00000 1.00000
 -6        5.45549     host fre111
 15   hdd  1.81850         osd.15      up  1.00000 1.00000
 16   hdd  1.81850         osd.16      up  1.00000 1.00000
 17   hdd  1.81850         osd.17      up  0.79999 1.00000
 -7        5.45549     host fre113
 18   hdd  1.81850         osd.18      up  1.00000 1.00000
 19   hdd  1.81850         osd.19      up  1.00000 1.00000
 20   hdd  1.81850         osd.20      up  1.00000 1.00000
 -8        5.45549     host fre115
 21   hdd  1.81850         osd.21      up  1.00000 1.00000
 22   hdd  1.81850         osd.22      up  1.00000 1.00000
 23   hdd  1.81850         osd.23      up  1.00000 1.00000
-10        5.45549     host fre117
 24   hdd  1.81850         osd.24      up  1.00000 1.00000
 25   hdd  1.81850         osd.25      up  1.00000 1.00000
 26   hdd  1.81850         osd.26      up  1.00000 1.00000
-11        5.45549     host fre119
 27   hdd  1.81850         osd.27      up  1.00000 1.00000
 28   hdd  1.81850         osd.28      up  1.00000 1.00000
 29   hdd  1.81850         osd.29      up  1.00000 1.00000
-12        5.45549     host fre121
 30   hdd  1.81850         osd.30      up  1.00000 1.00000
 31   hdd  1.81850         osd.31      up  1.00000 1.00000
 32   hdd  1.81850         osd.32      up  1.00000 1.00000
-13        5.45549     host fre123
 33   hdd  1.81850         osd.33      up  1.00000 1.00000
 34   hdd  1.81850         osd.34      up  1.00000 1.00000
 35   hdd  1.81850         osd.35      up  1.00000 1.00000
-27        5.45549     host fre201
 36   hdd  1.81850         osd.36      up  1.00000 1.00000
 37   hdd  1.81850         osd.37      up  1.00000 1.00000
 38   hdd  1.81850         osd.38      up  1.00000 1.00000
[root@fre201 ~]#
[root@fre201 ~]# ceph -s
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            585791/12391450 objects misplaced (4.727%)
            2 scrub errors
            2374 PGs pending on creation
            Reduced data availability: 6578 pgs inactive, 2025 pgs down, 74 pgs peering, 1234 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 64969/12391450 objects degraded (0.524%), 616 pgs degraded, 20 pgs undersized
            96242 slow requests are blocked > 32 sec
            228 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2768 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 96 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10942 GB
    usage:   21900 GB used, 50721 GB / 72622 GB avail
    pgs:     0.002% pgs unknown
             12.050% pgs not active
             64969/12391450 objects degraded (0.524%)
             585791/12391450 objects misplaced (4.727%)
             47489 active+clean
             3670  activating
             1098  stale+down
             923   down
             575   activating+degraded
             563   stale+active+clean
             105   stale+activating
             78    activating+remapped
             72    peering
             25    stale+activating+degraded
             23    stale+activating+remapped
             9     stale+active+undersized
             6     stale+activating+undersized+degraded+remapped
             5     stale+active+undersized+degraded
             4     down+remapped
             4     activating+degraded+remapped
             2     active+clean+inconsistent
             1     stale+activating+degraded+remapped
             1     stale+active+clean+remapped
             1     stale+remapped+peering
             1     remapped+peering
             1     unknown

  io:
    client:   0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr



Thanks
Arun


On Thu, Jan 3, 2019 at 7:19 PM Chris <bitskrieg@xxxxxxxxxxxxx> wrote:
If you added OSDs and then deleted them repeatedly without waiting for replication to finish as the cluster attempted to rebalance across them, it's highly likely that you are permanently missing PGs (especially if the disks were zapped each time).

If those 3 down OSDs can be revived there is a (small) chance that you can right the ship, but 1400 PGs/OSD is pretty extreme. I'm surprised the cluster even let you do that; this sounds like a data loss event.

Bring back the 3 OSDs and see what those 2 inconsistent PGs look like with ceph pg query.
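
For example (the PG ID below is a placeholder; the actual inconsistent PG IDs are listed by ceph health detail):

ceph health detail | grep inconsistent
ceph pg <pgid> query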

On January 3, 2019 21:59:38 Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:

Hi, 

Recently I tried adding a new node (OSD) to the Ceph cluster using the ceph-deploy tool. I was experimenting with the tool and ended up deleting the OSDs on the new server a couple of times.

Now, even though the Ceph OSDs are running on the new server, 10-15% of the cluster's PGs are inactive and they are not recovering or rebalancing. I'm not sure what to do. I tried shutting down the OSDs on the new server.

Status: 
[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            373907/12391198 objects misplaced (3.018%)
            2 scrub errors
            9677 PGs pending on creation
            Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 2717 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
            52486 slow requests are blocked > 32 sec
            9287 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2968 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 36 up, 36 in; 51 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10941 GB
    usage:   21727 GB used, 45308 GB / 67035 GB avail
    pgs:     13.073% pgs not active
             178350/12391198 objects degraded (1.439%)
             373907/12391198 objects misplaced (3.018%)
             46177 active+clean
             5054  down
             1173  stale+down
             1084  stale+active+undersized
             547   activating
             201   stale+active+undersized+degraded
             158   stale+activating
             96    activating+degraded
             46    stale+active+clean
             42    activating+remapped
             34    stale+activating+degraded
             23    stale+activating+remapped
             6     stale+activating+undersized+degraded+remapped
             6     activating+undersized+degraded+remapped
             2     activating+degraded+remapped
             2     active+clean+inconsistent
             1     stale+activating+degraded+remapped
             1     stale+active+clean+remapped
             1     stale+remapped
             1     down+remapped
             1     remapped+peering

  io:
    client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia




--
Arun Poonia

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Arun Poonia

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
