Are the numbers still decreasing?
This one for instance:
"3883 PGs pending on creation"
Caspar
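A quick way to answer that question from the figures already posted in this thread is to compare the "PGs pending on creation" counts of the three quoted `ceph -s` outputs. The sketch below is an editorial illustration, not part of the original message; the labels and the helper name `deltas` are hypothetical, and it assumes the three outputs are in chronological order.

```python
# Editorial sketch, not part of the original thread. The counts are copied
# from the "PGs pending on creation" health lines of the three `ceph -s`
# outputs quoted in this thread, assumed to be in chronological order.
pending_on_creation = [
    ("first report (2019-01-03 18:56)", 9677),
    ("after enabling 3 OSDs on fre201", 2374),
    ("latest report (2019-01-04 05:22)", 3883),
]


def deltas(samples):
    """Return (label, change versus the previous sample) for each later sample."""
    return [
        (label, count - prev_count)
        for (_, prev_count), (label, count) in zip(samples, samples[1:])
    ]


for label, change in deltas(pending_on_creation):
    trend = "decreasing" if change < 0 else "not decreasing"
    print(f"{label}: {change:+d} ({trend})")
```

The same comparison could be applied to any other counter in the health output, such as the inactive or stale PG counts.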
On Fri, 4 Jan 2019 at 14:23, Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:
Hi Caspar,

Yes, the cluster was working fine, with the number-of-PGs-per-OSD warning, up until now. I am not sure how to recover from stale, down, or inactive PGs. If you happen to know how, can you let me know?

Current state:

[root@fre101 ~]# ceph -s
2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            505714/12392650 objects misplaced (4.081%)
            3883 PGs pending on creation
            Reduced data availability: 6519 pgs inactive, 1870 pgs down, 1 pg peering, 886 pgs stale
            Degraded data redundancy: 42987/12392650 objects degraded (0.347%), 634 pgs degraded, 16 pgs undersized
            125827 slow requests are blocked > 32 sec
            2 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2758 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 76 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6051k objects, 10944 GB
    usage:   21933 GB used, 50688 GB / 72622 GB avail
    pgs:     11.927% pgs not active
             42987/12392650 objects degraded (0.347%)
             505714/12392650 objects misplaced (4.081%)
             48080 active+clean
             3885  activating
             1111  down
             759   stale+down
             614   activating+degraded
             74    activating+remapped
             46    stale+active+clean
             35    stale+activating
             21    stale+activating+remapped
             9     stale+active+undersized
             9     stale+activating+degraded
             5     stale+activating+undersized+degraded+remapped
             3     activating+degraded+remapped
             1     stale+activating+degraded+remapped
             1     stale+active+undersized+degraded
             1     remapped+peering
             1     active+clean+remapped
             1     activating+undersized+degraded+remapped

  io:
    client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr

I will update the number of PGs per OSD once these inactive or stale PGs come online.
I am not able to access VMs (VMs, Images) which are using Ceph.

Thanks
Arun

On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:

Hi Arun,

How did you end up with a 'working' cluster with so many pgs per OSD?
"too many PGs per OSD (2968 > max 200)"
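The warning quoted above compares the observed PGs-per-OSD figure with mon max pg per osd. As a rough sketch of the arithmetic behind the settings discussed in this message (an editorial illustration, not part of the original post), the snippet below assumes, as the message states, that creation of new PGs is blocked at mon max pg per osd multiplied by osd max pg per osd hard ratio. The function name `hard_pg_limit` is hypothetical and no Ceph API is called.

```python
# Editorial sketch, not from the original thread: plain arithmetic only.
# hard_pg_limit is a hypothetical helper name, not a Ceph function.

def hard_pg_limit(mon_max_pg_per_osd: int, hard_ratio: float) -> float:
    """PGs per OSD at which creation of new PGs is blocked, per the message."""
    return mon_max_pg_per_osd * hard_ratio


# PGs-per-OSD figures reported by the three `ceph -s` outputs in this thread.
observed = [2968, 2768, 2758]

# (label, mon max pg per osd, osd max pg per osd hard ratio) as given in the message.
for label, mon_max, ratio in [("defaults", 200, 2), ("suggested", 1000, 5)]:
    limit = hard_pg_limit(mon_max, ratio)
    above = [n for n in observed if n > limit]
    print(f"{label}: hard limit {limit:g} PGs/OSD, observed values above it: {above}")
```

With the values from this thread, every observed figure exceeds the default hard limit, while none exceeds the suggested one.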
To (temporarily) allow this many PGs per OSD, you could try this:

Change these values in the global section of your ceph.conf:

mon max pg per osd = 200
osd max pg per osd hard ratio = 2

This allows 200*2 = 400 PGs per OSD before disabling the creation of new PGs.

The above are the defaults (for Luminous, maybe other versions too).

You can check your current settings with:

ceph daemon mon.ceph-mon01 config show | grep pg_per_osd

Since your current PGs-per-OSD ratio is way higher than the default, you could set them to, for instance:

mon max pg per osd = 1000
osd max pg per osd hard ratio = 5

This allows for 5000 PGs per OSD before disabling the creation of new PGs.

You'll need to inject the settings into the mons/osds and restart the mgrs to make them active:

ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
restart mgrs

Kind regards,
Caspar

On Fri, 4 Jan 2019 at 04:28, Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:

Hi Chris,

Indeed, that's what happened. I didn't set the noout flag either, and I zapped the disks on the new server every time. In my cluster status, fre201 is the only new server.

Current status after enabling 3 OSDs on the fre201 host:

[root@fre201 ~]# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       70.92137 root default
 -2        5.45549     host fre101
  0   hdd  1.81850         osd.0       up  1.00000 1.00000
  1   hdd  1.81850         osd.1       up  1.00000 1.00000
  2   hdd  1.81850         osd.2       up  1.00000 1.00000
 -9        5.45549     host fre103
  3   hdd  1.81850         osd.3       up  1.00000 1.00000
  4   hdd  1.81850         osd.4       up  1.00000 1.00000
  5   hdd  1.81850         osd.5       up  1.00000 1.00000
 -3        5.45549     host fre105
  6   hdd  1.81850         osd.6       up  1.00000 1.00000
  7   hdd  1.81850         osd.7       up  1.00000 1.00000
  8   hdd  1.81850         osd.8       up  1.00000 1.00000
 -4        5.45549     host fre107
  9   hdd  1.81850         osd.9       up  1.00000 1.00000
 10   hdd  1.81850         osd.10      up  1.00000 1.00000
 11   hdd  1.81850         osd.11      up  1.00000 1.00000
 -5        5.45549     host fre109
 12   hdd  1.81850         osd.12      up  1.00000 1.00000
 13   hdd  1.81850         osd.13      up  1.00000 1.00000
 14   hdd  1.81850         osd.14      up  1.00000 1.00000
 -6        5.45549     host fre111
 15   hdd  1.81850         osd.15      up  1.00000 1.00000
 16   hdd  1.81850         osd.16      up  1.00000 1.00000
 17   hdd  1.81850         osd.17      up  0.79999 1.00000
 -7        5.45549     host fre113
 18   hdd  1.81850         osd.18      up  1.00000 1.00000
 19   hdd  1.81850         osd.19      up  1.00000 1.00000
 20   hdd  1.81850         osd.20      up  1.00000 1.00000
 -8        5.45549     host fre115
 21   hdd  1.81850         osd.21      up  1.00000 1.00000
 22   hdd  1.81850         osd.22      up  1.00000 1.00000
 23   hdd  1.81850         osd.23      up  1.00000 1.00000
-10        5.45549     host fre117
 24   hdd  1.81850         osd.24      up  1.00000 1.00000
 25   hdd  1.81850         osd.25      up  1.00000 1.00000
 26   hdd  1.81850         osd.26      up  1.00000 1.00000
-11        5.45549     host fre119
 27   hdd  1.81850         osd.27      up  1.00000 1.00000
 28   hdd  1.81850         osd.28      up  1.00000 1.00000
 29   hdd  1.81850         osd.29      up  1.00000 1.00000
-12        5.45549     host fre121
 30   hdd  1.81850         osd.30      up  1.00000 1.00000
 31   hdd  1.81850         osd.31      up  1.00000 1.00000
 32   hdd  1.81850         osd.32      up  1.00000 1.00000
-13        5.45549     host fre123
 33   hdd  1.81850         osd.33      up  1.00000 1.00000
 34   hdd  1.81850         osd.34      up  1.00000 1.00000
 35   hdd  1.81850         osd.35      up  1.00000 1.00000
-27        5.45549     host fre201
 36   hdd  1.81850         osd.36      up  1.00000 1.00000
 37   hdd  1.81850         osd.37      up  1.00000 1.00000
 38   hdd  1.81850         osd.38      up  1.00000 1.00000

[root@fre201 ~]# ceph -s
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            585791/12391450 objects misplaced (4.727%)
            2 scrub errors
            2374 PGs pending on creation
            Reduced data availability: 6578 pgs inactive, 2025 pgs down, 74 pgs peering, 1234 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 64969/12391450 objects degraded (0.524%), 616 pgs degraded, 20 pgs undersized
            96242 slow requests are blocked > 32 sec
            228 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2768 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 96 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10942 GB
    usage:   21900 GB used, 50721 GB / 72622 GB avail
    pgs:     0.002% pgs unknown
             12.050% pgs not active
             64969/12391450 objects degraded (0.524%)
             585791/12391450 objects misplaced (4.727%)
             47489 active+clean
             3670  activating
             1098  stale+down
             923   down
             575   activating+degraded
             563   stale+active+clean
             105   stale+activating
             78    activating+remapped
             72    peering
             25    stale+activating+degraded
             23    stale+activating+remapped
             9     stale+active+undersized
             6     stale+activating+undersized+degraded+remapped
             5     stale+active+undersized+degraded
             4     down+remapped
             4     activating+degraded+remapped
             2     active+clean+inconsistent
             1     stale+activating+degraded+remapped
             1     stale+active+clean+remapped
             1     stale+remapped+peering
             1     remapped+peering
             1     unknown

  io:
    client:   0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr

Thanks
Arun

On Thu, Jan 3, 2019 at 7:19 PM Chris <bitskrieg@xxxxxxxxxxxxx> wrote:

If you added OSDs and then deleted them repeatedly without waiting for replication to finish as the cluster attempted to re-balance across them, it's highly likely that you are permanently missing PGs (especially if the disks were zapped each time).

If those 3 down OSDs can be revived there is a (small) chance that you can right the ship, but 1400pg/OSD is pretty extreme. I'm surprised the cluster even let you do that - this sounds like a data loss event.

Bring back the 3 OSDs and see what those 2 inconsistent pgs look like with ceph pg query.

On January 3, 2019 21:59:38 Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:
Hi,

Recently I tried adding a new node (OSD) to the Ceph cluster using the ceph-deploy tool. Since I was experimenting with the tool, I ended up deleting the OSD nodes on the new server a couple of times.

Now that the Ceph OSDs are running on the new server, some of the cluster's PGs (10-15%) seem to be inactive, and they are not recovering or rebalancing. I'm not sure what to do. I tried shutting down the OSDs on the new server.

Status:

[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            373907/12391198 objects misplaced (3.018%)
            2 scrub errors
            9677 PGs pending on creation
            Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 2717 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
            52486 slow requests are blocked > 32 sec
            9287 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2968 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 36 up, 36 in; 51 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10941 GB
    usage:   21727 GB used, 45308 GB / 67035 GB avail
    pgs:     13.073% pgs not active
             178350/12391198 objects degraded (1.439%)
             373907/12391198 objects misplaced (3.018%)
             46177 active+clean
             5054  down
             1173  stale+down
             1084  stale+active+undersized
             547   activating
             201   stale+active+undersized+degraded
             158   stale+activating
             96    activating+degraded
             46    stale+active+clean
             42    activating+remapped
             34    stale+activating+degraded
             23    stale+activating+remapped
             6     stale+activating+undersized+degraded+remapped
             6     activating+undersized+degraded+remapped
             2     activating+degraded+remapped
             2     active+clean+inconsistent
             1     stale+activating+degraded+remapped
             1     stale+active+clean+remapped
             1     stale+remapped
             1     down+remapped
             1     remapped+peering

  io:
    client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com