Hi Chris,
Indeed, that's what happened. I didn't set the noout flag either, and I zapped the disks on the new server every time. In my cluster status below, fre201 is the only new server.
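(For my own notes, and please correct me if I'm wrong: I believe the safer sequence next time would be to freeze the cluster before touching any OSDs, roughly:

ceph osd set noout        # don't let stopped OSDs get marked out
ceph osd set norebalance  # don't start shuffling data while I work
# ... add or remove OSDs on the new host ...
ceph osd unset norebalance
ceph osd unset noout

and only unset the flags once the new OSDs are up and the cluster has settled.)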
Current status after enabling the 3 OSDs on host fre201:
[root@fre201 ~]# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       70.92137 root default
 -2        5.45549     host fre101
  0   hdd  1.81850         osd.0        up  1.00000 1.00000
  1   hdd  1.81850         osd.1        up  1.00000 1.00000
  2   hdd  1.81850         osd.2        up  1.00000 1.00000
 -9        5.45549     host fre103
  3   hdd  1.81850         osd.3        up  1.00000 1.00000
  4   hdd  1.81850         osd.4        up  1.00000 1.00000
  5   hdd  1.81850         osd.5        up  1.00000 1.00000
 -3        5.45549     host fre105
  6   hdd  1.81850         osd.6        up  1.00000 1.00000
  7   hdd  1.81850         osd.7        up  1.00000 1.00000
  8   hdd  1.81850         osd.8        up  1.00000 1.00000
 -4        5.45549     host fre107
  9   hdd  1.81850         osd.9        up  1.00000 1.00000
 10   hdd  1.81850         osd.10       up  1.00000 1.00000
 11   hdd  1.81850         osd.11       up  1.00000 1.00000
 -5        5.45549     host fre109
 12   hdd  1.81850         osd.12       up  1.00000 1.00000
 13   hdd  1.81850         osd.13       up  1.00000 1.00000
 14   hdd  1.81850         osd.14       up  1.00000 1.00000
 -6        5.45549     host fre111
 15   hdd  1.81850         osd.15       up  1.00000 1.00000
 16   hdd  1.81850         osd.16       up  1.00000 1.00000
 17   hdd  1.81850         osd.17       up  0.79999 1.00000
 -7        5.45549     host fre113
 18   hdd  1.81850         osd.18       up  1.00000 1.00000
 19   hdd  1.81850         osd.19       up  1.00000 1.00000
 20   hdd  1.81850         osd.20       up  1.00000 1.00000
 -8        5.45549     host fre115
 21   hdd  1.81850         osd.21       up  1.00000 1.00000
 22   hdd  1.81850         osd.22       up  1.00000 1.00000
 23   hdd  1.81850         osd.23       up  1.00000 1.00000
-10        5.45549     host fre117
 24   hdd  1.81850         osd.24       up  1.00000 1.00000
 25   hdd  1.81850         osd.25       up  1.00000 1.00000
 26   hdd  1.81850         osd.26       up  1.00000 1.00000
-11        5.45549     host fre119
 27   hdd  1.81850         osd.27       up  1.00000 1.00000
 28   hdd  1.81850         osd.28       up  1.00000 1.00000
 29   hdd  1.81850         osd.29       up  1.00000 1.00000
-12        5.45549     host fre121
 30   hdd  1.81850         osd.30       up  1.00000 1.00000
 31   hdd  1.81850         osd.31       up  1.00000 1.00000
 32   hdd  1.81850         osd.32       up  1.00000 1.00000
-13        5.45549     host fre123
 33   hdd  1.81850         osd.33       up  1.00000 1.00000
 34   hdd  1.81850         osd.34       up  1.00000 1.00000
 35   hdd  1.81850         osd.35       up  1.00000 1.00000
-27        5.45549     host fre201
 36   hdd  1.81850         osd.36       up  1.00000 1.00000
 37   hdd  1.81850         osd.37       up  1.00000 1.00000
 38   hdd  1.81850         osd.38       up  1.00000 1.00000
[root@fre201 ~]#
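(For completeness, this is roughly how I brought the three fre201 OSDs back; I'm reconstructing from memory, so the exact steps may have differed slightly:

# on fre201
systemctl start ceph-osd@36
systemctl start ceph-osd@37
systemctl start ceph-osd@38
ceph osd in 36 37 38   # only needed if they don't come back "in" on their own

)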
[root@fre201 ~]# ceph -s
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            585791/12391450 objects misplaced (4.727%)
            2 scrub errors
            2374 PGs pending on creation
            Reduced data availability: 6578 pgs inactive, 2025 pgs down, 74 pgs peering, 1234 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 64969/12391450 objects degraded (0.524%), 616 pgs degraded, 20 pgs undersized
            96242 slow requests are blocked > 32 sec
            228 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2768 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 39 up, 39 in; 96 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10942 GB
    usage:   21900 GB used, 50721 GB / 72622 GB avail
    pgs:     0.002% pgs unknown
             12.050% pgs not active
             64969/12391450 objects degraded (0.524%)
             585791/12391450 objects misplaced (4.727%)
             47489 active+clean
              3670 activating
              1098 stale+down
               923 down
               575 activating+degraded
               563 stale+active+clean
               105 stale+activating
                78 activating+remapped
                72 peering
                25 stale+activating+degraded
                23 stale+activating+remapped
                 9 stale+active+undersized
                 6 stale+activating+undersized+degraded+remapped
                 5 stale+active+undersized+degraded
                 4 down+remapped
                 4 activating+degraded+remapped
                 2 active+clean+inconsistent
                 1 stale+activating+degraded+remapped
                 1 stale+active+clean+remapped
                 1 stale+remapped+peering
                 1 remapped+peering
                 1 unknown

  io:
    client: 0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr
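As you suggested, the next thing I plan to do is look at the two inconsistent PGs. My rough plan (the PG ID below is just a placeholder until I pull the real IDs from ceph health detail):

ceph health detail | grep inconsistent   # find the two inconsistent PG IDs
ceph pg <pgid> query                     # e.g. ceph pg 1.2f query
ceph pg repair <pgid>                    # only after reviewing the query output

Does that look like a sane approach?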
Thanks
Arun
On Thu, Jan 3, 2019 at 7:19 PM Chris <bitskrieg@xxxxxxxxxxxxx> wrote:
If you added OSDs and then deleted them repeatedly without waiting for replication to finish as the cluster attempted to re-balance across them, it's highly likely that you are permanently missing PGs (especially if the disks were zapped each time).

If those 3 down OSDs can be revived there is a (small) chance that you can right the ship, but 1400 PGs/OSD is pretty extreme. I'm surprised the cluster even let you do that - this sounds like a data loss event.

Bring back the 3 OSDs and see what those 2 inconsistent PGs look like with ceph pg query.

On January 3, 2019 21:59:38 Arun POONIA <arun.poonia@xxxxxxxxxxxxxxxxx> wrote:
Hi,

Recently I tried adding a new node (OSD) to the ceph cluster using the ceph-deploy tool. Since I was experimenting with the tool, I ended up deleting the OSD nodes on the new server a couple of times. Now that the ceph OSDs are running on the new server, the cluster PGs seem to be inactive (10-15%) and they are not recovering or rebalancing. Not sure what to do. I tried shutting down the OSDs on the new server.

Status:

[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) No such file or directory
  cluster:
    id:     adb9ad8e-f458-4124-bf58-7963a8d1391f
    health: HEALTH_ERR
            3 pools have many more objects per pg than average
            373907/12391198 objects misplaced (3.018%)
            2 scrub errors
            9677 PGs pending on creation
            Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 2717 pgs stale
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized
            52486 slow requests are blocked > 32 sec
            9287 stuck requests are blocked > 4096 sec
            too many PGs per OSD (2968 > max 200)

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
    mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
    osd: 39 osds: 36 up, 36 in; 51 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   18 pools, 54656 pgs
    objects: 6050k objects, 10941 GB
    usage:   21727 GB used, 45308 GB / 67035 GB avail
    pgs:     13.073% pgs not active
             178350/12391198 objects degraded (1.439%)
             373907/12391198 objects misplaced (3.018%)
             46177 active+clean
              5054 down
              1173 stale+down
              1084 stale+active+undersized
               547 activating
               201 stale+active+undersized+degraded
               158 stale+activating
                96 activating+degraded
                46 stale+active+clean
                42 activating+remapped
                34 stale+activating+degraded
                23 stale+activating+remapped
                 6 stale+activating+undersized+degraded+remapped
                 6 activating+undersized+degraded+remapped
                 2 activating+degraded+remapped
                 2 active+clean+inconsistent
                 1 stale+activating+degraded+remapped
                 1 stale+active+clean+remapped
                 1 stale+remapped
                 1 down+remapped
                 1 remapped+peering

  io:
    client: 0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com