recovered osds come back into cluster with 2-3 times the data

We are running jewel (10.2.10) on our Ceph cluster with 6 OSD hosts and 3 MONs: 144 8TB drives across the 6 OSD hosts, all with uniform weights.

In tests simulating the failure of an entire OSD host, or even just a few drives on one host, every OSD drive we add back comes back in with at least twice as much data as it held before.

Here's a snippet of "ceph osd df tree" output to show what I mean:

ID  WEIGHT     REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS TYPE NAME
 -6  174.59921        -  174T 16530G  158T  9.25 1.18   0     host osd05
 96    7.27499  1.00000 7449G   520G 6929G  6.98 0.89 185         osd.96
 97    7.27499  1.00000 7449G   458G 6990G  6.16 0.78 179         osd.97
 98    7.27499  1.00000 7449G   415G 7033G  5.58 0.71 172         osd.98
 99    7.27499  1.00000 7449G   475G 6974G  6.38 0.81 168         osd.99
100    7.27499  1.00000 7449G   480G 6968G  6.45 0.82 175         osd.100
101    7.27499  1.00000 7449G   407G 7041G  5.47 0.70 174         osd.101
102    7.27499  1.00000 7449G   476G 6972G  6.40 0.81 187         osd.102
103    7.27499  1.00000 7449G   513G 6936G  6.89 0.88 170         osd.103
104    7.27499  1.00000 7449G   423G 7025G  5.69 0.72 175         osd.104
105    7.27499  1.00000 7449G   469G 6980G  6.30 0.80 170         osd.105
106    7.27499  1.00000 7449G   373G 7076G  5.01 0.64 177         osd.106
107    7.27499  1.00000 7449G   467G 6982G  6.27 0.80 180         osd.107
108    7.27499  1.00000 7449G   497G 6951G  6.68 0.85 166         osd.108
109    7.27499  1.00000 7449G   495G 6953G  6.66 0.85 174         osd.109
110    7.27499  1.00000 7449G   428G 7020G  5.75 0.73 172         osd.110
111    7.27499  1.00000 7449G   488G 6961G  6.55 0.83 191         osd.111
112    7.27499  1.00000 7449G   619G 6830G  8.31 1.06 200         osd.112
113    7.27499  1.00000 7449G   467G 6981G  6.28 0.80 174         osd.113
116    7.27489  1.00000 7449G  1324G 6124G 17.78 2.26 184         osd.116
117    7.27489  1.00000 7449G  1491G 5958G 20.02 2.55 210         osd.117
118    7.27489  1.00000 7449G  1277G 6171G 17.15 2.18 176         osd.118
119    7.27489  1.00000 7449G  1379G 6070G 18.51 2.35 191         osd.119
114    7.27489  1.00000 7449G  1358G 6090G 18.24 2.32 197         osd.114
115    7.27489  1.00000 7449G  1218G 6231G 16.35 2.08 173         osd.115


Drives 114 through 119 have been redeployed as if they had failed outright, and they show elevated USE and %USE compared to the other 18 OSDs on that host that were not redeployed.
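For reference, a quick way to rank the OSDs by utilization (a sketch assuming the jewel "ceph osd df" column layout, where %USE is field 7):

ceph osd df | sort -rnk7 | head -10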

To remove the OSDs we run the following commands, substituting the correct OSD id, after stopping the ceph-osd service(s):

ceph osd crush reweight osd.<id> 0
ceph osd out <id>
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>

Then we clear the partitions on the disks involved.
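For completeness, the whole removal sequence as a loop (a sketch; the ids are the six redeployed drives, and the systemd unit name and device path are assumptions that may differ per deployment):

# stop and fully remove each failed OSD from the cluster and CRUSH map
for id in 114 115 116 117 118 119; do
    systemctl stop ceph-osd@${id}
    ceph osd crush reweight osd.${id} 0
    ceph osd out ${id}
    ceph osd crush remove osd.${id}
    ceph auth del osd.${id}
    ceph osd rm ${id}
done
# one way to clear the partition table on each disk (device name is an example)
sgdisk --zap-all /dev/sdX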

We add them back via:

ceph-deploy osd create <host>:<disk>
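For example, for one drive on osd05 (jewel-era ceph-deploy HOST:DISK syntax; the host and device names here are only examples):

ceph-deploy disk zap osd05:sdb
ceph-deploy osd create osd05:sdb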

After recovery finishes, the cluster is back to HEALTH_OK, but with the unbalanced drives shown above.

cluster xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
     health HEALTH_OK
     monmap e1: 3 mons at {mon01=10.0.X.X:6789/0,mon02=10.0.X.X:6789/0,mon03=10.0.X.X:6789/0}
            election epoch 150, quorum 0,1,2 mon01,mon02,mon03
     osdmap e50114: 144 osds: 144 up, 144 in
            flags sortbitwise,require_jewel_osds
      pgmap v4825786: 8704 pgs, 5 pools, 61355 GB data, 26797 kobjects
            84343 GB used, 965 TB / 1047 TB avail
                8693 active+clean
                   7 active+clean+scrubbing
                   4 active+clean+scrubbing+deep
  client io 195 kB/s rd, 714 op/s rd, 0 op/s wr

We also noticed that pools which existed before this test process are about 30% slower in fio write tests. A brand-new pool created afterwards shows no slowdown.
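For reference, the flavor of fio test we run (a sketch using fio's rbd engine; the pool, image, and client names are placeholders rather than our exact test config):

fio --name=writetest --ioengine=rbd --clientname=admin \
    --pool=<pool> --rbdname=<test-image> --rw=write \
    --bs=4M --iodepth=16 --size=10G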

We have no idea why these OSDs are coming back in at 2 or 3 times the USE value, so thanks for any help on this,



Andrew Ferris
Network & System Management
UBC Centre for Heart & Lung Innovation
St. Paul's Hospital, Vancouver
http://www.hli.ubc.ca
 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


