We are running jewel (10.2.10) on our Ceph cluster with 6 OSD hosts and 3 MONs. There are 144 8TB drives across the 6 OSD hosts, all with uniform weights. In tests simulating the failure of an entire OSD host, or even just a few drives on one host, every OSD we redeploy comes back in with at least twice as much data as it held before. Here's a snippet of "ceph osd df tree" to show what I mean:

 ID    WEIGHT REWEIGHT  SIZE    USE AVAIL  %USE  VAR PGS TYPE NAME
 -6 174.59921        -  174T 16530G  158T  9.25 1.18   0 host osd05
 96   7.27499  1.00000 7449G   520G 6929G  6.98 0.89 185     osd.96
 97   7.27499  1.00000 7449G   458G 6990G  6.16 0.78 179     osd.97
 98   7.27499  1.00000 7449G   415G 7033G  5.58 0.71 172     osd.98
 99   7.27499  1.00000 7449G   475G 6974G  6.38 0.81 168     osd.99
100   7.27499  1.00000 7449G   480G 6968G  6.45 0.82 175     osd.100
101   7.27499  1.00000 7449G   407G 7041G  5.47 0.70 174     osd.101
102   7.27499  1.00000 7449G   476G 6972G  6.40 0.81 187     osd.102
103   7.27499  1.00000 7449G   513G 6936G  6.89 0.88 170     osd.103
104   7.27499  1.00000 7449G   423G 7025G  5.69 0.72 175     osd.104
105   7.27499  1.00000 7449G   469G 6980G  6.30 0.80 170     osd.105
106   7.27499  1.00000 7449G   373G 7076G  5.01 0.64 177     osd.106
107   7.27499  1.00000 7449G   467G 6982G  6.27 0.80 180     osd.107
108   7.27499  1.00000 7449G   497G 6951G  6.68 0.85 166     osd.108
109   7.27499  1.00000 7449G   495G 6953G  6.66 0.85 174     osd.109
110   7.27499  1.00000 7449G   428G 7020G  5.75 0.73 172     osd.110
111   7.27499  1.00000 7449G   488G 6961G  6.55 0.83 191     osd.111
112   7.27499  1.00000 7449G   619G 6830G  8.31 1.06 200     osd.112
113   7.27499  1.00000 7449G   467G 6981G  6.28 0.80 174     osd.113
116   7.27489  1.00000 7449G  1324G 6124G 17.78 2.26 184     osd.116
117   7.27489  1.00000 7449G  1491G 5958G 20.02 2.55 210     osd.117
118   7.27489  1.00000 7449G  1277G 6171G 17.15 2.18 176     osd.118
119   7.27489  1.00000 7449G  1379G 6070G 18.51 2.35 191     osd.119
114   7.27489  1.00000 7449G  1358G 6090G 18.24 2.32 197     osd.114
115   7.27489  1.00000 7449G  1218G 6231G 16.35 2.08 173     osd.115

Drives 114 through 119 have been redeployed as if they had failed outright, and they show elevated USE and %USE compared to the other 18 drives on the host that were not redeployed.

To remove the OSDs, after stopping the corresponding ceph-osd service(s), we run the following commands with the correct OSD IDs substituted:

  ceph osd crush reweight osd.<id> 0
  ceph osd out <id>
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>

Then we clear the partitions on the disks involved and add them back via "ceph-deploy osd create".

After the recovery finishes the cluster is at HEALTH_OK, but with the unbalanced drives shown above:

    cluster xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
     health HEALTH_OK
     monmap e1: 3 mons at {mon01=10.0.X.X:6789/0,mon02=10.0.X.X:6789/0,mon03=10.0.X.X:6789/0}
            election epoch 150, quorum 0,1,2 mon01,mon02,mon03
     osdmap e50114: 144 osds: 144 up, 144 in
            flags sortbitwise,require_jewel_osds
      pgmap v4825786: 8704 pgs, 5 pools, 61355 GB data, 26797 kobjects
            84343 GB used, 965 TB / 1047 TB avail
                8693 active+clean
                   7 active+clean+scrubbing
                   4 active+clean+scrubbing+deep
  client io 195 kB/s rd, 714 op/s rd, 0 op/s wr

We also noticed that the pools which existed before this test process are about 30% slower in FIO write tests. A brand-new pool does not show the slowdown.
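
To spell out the removal, the per-OSD sequence is along these lines, with osd.114 standing in for the real ID (the service stop shown is the systemd form; we repeat this for each affected drive):

  systemctl stop ceph-osd@114        # run on the OSD host
  ceph osd crush reweight osd.114 0
  ceph osd out 114
  ceph osd crush remove osd.114
  ceph auth del osd.114
  ceph osd rm 114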
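
The wipe and re-add are done from the admin node with ceph-deploy, roughly as below; the host and device names here are placeholders for the real ones:

  ceph-deploy disk zap osd05:sdb
  ceph-deploy osd create osd05:sdb
  # append :<journal-device> to the create line if a separate journal is used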
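
For reference, the FIO write test is a job of roughly this shape (sketched here with fio's rbd engine; the pool and image names are placeholders rather than our real ones, and the image has to exist beforehand):

  [global]
  ioengine=rbd
  clientname=admin
  # placeholder pool and image names
  pool=testpool
  rbdname=fio-test
  direct=1
  time_based
  runtime=60

  [seq-write]
  rw=write
  bs=4M
  iodepth=32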

We have no idea why these OSDs are coming back in at two to three times the USE value of the untouched drives, so thanks for any help on this.

Andrew Ferris
Network & System Management
UBC Centre for Heart & Lung Innovation
St. Paul's Hospital, Vancouver
http://www.hli.ubc.ca