On 06/02/17 17:38, Sage Weil wrote:
> I don't see how this would be any different from a peering perspective.
> The pattern of data movement and remapping would be different, but there's
> no difference in this sequence that seems like it relates to peering
> taking 10s of seconds. :/
>
> How confident are you that this was a real effect? Could it be that when
> you tried the second method your disk caches were warm vs the first time
> around when they were cold?
>
> sage

Now that the new disks are added, I'm much more confident. See below... one
time I crush weighted 6 at once, with issues, and the other times it was other
disks, with no issues as long as I don't crush reweight too many at once.

On 06/04/17 00:58, Peter Maloney wrote:
> On 06/03/17 09:51, Dan van der Ster wrote:
>> On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
>> <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> ...
>>> And Sage, if that's true, then couldn't ceph by default just do the
>>> first kind of peering work before any pgs, pools, clients, etc. are
>>> affected, before moving on to the stuff that affects clients, regardless
>>> of which steps were used? At some point during adding those 2 nodes I
>>> was thinking how could ceph be so broken and mysterious... why does it
>>> just hang there? Would it do this during recovery of a dead osd too? Now
>>> I know how to avoid it and that it shouldn't affect recovering dead osds
>>> (not changing crush weight)... but it would be nice for all users not to
>>> ever have to think that way. :)
>>>
>>> ...
>> Here's what we do:
>> 1. Create and start new OSDs with initial crush weight = 0.0. No PGs
>> should re-peer when these are booted.
>> 2. Run the reweight script, e.g. like this for some 6T drives:
>>
>> ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46
>>
>> In practice we've added >150 drives at once with that script -- using
>> that tiny delta.
>>
>> We use crush reweight because it "works for us (tm)". We haven't seen
>> any strange peering hangs, though we exercise this on hammer, not
>> (yet) jewel.
>> I hadn't thought of your method using osd reweight -- how do you add
>> new osds with an initial osd reweight? Maybe you create the osds in a
>> non-default root then move them after being reweighted to 0.0?
>>
>> Cheers, Dan
> I added them with crush weight 0, then my plan was to raise the weight
> like you do. That's basically what I did for all the other servers. But
> I fiddled with the crush map and had them in another root when I set
> reweight 0, then crush weight 6, then moved them to root default (long
> peering), then reweight 1 (short peering). But that wasn't what I
> planned on doing or plan to do in the future.
>
> I expect that would be the same as crush weight 0 and in the normal root
> when created, then when ready for peering, set reweight 0 first, then
> crush weight 6, then after peering is done, reweight 1 for a few at a
> time (ceph osd reweight ...; sleep 2; while ceph health | grep peering;
> do sleep 1; done ...).
>
> The next step in this upgrade is to replace 18 2TB disks with 6TB
> ones... I'll do it that way and find out whether it works without the
> extra root.

So I'm done removing the 18 2TB disks and adding the 6TB ones (plus replacing a
dead one). I did 6 disks at a time (all the 2TB disks on each node). I didn't
test raising the weight slowly, but I did test that setting the crush weight
straight to 6 on all of them at once (with reweight still 0) causes client
issues, whereas setting reweight back to 1 on all of them at once, even
multi-process as in the script below, works fine.
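(A side note on Dan's step 1: one way to have newly created OSDs come up with
crush weight 0 automatically, instead of juggling a separate crush root, is the
"osd crush initial weight" option. A minimal ceph.conf sketch -- not what I
actually ran, just the idea:

    [osd]
    # new OSDs get added to the crush map with weight 0 rather than their
    # size in TB, so starting them doesn't trigger peering or data movement
    osd crush initial weight = 0

The crush weight then gets raised explicitly afterwards, which is what the
script below does.)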
Here's the script that does the job well. First have the new osds created with
crush weight 0 and their daemons running; the script then finds them by weight
0 and works on them:

> # list osds with hosts next to them for easy filtering with awk
> # (doesn't support chassis, rack, etc. buckets)
> ceph_list_osd() {
>     ceph osd tree | awk '
>         BEGIN {found=0; host=""};
>         $3 == "host" {found=1; host=$4; getline};
>         $3 == "host" {found=0}
>         found || $3 ~ /osd\./ {print $0 " " host}'
> }
>
> peering_sleep() {
>     echo "sleeping"
>     sleep 2
>     while ceph health | grep -q peer; do
>         echo -n .
>         sleep 1
>     done
>     echo
>     sleep 5
> }
>
> # after the osds are already created, this reweights them to 'activate' them
> ceph_activate_osds() {
>     weight="$1"
>     host=$(hostname -s)
>
>     if [ -z "$weight" ]; then
>         weight=6.00099
>     fi
>
>     # for crush weight 0 osds, set reweight 0 first so that the non-zero
>     # crush weight set below won't cause as many blocked requests
>     for id in $(ceph_list_osd | awk '$2 == 0 {print $1}'); do
>         ceph osd reweight $id 0 &
>     done
>     wait
>     peering_sleep
>
>     # the harsh reweight, which we do slowly
>     for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
>         echo ceph osd crush reweight "osd.$id" "$weight"
>         ceph osd crush reweight "osd.$id" "$weight"
>         peering_sleep
>     done
>
>     # the light reweight
>     for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
>         ceph osd reweight $id 1 &
>     done
>     wait
> }
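Usage is roughly: put those functions in a file on each host that got new disks
(the path here is just an example), source it, and call ceph_activate_osds with
the crush weight the new osds should end up with:

    # run on each host that has new weight-0 osds
    . /root/bin/activate-osds.sh    # defines ceph_list_osd, peering_sleep, ceph_activate_osds
    ceph_activate_osds 6.00099      # target crush weight for the 6TB disks (also the default)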
And the ceph status in case it's somehow useful:

> root@ceph1:~ # ceph -s
>     cluster 684e4a3f-25fb-4b78-8756-62befa9be15e
>      health HEALTH_WARN
>             756 pgs backfill_wait
>             6 pgs backfilling
>             260 pgs degraded
>             183 pgs recovery_wait
>             260 pgs stuck degraded
>             945 pgs stuck unclean
>             60 pgs stuck undersized
>             60 pgs undersized
>             recovery 494450/38357551 objects degraded (1.289%)
>             recovery 26900171/38357551 objects misplaced (70.130%)
>      monmap e3: 3 mons at
> {ceph1=10.3.0.131:6789/0,ceph2=10.3.0.132:6789/0,ceph3=10.3.0.133:6789/0}
>             election epoch 614, quorum 0,1,2 ceph1,ceph2,ceph3
>       fsmap e322: 1/1/1 up {0=ceph2=up:active}, 2 up:standby
>      osdmap e119625: 60 osds: 60 up, 60 in; 933 remapped pgs
>             flags sortbitwise,require_jewel_osds
>       pgmap v19175947: 1152 pgs, 4 pools, 31301 GB data, 8172 kobjects
>             94851 GB used, 212 TB / 305 TB avail
>             494450/38357551 objects degraded (1.289%)
>             26900171/38357551 objects misplaced (70.130%)
>                  685 active+remapped+wait_backfill
>                  200 active+clean
>                  164 active+recovery_wait+degraded+remapped
>                   52 active+undersized+degraded+remapped+wait_backfill
>                   19 active+degraded+remapped+wait_backfill
>                   12 active+recovery_wait+degraded
>                    7 active+clean+scrubbing
>                    7 active+recovery_wait+undersized+degraded+remapped
>                    5 active+degraded+remapped+backfilling
>                    1 active+undersized+degraded+remapped+backfilling
> recovery io 900 MB/s, 240 objects/s
>   client io 79721 B/s rd, 10418 kB/s wr, 19 op/s rd, 137 op/s wr
>
> root@ceph1:~ # ceph osd tree
> ID WEIGHT    TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 336.06061 root default
> -2  64.01199     host ceph1
>  0   4.00099         osd.0       up  0.61998          1.00000
>  1   4.00099         osd.1       up  0.59834          1.00000
>  2   4.00099         osd.2       up  0.79213          1.00000
> 27   4.00099         osd.27      up  0.69460          1.00000
> 30   6.00099         osd.30      up  0.73935          1.00000
> 31   6.00099         osd.31      up  0.81180          1.00000
> 10   6.00099         osd.10      up  0.64571          1.00000
> 12   6.00099         osd.12      up  0.94655          1.00000
> 13   6.00099         osd.13      up  0.75957          1.00000
> 14   6.00099         osd.14      up  0.77515          1.00000
> 15   6.00099         osd.15      up  0.74663          1.00000
> 16   6.00099         osd.16      up  0.93401          1.00000
> -3  64.01181     host ceph2
>  3   4.00099         osd.3       up  0.69209          1.00000
>  4   4.00099         osd.4       up  0.75365          1.00000
>  5   4.00099         osd.5       up  0.80797          1.00000
> 28   4.00099         osd.28      up  0.66307          1.00000
> 32   6.00099         osd.32      up  0.81369          1.00000
> 33   6.00099         osd.33      up  1.00000          1.00000
>  9   6.00098         osd.9       up  0.58499          1.00000
> 17   6.00098         osd.17      up  0.90613          1.00000
> 18   6.00098         osd.18      up  0.73138          1.00000
> 19   6.00098         osd.19      up  0.80649          1.00000
> 20   6.00098         osd.20      up  0.51999          1.00000
> 21   6.00098         osd.21      up  0.79404          1.00000
> -4  64.01181     host ceph3
>  6   4.00099         osd.6       up  0.56717          1.00000
>  7   4.00099         osd.7       up  0.72240          1.00000
>  8   4.00099         osd.8       up  0.79919          1.00000
> 29   4.00099         osd.29      up  0.80109          1.00000
> 34   6.00099         osd.34      up  0.71120          1.00000
> 35   6.00099         osd.35      up  0.63611          1.00000
> 11   6.00098         osd.11      up  0.67000          1.00000
> 22   6.00098         osd.22      up  0.80756          1.00000
> 23   6.00098         osd.23      up  0.67000          1.00000
> 24   6.00098         osd.24      up  0.71599          1.00000
> 25   6.00098         osd.25      up  0.64540          1.00000
> 26   6.00098         osd.26      up  0.76378          1.00000
> -5  72.01199     host ceph4
> 36   6.00099         osd.36      up  0.74846          1.00000
> 37   6.00099         osd.37      up  0.71387          1.00000
> 38   6.00099         osd.38      up  0.71129          1.00000
> 39   6.00099         osd.39      up  0.76547          1.00000
> 40   6.00099         osd.40      up  0.73967          1.00000
> 41   6.00099         osd.41      up  0.64742          1.00000
> 42   6.00099         osd.42      up  0.81006          1.00000
> 44   6.00099         osd.44      up  0.65381          1.00000
> 45   6.00099         osd.45      up  0.77457          1.00000
> 46   6.00099         osd.46      up  0.82390          1.00000
> 47   6.00099         osd.47      up  0.85431          1.00000
> 43   6.00099         osd.43      up  0.64775          1.00000
> -6  72.01300     host ceph5
> 48   6.00099         osd.48      up  0.71269          1.00000
> 49   6.00099         osd.49      up  0.97649          1.00000
> 50   6.00099         osd.50      up  0.98079          1.00000
> 51   6.00099         osd.51      up  0.75307          1.00000
> 52   6.00099         osd.52      up  0.86545          1.00000
> 53   6.00099         osd.53      up  0.64278          1.00000
> 54   6.00099         osd.54      up  0.94551          1.00000
> 55   6.00099         osd.55      up  0.73465          1.00000
> 56   6.00099         osd.56      up  0.69908          1.00000
> 57   6.00099         osd.57      up  0.78789          1.00000
> 58   6.00099         osd.58      up  0.89081          1.00000
> 59   6.00099         osd.59      up  0.66379          1.00000