Hi all,

Setup: Octopus - erasure 8-3

I had gotten to the point where I had some rather old OSD nodes that I wanted to replace with new ones. The procedure was planned like this:

* add new replacement OSD nodes
* set all OSDs on the retiring nodes to out
* wait for everything to rebalance
* remove the retiring nodes

All this started out nicely, with about 62% of all objects misplaced and needing to be moved. Existing OSDs were at most 70% full, and with the newly added OSDs the raw available size was 1.5 PiB (47% used). A scenario that seemed feasible to run smoothly, at least to me.

When around 50% of the misplaced objects remained, the cluster started to complain about backfillfull and nearfull OSDs. A bit of a surprise to me, as the raw capacity is only 47% used.

It seems that backfill does not happen in a prioritized manner, where the planned backfill starts with the OSDs that have the most available space, but rather "alphabetically" by PG name. Is this really true?

Anyway, I have tried to construct a script that builds a prioritized order for the PGs stuck in "backfill_wait", so that backfill starts with the PGs whose target OSDs have the most available space. If several shards of a PG are being moved, the target OSD with the least available space determines the priority of the whole PG. Would this work?

#!/bin/bash
LC_NUMERIC=en_US.UTF-8

# Per-OSD available space as "<osd-id> <avail>#" entries (e.g. "281 16T#"),
# taken from "ceph osd df".
OSD_DF="$(ceph osd df | awk '{print $1,$15,$16}' | sed 's/ TiB/T#/g' | sed 's/ GiB/G#/g' | sed 's/ B/#/g' | grep '^[0-9]')"

# An upper bound (largest per-OSD value, in bytes) used to initialize the
# "least available space on a target OSD" search for each PG below.
OSD_AVAIL_MAX=$(ceph osd df | awk '{print $5,$6}' | grep B | grep '^[0-9]' | sed 's/ TiB/T/g' | sed 's/ GiB/G/g' | sed 's/ B//g' | numfmt --from=iec | sort -n | tail -n1)

# Walk all PGs stuck in a *wait state (e.g. backfill_wait).
for PG in $(ceph pg dump_stuck 2>/dev/null | grep wait | awk '{print $1}'); do
    # "ceph pg map" prints "... -> up [<osds>] acting [<osds>]";
    # the up set is where the PG is going, the acting set is where it is now.
    CEPH_PG_MAP=$(ceph pg map ${PG})
    PGS_NEW=$(echo ${CEPH_PG_MAP} | awk -F'[' '{print $2}' | awk -F']' '{print $1}')
    PGS_OLD=$(echo ${CEPH_PG_MAP} | awk -F'[' '{print $3}' | awk -F']' '{print $1}')

    NUM=1
    OSD_AVAIL=${OSD_AVAIL_MAX}
    OLD_SHARDS=$(echo ${PGS_OLD} | sed 's/,/ /g')

    # Compare acting and up sets shard by shard.
    for OLD_SHARD in ${OLD_SHARDS}; do
        NEW_SHARD=$(echo ${PGS_NEW} | awk -v a="${NUM}" -F',' '{print $a}')
        #echo "OLD_SHARD=${OLD_SHARD} NEW_SHARD=${NEW_SHARD}"
        if [[ ${OLD_SHARD} != ${NEW_SHARD} ]]; then
            # Shard is moving: record the "<from> <to>" pair for pg-upmap-items.
            PG_MV_ALL="${PG_MV_ALL}${OLD_SHARD} ${NEW_SHARD} "
            # Available space (in bytes) on the target OSD of this shard.
            OSD_AVAIL_NEW=$(echo "${OSD_DF}" | grep "^${NEW_SHARD} " | awk '{print $2}' | tr -d '#' | numfmt --from=iec)
            #echo "OSD_AVAIL_NEW=$OSD_AVAIL_NEW"
            # Keep track of the target OSD with the least available space;
            # it decides the priority of the whole PG.
            if [[ ${OSD_AVAIL_NEW} -lt ${OSD_AVAIL} ]]; then
                OSD_LEAST_AVAIL=${NEW_SHARD}
                OSD_AVAIL=${OSD_AVAIL_NEW}
            fi
        fi
        NUM=$(( NUM + 1 ))
    done

    echo "ceph osd pg-upmap-items ${PG} ${PG_MV_ALL} #${OSD_AVAIL}# bytes available on most full OSD (${OSD_LEAST_AVAIL})"
    unset PG_MV_ALL
    unset OSD_AVAIL_NEW
    unset OSD_LEAST_AVAIL
# Sort so the PGs whose fullest target OSD has the most free space come
# first, and humanize the byte count again.
done | grep '^ceph' | sort -rn -t'#' -k2 | numfmt -d'#' --field 2 --to=iec | sed 's/# /iB /g'

The script does not do anything by itself at this point; it only prints "ceph osd pg-upmap-items" commands, which then need to be piped into bash.
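For context, the up and acting sets that the script pulls apart come from "ceph pg map". Its output looks roughly like this (PG and OSD ids here are made up); the up set is where the PG is headed, the acting set is where the shards currently sit:

    $ ceph pg map 20.6fa
    osdmap e123456 pg 20.6fa (20.6fa) -> up [281,110,84,161,6,33,47,12,208,99,5] acting [364,110,84,161,6,33,47,12,208,99,5]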
The generated commands look like this:

ceph osd pg-upmap-items 20.6fa 281 364 #16TiB bytes available on most full OSD (364)
ceph osd pg-upmap-items 20.45 317 413 115 360 85 396 68 374 188 321 #6.2TiB bytes available on most full OSD (321)
ceph osd pg-upmap-items 20.6b9 334 380 110 404 84 347 161 362 6 391 #5.9TiB bytes available on most full OSD (347)
ceph osd pg-upmap-items 20.69e 315 388 148 366 250 404 118 319 102 354 #5.9TiB bytes available on most full OSD (319)
ceph osd pg-upmap-items 20.56 259 368 120 319 52 384 31 349 329 414 #5.9TiB bytes available on most full OSD (319)
ceph osd pg-upmap-items 20.4d 338 410 329 370 93 388 29 351 290 326 64 346 #5.9TiB bytes available on most full OSD (326)
ceph osd pg-upmap-items 20.58 152 332 #5.8TiB bytes available on most full OSD (332)
ceph osd pg-upmap-items 20.7bc 344 322 329 267 72 339 183 410 87 387 53 358 209 177 98 375 #2.2TiB bytes available on most full OSD (267)
ceph osd pg-upmap-items 20.59 73 292 114 414 29 367 110 301 166 353 340 385 83 208 #2.0TiB bytes available on most full OSD (301)
ceph osd pg-upmap-items 20.f 185 395 344 366 32 335 119 317 4 233 316 360 98 408 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.734 323 391 86 191 8 379 65 414 58 326 272 362 187 160 #1.9TiB bytes available on most full OSD (191)
ceph osd pg-upmap-items 20.732 342 350 88 234 17 157 234 409 215 346 265 395 14 265 #1.9TiB bytes available on most full OSD (265)
ceph osd pg-upmap-items 20.6fb 332 411 319 159 309 351 102 397 85 377 46 322 24 306 53 200 240 338 #1.9TiB bytes available on most full OSD (306)
ceph osd pg-upmap-items 20.6c5 334 371 30 340 70 266 241 407 3 233 186 356 40 312 294 391 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b4 344 338 226 389 319 362 309 411 85 379 248 233 121 318 0 254 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b1 325 292 35 371 347 153 146 390 12 343 88 327 27 355 54 250 192 408 #1.9TiB bytes available on most full OSD (153)
ceph osd pg-upmap-items 20.57 82 389 282 356 103 165 62 284 67 408 252 366 #1.9TiB bytes available on most full OSD (165)
ceph osd pg-upmap-items 20.50 244 355 319 228 154 397 63 317 113 378 97 276 288 150 #1.9TiB bytes available on most full OSD (228)
ceph osd pg-upmap-items 20.47 343 351 107 283 81 332 76 398 160 410 26 378 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3e 56 322 31 283 330 377 107 360 199 309 190 385 78 406 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3b 91 349 312 414 268 386 45 244 125 371 #1.9TiB bytes available on most full OSD (244)
ceph osd pg-upmap-items 20.3a 277 371 290 359 91 415 165 392 107 167 #1.9TiB bytes available on most full OSD (167)
ceph osd pg-upmap-items 20.39 74 175 18 302 240 393 3 269 224 374 194 408 173 364 #1.9TiB bytes available on most full OSD (302)
...........
.......

If I were to put this into effect, I would first set norecover and nobackfill, then run the script and pipe its output into bash, and finally unset norecover and nobackfill again (a rough sketch of this is included at the end of this mail). But I am uncertain if it would work, or even if it is a good idea?

It would be nice if Ceph did something similar automatically 🙂 Or maybe Ceph already does something like this, and I have just not been able to find it? If Ceph were to do this, it would also be nice if the priority of the backfill_wait PGs was recalculated, perhaps every 24 hours, as the OSD availability landscape of course changes during backfill. I imagine this could especially stabilize recovery/rebalance on systems where space is a little tight.
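For reference, the whole procedure I have in mind would look roughly like this (only a sketch; "prioritized_backfill.sh" is just a placeholder name for the script above):

    # pause backfill/recovery while the upmaps are injected
    ceph osd set norecover
    ceph osd set nobackfill

    # generate the prioritized pg-upmap-items commands and apply them;
    # the trailing "#..." annotation on each line starts a new word,
    # so bash simply treats it as a comment
    ./prioritized_backfill.sh | bash

    # let backfill/recovery continue with the new mappings
    ceph osd unset nobackfill
    ceph osd unset norecover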
Best regards,
Jesper

--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf: +45 50906203

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx