Hi all,

Setup: Octopus - erasure 8-3

I had gotten to the point where I had some rather old OSD nodes that I wanted to replace with new ones. The procedure was planned like this:

* add new replacement OSD nodes
* set all OSDs on the retiring nodes to out
* wait for everything to rebalance
* remove the retiring nodes

All this started out nicely, with about 62% of all objects misplaced and needing to be moved. Existing OSDs were at most 70% full, and with the newly added OSDs the raw available size was 1.5 PiB (47% used). A scenario that seemed feasible to run smoothly, at least to me.

When around 50% of the misplaced objects remained, the cluster started to complain about backfillfull and nearfull OSDs. A bit of a surprise to me, as the raw capacity is only 47% used.

It seems that backfill does not happen in a prioritized manner, where the planned backfill starts with the OSDs that have the most available space, but rather "alphabetically" by PG name. Is this really true?

Anyway, I have tried to construct a script that builds a prioritized order for the PGs stuck in "backfill_wait", so that backfill starts with the PGs whose target OSDs have the most available space. If several shards of a PG are being moved, the target OSD with the least available space determines the priority of the whole PG. Would this work?

#!/bin/bash
LC_NUMERIC=en_US.UTF-8

# Per-OSD available space as "<osd-id> <avail>#" entries (e.g. "281 16T#"),
# taken from "ceph osd df".
OSD_DF="$(ceph osd df | awk '{print $1,$15,$16}' | sed 's/ TiB/T#/g' | sed 's/ GiB/G#/g' | sed 's/ B/#/g' | grep '^[0-9]')"

# An upper bound (largest per-OSD value, in bytes) used to initialize the
# "least available space on a target OSD" search for each PG below.
OSD_AVAIL_MAX=$(ceph osd df | awk '{print $5,$6}' | grep B | grep '^[0-9]' | sed 's/ TiB/T/g' | sed 's/ GiB/G/g' | sed 's/ B//g' | numfmt --from=iec | sort -n | tail -n1)

# Walk all PGs stuck in a *wait state (e.g. backfill_wait).
for PG in $(ceph pg dump_stuck 2>/dev/null | grep wait | awk '{print $1}'); do
    # "ceph pg map" prints "... -> up [<osds>] acting [<osds>]";
    # the up set is where the PG is going, the acting set is where it is now.
    CEPH_PG_MAP=$(ceph pg map ${PG})
    PGS_NEW=$(echo ${CEPH_PG_MAP} | awk -F'[' '{print $2}' | awk -F']' '{print $1}')
    PGS_OLD=$(echo ${CEPH_PG_MAP} | awk -F'[' '{print $3}' | awk -F']' '{print $1}')

    NUM=1
    OSD_AVAIL=${OSD_AVAIL_MAX}
    OLD_SHARDS=$(echo ${PGS_OLD} | sed 's/,/ /g')

    # Compare acting and up sets shard by shard.
    for OLD_SHARD in ${OLD_SHARDS}; do
        NEW_SHARD=$(echo ${PGS_NEW} | awk -v a="${NUM}" -F',' '{print $a}')
        #echo "OLD_SHARD=${OLD_SHARD} NEW_SHARD=${NEW_SHARD}"
        if [[ ${OLD_SHARD} != ${NEW_SHARD} ]]; then
            # Shard is moving: record the "<from> <to>" pair for pg-upmap-items.
            PG_MV_ALL="${PG_MV_ALL}${OLD_SHARD} ${NEW_SHARD} "
            # Available space (in bytes) on the target OSD of this shard.
            OSD_AVAIL_NEW=$(echo "${OSD_DF}" | grep "^${NEW_SHARD} " | awk '{print $2}' | tr -d '#' | numfmt --from=iec)
            #echo "OSD_AVAIL_NEW=$OSD_AVAIL_NEW"
            # Keep track of the target OSD with the least available space;
            # it decides the priority of the whole PG.
            if [[ ${OSD_AVAIL_NEW} -lt ${OSD_AVAIL} ]]; then
                OSD_LEAST_AVAIL=${NEW_SHARD}
                OSD_AVAIL=${OSD_AVAIL_NEW}
            fi
        fi
        NUM=$(( NUM + 1 ))
    done

    echo "ceph osd pg-upmap-items ${PG} ${PG_MV_ALL} #${OSD_AVAIL}# bytes available on most full OSD (${OSD_LEAST_AVAIL})"
    unset PG_MV_ALL
    unset OSD_AVAIL_NEW
    unset OSD_LEAST_AVAIL
# Sort so the PGs whose fullest target OSD has the most free space come
# first, and humanize the byte count again.
done | grep '^ceph' | sort -rn -t'#' -k2 | numfmt -d'#' --field 2 --to=iec | sed 's/# /iB /g'

The script does not do anything by itself at this point; it only prints "ceph osd pg-upmap-items" commands, which then need to be piped into bash.
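For context, the up and acting sets that the script pulls apart come from "ceph pg map". Its output looks roughly like this (PG and OSD ids here are made up); the up set is where the PG is headed, the acting set is where the shards currently sit:

    $ ceph pg map 20.6fa
    osdmap e123456 pg 20.6fa (20.6fa) -> up [281,110,84,161,6,33,47,12,208,99,5] acting [364,110,84,161,6,33,47,12,208,99,5]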
The generated commands look like this:

ceph osd pg-upmap-items 20.6fa 281 364 #16TiB bytes available on most full OSD (364)
ceph osd pg-upmap-items 20.45 317 413 115 360 85 396 68 374 188 321 #6.2TiB bytes available on most full OSD (321)
ceph osd pg-upmap-items 20.6b9 334 380 110 404 84 347 161 362 6 391 #5.9TiB bytes available on most full OSD (347)
ceph osd pg-upmap-items 20.69e 315 388 148 366 250 404 118 319 102 354 #5.9TiB bytes available on most full OSD (319)
ceph osd pg-upmap-items 20.56 259 368 120 319 52 384 31 349 329 414 #5.9TiB bytes available on most full OSD (319)
ceph osd pg-upmap-items 20.4d 338 410 329 370 93 388 29 351 290 326 64 346 #5.9TiB bytes available on most full OSD (326)
ceph osd pg-upmap-items 20.58 152 332 #5.8TiB bytes available on most full OSD (332)
ceph osd pg-upmap-items 20.7bc 344 322 329 267 72 339 183 410 87 387 53 358 209 177 98 375 #2.2TiB bytes available on most full OSD (267)
ceph osd pg-upmap-items 20.59 73 292 114 414 29 367 110 301 166 353 340 385 83 208 #2.0TiB bytes available on most full OSD (301)
ceph osd pg-upmap-items 20.f 185 395 344 366 32 335 119 317 4 233 316 360 98 408 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.734 323 391 86 191 8 379 65 414 58 326 272 362 187 160 #1.9TiB bytes available on most full OSD (191)
ceph osd pg-upmap-items 20.732 342 350 88 234 17 157 234 409 215 346 265 395 14 265 #1.9TiB bytes available on most full OSD (265)
ceph osd pg-upmap-items 20.6fb 332 411 319 159 309 351 102 397 85 377 46 322 24 306 53 200 240 338 #1.9TiB bytes available on most full OSD (306)
ceph osd pg-upmap-items 20.6c5 334 371 30 340 70 266 241 407 3 233 186 356 40 312 294 391 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b4 344 338 226 389 319 362 309 411 85 379 248 233 121 318 0 254 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b1 325 292 35 371 347 153 146 390 12 343 88 327 27 355 54 250 192 408 #1.9TiB bytes available on most full OSD (153)
ceph osd pg-upmap-items 20.57 82 389 282 356 103 165 62 284 67 408 252 366 #1.9TiB bytes available on most full OSD (165)
ceph osd pg-upmap-items 20.50 244 355 319 228 154 397 63 317 113 378 97 276 288 150 #1.9TiB bytes available on most full OSD (228)
ceph osd pg-upmap-items 20.47 343 351 107 283 81 332 76 398 160 410 26 378 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3e 56 322 31 283 330 377 107 360 199 309 190 385 78 406 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3b 91 349 312 414 268 386 45 244 125 371 #1.9TiB bytes available on most full OSD (244)
ceph osd pg-upmap-items 20.3a 277 371 290 359 91 415 165 392 107 167 #1.9TiB bytes available on most full OSD (167)
ceph osd pg-upmap-items 20.39 74 175 18 302 240 393 3 269 224 374 194 408 173 364 #1.9TiB bytes available on most full OSD (302)
...........
.......

If I were to put this into effect, I would first set norecover and nobackfill, then run the script and pipe its output into bash, and finally unset norecover and nobackfill again (a rough sketch of this is included at the end of this mail). But I am uncertain if it would work, or even if it is a good idea?

It would be nice if Ceph did something similar automatically 🙂 Or maybe Ceph already does something like this, and I have just not been able to find it? If Ceph were to do this, it would also be nice if the priority of the backfill_wait PGs was recalculated, perhaps every 24 hours, as the OSD availability landscape of course changes during backfill. I imagine this could especially stabilize recovery/rebalance on systems where space is a little tight.
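For reference, the whole procedure I have in mind would look roughly like this (only a sketch; "prioritized_backfill.sh" is just a placeholder name for the script above):

    # pause backfill/recovery while the upmaps are injected
    ceph osd set norecover
    ceph osd set nobackfill

    # generate the prioritized pg-upmap-items commands and apply them;
    # the trailing "#..." annotation on each line starts a new word,
    # so bash simply treats it as a comment
    ./prioritized_backfill.sh | bash

    # let backfill/recovery continue with the new mappings
    ceph osd unset nobackfill
    ceph osd unset norecover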
Best regards,
Jesper

--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf: +45 50906203

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx