Is there a bug in backfill scheduling?

Hi all,

There have been many reports about overly slow backfill lately, and most of them seemed related to a problem with mclock op scheduling in quincy. The hallmark was that backfill started fast and then slowed down a lot. I now see the same behaviour on an octopus cluster with wpq, and it looks very much like a problem with scheduling backfill operations. Here is what I see:

We added 95 disks to a set of disks shared by 2 pools. This is about 8% of the total number of disks, and they were distributed over all 12 OSD hosts. The 2 pools are 8+2 and 8+3 EC fs-data pools. Initially the backfill was as fast as expected, but over the last day it was really slow (compared with expectation). Only 33 PGs were backfilling. I have osd_max_backfills=3, and a simple estimate says there should be between 100 and 200 PGs backfilling.
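For reference, the rough arithmetic behind that estimate, plus a quick way to count backfilling PGs, is sketched below. This is only a back-of-the-envelope estimate based on the usual backfill reservation model (one local reservation on the primary plus one remote reservation per backfill target), not on measured values:

# 95 new OSDs x osd_max_backfills=3 gives roughly 285 incoming reservation
# slots on the new disks. Each backfilling EC PG (8+2 / 8+3) also holds a
# local reservation on its primary and may target more than one new OSD,
# so on the order of 100-200 concurrently backfilling PGs seems plausible.

# Count the PGs currently backfilling:
ceph pg dump pgs_brief 2>/dev/null | grep -c backfilling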

To speed things up, I increased osd_max_backfills to 5 and the number of backfilling PGs jumped right up to over 200. That's way more than the relative increase would warrant. Just to check, I set osd_max_backfills=3 again to see if the number of PGs would drop back to about 30. But no! Now I have 142 PGs backfilling, which is in the expected range.
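To watch how that count evolves over time without staring at ceph status, a simple loop like this (just a sketch, logging once a minute) does the job:

while true; do echo "$(date '+%F %T') $(ceph pg dump pgs_brief 2>/dev/null | grep -c backfilling)"; sleep 60; done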

This looks very much like PGs eligible for backfill don't start, or backfill reservations are dropped for some reason. Can anyone help figure out what the problem might be? I don't want to set up a cron job that toggles osd_max_backfills up and down; there must be something else at play here. Output of the ceph status and config set commands is below.
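If reservations are indeed being dropped, the admin socket of an affected OSD might show it. A rough sketch of what one could check (osd.123 is just a placeholder; dump_recovery_reservations should be available on octopus OSDs, but please verify on your release):

# On the host carrying the OSD, dump its local/remote recovery and backfill
# reservation queues:
ceph daemon osd.123 dump_recovery_reservations

# Confirm the value the daemon is actually running with:
ceph daemon osd.123 config get osd_max_backfills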

The number of backfilling PGs is decreasing again, and I would really like this to be stable by itself. To give an idea of the impact: we are talking about a rebalancing that takes either 2 weeks or 2 months. That's not a trivial issue.

Thanks and best regards,
Frank

[root@gnosis ~]# ceph config dump | sed -e "s/  */ /g" | grep :hdd | grep osd_max_backfills
 osd class:hdd advanced osd_max_backfills 3 
[root@gnosis ~]# ceph status
  cluster:
    id:     ###
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 7d)
    mgr: ceph-25(active, since 10w), standbys: ceph-03, ceph-02, ceph-01, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 2d), 1260 in (since 2d); 6487 remapped pgs
 
  task status:
 
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.49G objects, 2.8 PiB
    usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
    pgs:     2466697364/12910834502 objects misplaced (19.106%)
             18571 active+clean
             6453  active+remapped+backfill_wait
             34    active+remapped+backfilling
             7     active+clean+snaptrim
 
  io:
    client:   30 MiB/s rd, 221 MiB/s wr, 1.08k op/s rd, 1.54k op/s wr
    recovery: 1.0 GiB/s, 380 objects/s

[root@gnosis ~]# ceph config set osd/class:hdd osd_max_backfills 5
[root@gnosis ~]# ceph status
  cluster:
    id:     ###
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 7d)
    mgr: ceph-25(active, since 10w), standbys: ceph-03, ceph-02, ceph-01, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 2d), 1260 in (since 2d); 6481 remapped pgs
 
  task status:
 
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.49G objects, 2.8 PiB
    usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
    pgs:     2466120124/12911195308 objects misplaced (19.101%)
             18574 active+clean
             6247  active+remapped+backfill_wait
             234   active+remapped+backfilling
             6     active+clean+snaptrim
             2     active+clean+scrubbing+deep
             2     active+clean+scrubbing
 
  io:
    client:   34 MiB/s rd, 236 MiB/s wr, 1.28k op/s rd, 2.03k op/s wr
    recovery: 6.4 GiB/s, 2.39k objects/s
 
[root@gnosis ~]# ceph config set osd/class:hdd osd_max_backfills 3
[root@gnosis ~]# ceph status
  cluster:
    id:     ###
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 7d)
    mgr: ceph-25(active, since 10w), standbys: ceph-03, ceph-02, ceph-01, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 2d), 1260 in (since 2d); 6481 remapped pgs
 
  task status:
 
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.49G objects, 2.8 PiB
    usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
    pgs:     2465974875/12911218789 objects misplaced (19.099%)
             18578 active+clean
             6339  active+remapped+backfill_wait
             142   active+remapped+backfilling
             6     active+clean+snaptrim
 
  io:
    client:   32 MiB/s rd, 247 MiB/s wr, 1.10k op/s rd, 1.57k op/s wr
    recovery: 4.2 GiB/s, 1.56k objects/s

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14