Quoting Dan van der Ster (dan@xxxxxxxxxxxxxx):
> Haven't seen that exact issue.
>
> One thing to note though is that if osd_max_backfills is set to 1,
> then it can happen that PGs get into backfill state, taking that
> single reservation on a given OSD, and therefore the recovery_wait PGs
> can't get a slot.
> I suppose that backfill prioritization is supposed to prevent this,
> but in my experience luminous v12.2.8 doesn't always get it right.

That's also our experience. Even if the degraded PGs with backfill /
recovery state are given a higher (forced) priority, normal backfilling
still takes place anyway.

> So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
> the recovering PGs.

It was still on "1" indeed. We tend to crank that (and max recovery)
while keeping an eye on the max read and write apply latencies. In our
setup we can do 16 backfills concurrently, or 2 recoveries / 4
backfills. Recovery speeds are ~4 - 5 GB/s; pushing it beyond that
tends to crash OSDs.

We'll try your suggestion next time.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
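
P.S. For the archives, this is roughly how we'd bump those knobs at
runtime. Just a sketch: values are examples, and osd_recovery_max_active
is our guess at the "max recovery" setting mentioned above; adjust to
your own cluster and watch the apply latencies while you do.

   # raise the backfill / recovery reservations on all OSDs at runtime
   ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 2'

   # drop them back down once the degraded PGs have cleared
   ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'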