I had to check the logs once again to find out what exactly I did...
1. Destroyed 2 OSDs from host pirat and recreated them, but backfilling
was still in progress:
2022-10-26T13:22:13.744545+0200 mgr.skarb (mgr.40364478) 93039 : cluster
[DBG] pgmap v94205: 285 pgs: 2
active+undersized+degraded+remapped+backfilling, 28
active+undersized+degraded+remapped+backfill_wait, 237
active+remapped+backfill_wait, 18 active+clean; 3.6 TiB data, 10 TiB
used, 19 TiB / 29 TiB avail; 4.1 MiB/s rd, 2.7 MiB/s wr, 317 op/s;
145580/3082260 objects degraded (4.723%); 1528324/3082260 objects
misplaced (49.585%); 17 MiB/s, 4 objects/s recovering
2. Added a 9th OSD to the same host and created a new pool using the new
rule (see the command sketch after this list). PG status was:
2022-10-26T13:39:37.673870+0200 mgr.skarb (mgr.40364478) 93619 : cluster
[DBG] pgmap v94755: 413 pgs: 3
active+undersized+degraded+remapped+backfilling, 27
active+undersized+degraded+remapped+backfill_wait, 237
active+remapped+backfill_wait, 146 active+clean; 3.6 TiB data, 10 TiB
used, 20 TiB / 30 TiB avail; 340 KiB/s rd, 1014 KiB/s wr, 162 op/s;
135452/3063849 objects degraded (4.421%); 1513993/3063849 objects
misplaced (49.415%); 52 MiB/s, 13 objects/s recovering
3. I left it like this for 33 hours (not 9, my mistake).
4. Two PGs went inactive.
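(For illustration of step 2: creating a device-class rule plus a pool on it
generally looks like the commands below; the rule/pool names and the class
are placeholders, not necessarily what I actually used.)

   # check which device class the new drive got (often ssd or nvme)
   ceph osd crush class ls
   # rule that places replicas across hosts, restricted to that class
   ceph osd crush rule create-replicated newpool_rule default host nvme
   # new pool with 128 PGs on that rule
   ceph osd pool create newpool 128 128 replicated newpool_rule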
The original default rule was able to use the new OSD, but in the meantime
(during step 3) I modified the default rule to place data only on hdd
drives. I don't remember exactly when this change was made, but I'm sure it
was done no later than 6 hours before step 4.
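(For reference, the hdd-only change is the usual kind of crush map edit; a
rough sketch of the procedure, not my exact commands:)

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   # in the default replicated rule change "step take default"
   # to "step take default class hdd"
   crushtool -c crush.txt -o crush-new.bin
   ceph osd setcrushmap -i crush-new.bin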
Since then no changes have been made, and today everything works as
expected according to the defined rules; health is OK (one PG is deep
scrubbing, all the rest are active+clean).
Paweł
On 2.11.2022 at 13:49, Eugen Block wrote:
Hi,
So I guess that if max PGs per OSD were an issue, the problem should
appear right after creating the new pool, am I right?
it would happen right after removing or adding OSDs (btw, the default
is 250 PGs/OSD). But with only around 400 PGs and assuming a pool size
of 3 you shouldn't be hitting that limit.
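A quick way to double-check that, for example:

   ceph osd df tree                         # PGS column = PGs per OSD
   ceph config get mon mon_max_pg_per_osd   # default 250
   # rough estimate, assuming 9 OSDs in total and size 3 everywhere:
   # 413 PGs * 3 replicas / 9 OSDs is about 138 PGs per OSD, far below 250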
One thing that confuses me is the total number of PGs. After adding the
last OSD I created a new pool to use the new rule. This new pool should
also use the existing OSDs, and it created 128 new PGs, which changed
the total count of PGs from 285 to 413. That happened approx. 9 hours
before those 2 PGs went inactive. During those 9 hours the total count
of PGs dropped to 410. Today I see that the total number of PGs has been
adjusted to 225.
This just sounds like the autoscaler doing its job; you already wrote
that it's enabled.
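You can see what it is planning per pool with, e.g.:

   ceph osd pool autoscale-status   # compare PG_NUM with NEW PG_NUM
   ceph osd pool get <pool> pg_num  # current value for one pool (<pool> is a placeholder)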
So just to get the story straight, this is what happened:
- You rebuilt 2 OSDs (so one entire host)
- Backfill finished
- You added the NVMe drive
- Then the inactive PGs appeared?
Or did I misunderstand something? I see inactive PGs for a very short
period of time when I resize pools, but not for hours. To me it sounds
a bit like CRUSH can't find a suitable host within its number of
retries. Do your rules work as expected?
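One way to rule that out is to test the rules offline against your crush
map, something like this (take the rule id from ceph osd crush rule dump;
<rule_id> below is a placeholder):

   ceph osd getcrushmap -o crush.bin
   # prints any mapping where the rule returns fewer than 3 OSDs;
   # no output means the rule can always find enough hosts
   crushtool -i crush.bin --test --rule <rule_id> --num-rep 3 --show-bad-mappings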
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx