Re: PG inactive - why?

I had to check the logs once again to find out what exactly I did...


1. Destroyed 2 OSDs from host pirat and recreated them, but backfilling was still in progress:

2022-10-26T13:22:13.744545+0200 mgr.skarb (mgr.40364478) 93039 : cluster [DBG] pgmap v94205: 285 pgs: 2 active+undersized+degraded+remapped+backfilling, 28 active+undersized+degraded+remapped+backfill_wait, 237 active+remapped+backfill_wait, 18 active+clean; 3.6 TiB data, 10 TiB used, 19 TiB / 29 TiB avail; 4.1 MiB/s rd, 2.7 MiB/s wr, 317 op/s; 145580/3082260 objects degraded (4.723%); 1528324/3082260 objects misplaced (49.585%); 17 MiB/s, 4 objects/s recovering

2. Added a 9th OSD to the same host and created a new pool using the new rule. The PG status was:

2022-10-26T13:39:37.673870+0200 mgr.skarb (mgr.40364478) 93619 : cluster [DBG] pgmap v94755: 413 pgs: 3 active+undersized+degraded+remapped+backfilling, 27 active+undersized+degraded+remapped+backfill_wait, 237 active+remapped+backfill_wait, 146 active+clean; 3.6 TiB data, 10 TiB used, 20 TiB / 30 TiB avail; 340 KiB/s rd, 1014 KiB/s wr, 162 op/s; 135452/3063849 objects degraded (4.421%); 1513993/3063849 objects misplaced (49.415%); 52 MiB/s, 13 objects/s recovering

3. I left it like this for 33 hours (not 9, my mistake).

4. Two PGs went inactive.
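
For completeness, commands along these lines show which PGs are stuck inactive and why (the PG id below is just a placeholder, not one of my actual PGs):

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg <pgid> query

The query output lists the up/acting sets and the recovery state, which usually points at what is blocking the PG.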


The original default rule was able to use the new OSD, but in the meantime (during step 3) I modified the default rule to place data only on hdd devices. I don't remember exactly when this change was made, but I'm sure it was at least 6 hours before step 4.
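
For reference, the usual way to get this effect is either to edit the decompiled CRUSH map by hand or to create a rule restricted to the hdd device class and point the pools at it, roughly like this (the rule name is just an example, not my exact one):

  ceph osd crush rule create-replicated replicated_hdd default host hdd
  ceph osd pool set <pool> crush_rule replicated_hdd
  ceph osd crush rule dump replicated_hdd

The dump should show the rule taking only the hdd class under the default root.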

Since then no changes have been made, and today everything works as expected according to the defined rules; health is OK (one PG is deep scrubbing, all the rest are active+clean).
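
For the record, I'm judging that from the usual status output, e.g.:

  ceph -s
  ceph pg stat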


Paweł


On 2.11.2022 at 13:49, Eugen Block wrote:
Hi,

So I guess that if max PGs per OSD was an issue, the problem should have appeared right after creating the new pool, am I right?

It would happen right after removing or adding OSDs (btw, the default is 250 PGs per OSD). But with only around 400 PGs and assuming a pool size of 3 you shouldn't be hitting that limit.
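
To rule that out you can compare the configured limit with the actual per-OSD PG counts, for example:

  ceph config get mon mon_max_pg_per_osd
  ceph osd df tree    # the PGS column shows how many PGs each OSD holds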

One thing that confuses me is the total number of PGs. After adding the last OSD I created a new pool to use the new rule. This new pool should also use the existing OSDs, and it created 128 new PGs, which changed the total PG count from 285 to 413. That happened approx. 9 hours before those 2 PGs went inactive. During those 9 hours the total PG count dropped to 410. Today I see that the total number of PGs has been adjusted to 225.

This just sounds like the autoscaler doing its job; you already wrote that it's enabled.
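
You can follow what it does with something like:

  ceph osd pool autoscale-status
  ceph osd pool get <pool> pg_num

The first command shows the current and target pg_num per pool and whether autoscaling is enabled for it.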
So just to get the story straight, this is what happened:

- You rebuilt 2 OSDs (so one entire host)
- Backfill finished
- You added the NVMe drive
- Then the inactive PGs appeared?

Or did I misunderstand something? I see inactive PGs for a very short time when I resize pools, but not for hours. To me it sounds a bit like CRUSH can't find a suitable host within its number of retries. Do your rules work as expected?
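
One way to check is to test the rules offline with crushtool, roughly (the rule id is a placeholder):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt    # inspect the rules
  crushtool -i crushmap.bin --test --rule <rule-id> --num-rep 3 --show-bad-mappings

If the last command prints bad mappings, CRUSH cannot place all replicas for that rule with the current tree.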


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



