Hi,
So I guess that if max PGs per OSD was an issue, the problem should
have appeared right after creating the new pool, am I right?
It would also happen right after removing or adding OSDs (by the way,
the default is 250 PGs/OSD). But with only around 400 PGs and assuming
a pool size of 3 you shouldn't be hitting that limit.
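As a rough sanity check (just a back-of-the-envelope sketch, assuming
size 3 for all pools): 410 PGs * 3 replicas / 9 OSDs is only about 137
PG instances per OSD on average, far below the limit. The configured
limit can also be read directly:
  ceph config get mon mon_max_pg_per_osd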
One thing which confuses me is the total number of PGs. After adding
the last OSD I created a new pool to use the new rule. This new pool
should also use the existing OSDs, and it created 128 new PGs, which
changed the total count of PGs from 285 to 413. That happened approx.
9 hours before those 2 PGs went inactive. During those 9 hours the
total count of PGs dropped to 410. Today I see that the total PGs were
adjusted to 225.
This just sounds like the autoscaler at work; you already wrote
that it's enabled.
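If you want to see what the autoscaler is planning, the mgr module
reports its targets per pool:
  ceph osd pool autoscale-status
The PG_NUM and NEW PG_NUM columns should show which pools it is
currently growing or shrinking.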
So just to get the story straight, this is what happened:
- You rebuilt 2 OSDs (so one entire host)
- Backfill finished
- You added the NVMe drive
- Then the inactive PGs appeared?
Or did I misunderstand something? I see inactive PGs for a very short
period of time when I resize pools, but not for hours. To me it sounds
a bit like CRUSH can't find a suitable host within its number of
retries. Do your rules work as expected?
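One way to check that (a sketch; rule id 0 is the replicated_rule from
your dump) is to test the crush map offline:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings
If crushtool reports bad mappings for num-rep 3, the rule can't place
all three replicas with the current tree.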
Quoting Paweł Kowalski <pk@xxxxxxxxxxxx>:
No, I couldn't find anything odd in the osd.2 log, but I'm not very
familiar with Ceph, so it's likely I missed something.
Did I hit the 300 PGs/OSD limit? I'm not sure, since I can't find any
log entry about it, and I don't know how to calculate the PG count on
that OSD at that moment.
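(In case it helps: the current per-OSD PG count is visible in the PGS
column of
  ceph osd df tree
though that only shows the state right now, not at the moment the PGs
went inactive.)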
One thing which confuses me is the total number of PGs. After adding
the last OSD I created a new pool to use the new rule. This new pool
should also use the existing OSDs, and it created 128 new PGs, which
changed the total count of PGs from 285 to 413. That happened approx.
9 hours before those 2 PGs went inactive. During those 9 hours the
total count of PGs dropped to 410. Today I see that the total PGs were
adjusted to 225.
So I guess that if max PGs per OSD was an issue, the problem should
have appeared right after creating the new pool, am I right?
So, here's the osd tree:
root@storage3:/var/log/ceph# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         29.97499  root default
-12         14.55099      datacenter alp01
-25         14.55099          rack alp01_r1
 -9          7.27499              host galeon
  1    hdd   3.63799                  osd.1         up   1.00000  1.00000
  4    hdd   3.63799                  osd.4         up   1.00000  1.00000
 -7          7.27499              host storage3
  2    hdd   3.63799                  osd.2         up   1.00000  1.00000
  5    hdd   3.63799                  osd.5         up   1.00000  1.00000
-13                0      datacenter ark01
-27                0          rack ark01_r1
-11         15.42400      datacenter fab01
-14                0          rack fab01_r1
-16                0          rack fab01_r2
-19                0          rack fab01_r3
-20         15.42400          rack fab01_r4
 -3          8.14899              host pirat
  0    hdd   3.63799                  osd.0         up   1.00000  1.00000
  6    hdd   3.63799                  osd.6         up   1.00000  1.00000
  8   nvme   0.87299                  osd.8         up   1.00000  1.00000
 -5          7.27499              host skarb
  3    hdd   3.63799                  osd.3         up   1.00000  1.00000
  7    hdd   3.63799                  osd.7         up   1.00000  1.00000
And the rules:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule first_nvme_next_hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class nvme
    step chooseleaf firstn 1 type host
    step emit
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
The affected pool is the one using the default "replicated_rule".
The newly created pool uses the "first_nvme_next_hdd" rule.
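(To double-check which rule each pool uses, something like
  ceph osd pool get <poolname> crush_rule
should work, with <poolname> being whichever pool you want to inspect.)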
Paweł
On 29.10.2022 at 10:55, Eugen Block wrote:
Do you see anything in the OSD logs about why the PGs were inactive?
Did you hit the max PGs per OSD limit? What crush rule is in place for
the affected pool? Your osd tree could help as well to figure out
what happened.
Quoting Paweł Kowalski <pk@xxxxxxxxxxxx>:
Hello,
2 PGs were stuck in the "inactive" state. I'm trying to find out why
this happened. Here's what I found in the logs (only what I think is
important):
2022-10-27T22:39:29.707362+0200 mgr.skarb (mgr.40364478) 152930 :
cluster [DBG] pgmap v154892: 410 pgs: 1
active+remapped+backfilling, 2 active+clean+scrubbing+deep, 1
active+clean+snaptrim, 24 active+remapped+backfill_wait, 382
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail; 11
KiB/s rd, 1.1 MiB/s wr, 168 op/s; 142367/2859735 objects misplaced
(4.978%); 12 MiB/s, 3 objects/s recovering
2022-10-27T22:39:30.266560+0200 mon.skarb (mon.0) 4717148 :
cluster [DBG] osdmap e18492: 9 total, 9 up, 9 in
[...]
2022-10-27T22:39:31.317055+0200 osd.3 (osd.3) 106245 : cluster
[DBG] 2.51 starting backfill to osd.7 from (0'0,0'0] MAX to
18489'135867589
2022-10-27T22:39:31.524742+0200 osd.3 (osd.3) 106246 : cluster
[INF] 2.31 continuing backfill to osd.2 from
(18485'135835568,18490'135837516] MIN to 18490'135837516
2022-10-27T22:39:31.713775+0200 mgr.skarb (mgr.40364478) 152931 :
cluster [DBG] pgmap v154895: 409 pgs: 3 peering, 1
active+remapped+backfilling, 2 active+clean+scrubbing+deep, 24
active+remapped+backfill_wait, 379 active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30
TiB avail; 5.8 KiB/s rd, 874 KiB/s wr, 108 op/s; 142367/2859732
objects misplaced (4.978%); 10 MiB/s, 3 objects/s recovering
[...]
2022-10-27T22:39:33.333564+0200 mon.skarb (mon.0) 4717153 :
cluster [DBG] osdmap e18495: 9 total, 9 up, 9 in
2022-10-27T22:39:33.337963+0200 osd.2 (osd.2) 62236 : cluster
[DBG] 2.77 starting backfill to osd.1 from (0'0,0'0] MAX to
18489'149579766
2022-10-27T22:39:33.338240+0200 osd.2 (osd.2) 62237 : cluster
[DBG] 2.37 starting backfill to osd.1 from (0'0,0'0] MAX to
18489'149493224
2022-10-27T22:39:33.718993+0200 mgr.skarb (mgr.40364478) 152932 :
cluster [DBG] pgmap v154898: 409 pgs: 13 activating+remapped, 3
remapped+peering, 3 peering, 2 active+remapped+backfilling, 2
active+clean+scrubbing+deep, 23 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
804 KiB/s wr, 98 op/s; 208558/2859732 objects misplaced (7.293%);
35 MiB/s, 8 objects/s recovering
[...]
2022-10-27T22:39:51.784172+0200 mgr.skarb (mgr.40364478) 152941 :
cluster [DBG] pgmap v154913: 408 pgs: 2
remapped+premerge+backfill_wait+peered, 3
active+remapped+backfilling, 2 active+clean+scrubbing+deep, 87
active+remapped+backfill_wait, 314 active+clean; 3.5 TiB
data, 11 TiB used, 19 TiB / 30 TiB avail; 451185/2858388 objects
misplaced (15.785%)
[...]
2022-10-27T22:40:00.000177+0200 mon.skarb (mon.0) 4717169 :
cluster [INF] overall HEALTH_OK
2022-10-27T22:40:01.821923+0200 mgr.skarb (mgr.40364478) 152946 :
cluster [DBG] pgmap v154918: 408 pgs: 2
remapped+premerge+backfill_wait+peered, 3
active+remapped+backfilling, 2 active+clean+scrubbing+deep, 87
active+remapped+backfill_wait, 314 active+clean; 3.5 TiB
data, 11 TiB used, 19 TiB / 30 TiB avail; 117 MiB/s rd, 150 MiB/s
wr, 5.92k op/s; 450997/2859732 objects misplaced (15.771%); 264
MiB/s, 74 objects/s recovering
[...]
2022-10-27T22:40:32.115085+0200 mon.skarb (mon.0) 4717180 :
cluster [WRN] Health check failed: Reduced data availability: 2
pgs inactive (PG_AVAILABILITY)
2022-10-27T22:40:34.038597+0200 mgr.skarb (mgr.40364478) 152962 :
cluster [DBG] pgmap v154934: 408 pgs: 2
remapped+premerge+backfill_wait+peered, 3
active+remapped+backfilling, 2 active+clean+scrubbing+deep, 87
active+remapped+backfill_wait, 314 active+clean; 3.5 TiB data, 11
TiB used, 19 TiB / 30 TiB avail; 6.0 KiB/s rd, 3.2 MiB/s wr, 209
op/s; 450255/2859732 objects misplaced (15.745%); 44 MiB/s, 14
objects/s recovering
[...]
2022-10-27T22:49:59.062366+0200 mgr.skarb (mgr.40364478) 153243 :
cluster [DBG] pgmap v155219: 408 pgs: 1
active+clean+scrubbing+deep, 2
remapped+premerge+backfill_wait+peered, 4 active+remapped+backfilling,
84 active+remapped+backfill_wait, 317 active+clean; 3.5 TiB
data, 11 TiB used, 19 TiB / 30 TiB avail; 4.6 KiB/s rd, 1.6 MiB/s
wr, 108 op/s; 438424/2859736 objects misplaced (15.331%); 57
MiB/s, 16 objects/s recovering
2022-10-27T22:50:00.000198+0200 mon.skarb (mon.0) 4717320 :
cluster [WRN] Health detail: HEALTH_WARN Reduced data
availability: 2 pgs inactive
2022-10-27T22:50:00.000245+0200 mon.skarb (mon.0) 4717321 :
cluster [WRN] [WRN] PG_AVAILABILITY: Reduced data availability: 2
pgs inactive
2022-10-27T22:50:00.000263+0200 mon.skarb (mon.0) 4717322 :
cluster [WRN] pg 2.37 is stuck inactive for 10m, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-27T22:50:00.000307+0200 mon.skarb (mon.0) 4717323 :
cluster [WRN] pg 2.77 is stuck inactive for 10m, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-27T22:50:01.069589+0200 mgr.skarb (mgr.40364478) 153244 :
cluster [DBG] pgmap v155220: 408 pgs: 1
active+clean+scrubbing+deep, 2
remapped+premerge+backfill_wait+peered, 4 active+remapped+backfilling,
84 active+remapped+backfill_wait, 317 active+clean; 3.5 TiB
data, 11 TiB used, 19 TiB / 30 TiB avail; 4.6 KiB/s rd, 1.3 MiB/s
wr, 94 op/s; 438413/2859736 objects misplaced (15.331%); 50 MiB/s,
14 objects/s recovering
It recovered a few hours later:
2022-10-28T03:43:22.268113+0200 mgr.skarb (mgr.40364478) 162018 :
cluster [DBG] pgmap v164234: 408 pgs: 1
remapped+premerge+backfilling+peered, 1 clean+premerge+peered, 1
active+clean+snaptrim, 42 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
1.3 KiB/s rd, 105 KiB/s wr, 22 op/s; 233157/2841886 objects
misplaced (8.204%); 28 MiB/s, 7 objects/s recovering
2022-10-28T03:43:24.269793+0200 mgr.skarb (mgr.40364478) 162019 :
cluster [DBG] pgmap v164235: 408 pgs: 1
remapped+premerge+backfilling+peered, 1 clean+premerge+peered, 1
active+clean+snaptrim, 42 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
1.3 KiB/s rd, 91 KiB/s wr, 19 op/s; 233157/2841886 objects
misplaced (8.204%); 19 MiB/s, 5 objects/s recovering
2022-10-28T03:43:26.271601+0200 mgr.skarb (mgr.40364478) 162020 :
cluster [DBG] pgmap v164236: 408 pgs: 1
remapped+premerge+backfilling+peered, 1 clean+premerge+peered, 1
active+clean+snaptrim, 42 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
1.3 KiB/s rd, 77 KiB/s wr, 17 op/s; 233157/2841877 objects
misplaced (8.204%); 28 MiB/s, 7 objects/s recovering
2022-10-28T03:43:28.273799+0200 mgr.skarb (mgr.40364478) 162021 :
cluster [DBG] pgmap v164237: 408 pgs: 1
active+remapped+backfilling, 1
remapped+premerge+backfilling+peered, 1 clean+premerge+peered, 1
active+clean+snaptrim, 41 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
1.3 KiB/s rd, 64 KiB/s wr, 14 op/s; 233157/2841877 objects
misplaced (8.204%); 18 MiB/s, 4 objects/s recovering
2022-10-28T03:43:28.379923+0200 mon.skarb (mon.0) 4719355 :
cluster [DBG] osdmap e18746: 9 total, 9 up, 9 in
2022-10-28T03:43:30.381806+0200 mon.skarb (mon.0) 4719356 :
cluster [WRN] Health check update: Reduced data availability: 1 pg
inactive (PG_AVAILABILITY)
2022-10-28T03:43:30.389645+0200 mon.skarb (mon.0) 4719357 :
cluster [DBG] osdmap e18747: 9 total, 9 up, 9 in
2022-10-28T03:43:30.275554+0200 mgr.skarb (mgr.40364478) 162022 :
cluster [DBG] pgmap v164239: 407 pgs: 1
active+remapped+backfilling, 1 clean+premerge+peered, 1
active+clean+snaptrim, 41 active+remapped+backfill_wait, 363
active+clean; 3.5 TiB data, 11 TiB used, 19 TiB / 30 TiB avail;
233157/2837083 objects misplaced (8.218%)
And here is an overview of this situation:
root@skarb:/var/log/ceph# cat ceph.log.1.gz | gunzip | grep inactive
2022-10-27T22:40:32.115085+0200 mon.skarb (mon.0) 4717180 :
cluster [WRN] Health check failed: Reduced data availability: 2
pgs inactive (PG_AVAILABILITY)
2022-10-27T22:50:00.000198+0200 mon.skarb (mon.0) 4717320 :
cluster [WRN] Health detail: HEALTH_WARN Reduced data
availability: 2 pgs inactive
2022-10-27T22:50:00.000245+0200 mon.skarb (mon.0) 4717321 :
cluster [WRN] [WRN] PG_AVAILABILITY: Reduced data availability: 2
pgs inactive
2022-10-27T22:50:00.000263+0200 mon.skarb (mon.0) 4717322 :
cluster [WRN] pg 2.37 is stuck inactive for 10m, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-27T22:50:00.000307+0200 mon.skarb (mon.0) 4717323 :
cluster [WRN] pg 2.77 is stuck inactive for 10m, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-27T23:00:00.000173+0200 mon.skarb (mon.0) 4717460 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-27T23:10:00.000149+0200 mon.skarb (mon.0) 4717592 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-27T23:20:00.000140+0200 mon.skarb (mon.0) 4717718 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-27T23:30:00.000198+0200 mon.skarb (mon.0) 4717855 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-27T23:40:00.000133+0200 mon.skarb (mon.0) 4717988 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-27T23:50:00.000102+0200 mon.skarb (mon.0) 4718119 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T00:00:00.000129+0200 mon.skarb (mon.0) 4718259 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
root@skarb:/var/log/ceph# cat ceph.log | grep inactive
2022-10-28T00:10:00.000139+0200 mon.skarb (mon.0) 4718393 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T00:20:00.000265+0200 mon.skarb (mon.0) 4718525 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T00:30:00.004136+0200 mon.skarb (mon.0) 4718661 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T00:40:00.000131+0200 mon.skarb (mon.0) 4718741 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T00:50:00.000130+0200 mon.skarb (mon.0) 4718774 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:00:00.000094+0200 mon.skarb (mon.0) 4718805 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:10:00.000168+0200 mon.skarb (mon.0) 4718838 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:20:00.000146+0200 mon.skarb (mon.0) 4718873 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:30:00.000152+0200 mon.skarb (mon.0) 4718904 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:40:00.000151+0200 mon.skarb (mon.0) 4718935 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T01:50:00.000144+0200 mon.skarb (mon.0) 4718967 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T02:00:00.000170+0200 mon.skarb (mon.0) 4719004 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T02:10:00.000158+0200 mon.skarb (mon.0) 4719047 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive
2022-10-28T02:20:00.000109+0200 mon.skarb (mon.0) 4719078 :
cluster [WRN] Health detail: HEALTH_WARN Reduced data
availability: 2 pgs inactive; 3 pgs not deep-scrubbed in time
2022-10-28T02:20:00.000151+0200 mon.skarb (mon.0) 4719079 :
cluster [WRN] [WRN] PG_AVAILABILITY: Reduced data availability: 2
pgs inactive
2022-10-28T02:20:00.000170+0200 mon.skarb (mon.0) 4719080 :
cluster [WRN] pg 2.37 is stuck inactive for 3h, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-28T02:20:00.000185+0200 mon.skarb (mon.0) 4719081 :
cluster [WRN] pg 2.77 is stuck inactive for 3h, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-28T02:30:00.000152+0200 mon.skarb (mon.0) 4719118 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 3 pgs not deep-scrubbed in time
2022-10-28T02:40:00.000141+0200 mon.skarb (mon.0) 4719150 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 3 pgs not deep-scrubbed in time
2022-10-28T02:50:00.000163+0200 mon.skarb (mon.0) 4719178 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 3 pgs not deep-scrubbed in time
2022-10-28T03:00:00.000148+0200 mon.skarb (mon.0) 4719211 :
cluster [WRN] Health detail: HEALTH_WARN Reduced data
availability: 2 pgs inactive; 4 pgs not deep-scrubbed in time
2022-10-28T03:00:00.000172+0200 mon.skarb (mon.0) 4719212 :
cluster [WRN] [WRN] PG_AVAILABILITY: Reduced data availability: 2
pgs inactive
2022-10-28T03:00:00.000186+0200 mon.skarb (mon.0) 4719213 :
cluster [WRN] pg 2.37 is stuck inactive for 4h, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-28T03:00:00.000196+0200 mon.skarb (mon.0) 4719214 :
cluster [WRN] pg 2.77 is stuck inactive for 4h, current state
remapped+premerge+backfill_wait+peered, last acting [2,6,3]
2022-10-28T03:10:00.000128+0200 mon.skarb (mon.0) 4719252 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 4 pgs not deep-scrubbed in time
2022-10-28T03:20:00.000143+0200 mon.skarb (mon.0) 4719280 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 4 pgs not deep-scrubbed in time
2022-10-28T03:30:00.000150+0200 mon.skarb (mon.0) 4719305 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 4 pgs not deep-scrubbed in time
2022-10-28T03:40:00.000151+0200 mon.skarb (mon.0) 4719341 :
cluster [WRN] overall HEALTH_WARN Reduced data availability: 2 pgs
inactive; 4 pgs not deep-scrubbed in time
2022-10-28T03:43:30.381806+0200 mon.skarb (mon.0) 4719356 :
cluster [WRN] Health check update: Reduced data availability: 1 pg
inactive (PG_AVAILABILITY)
2022-10-28T03:43:32.444590+0200 mon.skarb (mon.0) 4719360 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced
data availability: 1 pg inactive)
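(A quicker way to list the affected PGs than grepping the cluster log
is
  ceph pg dump_stuck inactive
which prints the stuck PGs together with their current state and
acting set.)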
There are 9 OSDs, all of which were up the whole time. But the 9th
one was added just 2 days ago, and just before that 2 OSDs from one
host were destroyed and recreated. When the problem occurred no PG
was marked as undersized (the 3rd replica had been recreated by that
time). However, as you can see here, PGs are being merged due to
autoscaling, which is turned on.
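(If the merges during backfill turn out to be the trigger, the
autoscaler can be paused per pool while the cluster rebalances, e.g.
  ceph osd pool set <poolname> pg_autoscale_mode warn
with <poolname> as a placeholder; "warn" keeps the recommendation
visible without acting on it, and setting it back to "on" re-enables
it later.)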
Also, since the new OSD is an NVMe device, some minor changes were
made to the crush map (the default rule was limited to use only hdd
OSDs, and a new rule was created to pick an nvme OSD as the first one
for the new pool).
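(For verifying the new rule, dumping its steps and the members of the
nvme class should help:
  ceph osd crush rule dump first_nvme_next_hdd
  ceph osd crush class ls-osd nvme
Both commands are read-only, so they're safe to run anytime.)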
So, what happened here, and how can I prevent it from happening again?
Paweł
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx