On Mon, Apr 20, 2020 at 11:01 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
On a cluster running Nautilus (14.2.8), we are getting a health warning about
a PG not being deep-scrubbed on time. Looking at the primary OSD's
logs, it looks like it tries to deep-scrub the PG every hour or so and
emits some messages that I don't understand, but the deep scrub never
finishes, either with or without a scrub error.
Here is the PG from pg dump:
PG_STAT                  1.43f
OBJECTS                  31794
MISSING_ON_PRIMARY       0
DEGRADED                 0
MISPLACED                0
UNFOUND                  0
BYTES                    66930087214
OMAP_BYTES               0
OMAP_KEYS                0
LOG / DISK_LOG           3004 / 3004
STATE                    active+clean+scrubbing+deep
STATE_STAMP              2020-04-20 04:48:13.055481
VERSION / REPORTED       46286'483734 / 46286:563439
UP / UP_PRIMARY          [354,694,851] / 354
ACTING / ACTING_PRIMARY  [354,694,851] / 354
LAST_SCRUB               39594'348643 (2020-04-10 12:39:16.261088)
LAST_DEEP_SCRUB          38482'314644 (2020-04-04 10:37:03.121638)
SNAPTRIMQ_LEN            0
Here is a section of the primary OSD logs (osd.354):
2020-04-18 08:21:08.322 7fffd2e2b700 0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
Scrubbing starts.
2020-04-18 08:37:53.362 7fffd2e2b700 1 osd.354 pg_epoch: 45910
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
pi=[45908,45910)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 08:37:53.362 7fffd2e2b700 1 osd.354 pg_epoch: 45910
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
pi=[45908,45910)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:38:01.002 7fffd2e2b700 1 osd.354 pg_epoch: 45912
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
pi=[45908,45912)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
Epoch (and therefore map) change from 45910 to 45912. You lost osd 694
from the acting set, so peering has to restart and the scrub is
aborted.
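To see how often this is happening, you could count the interval resets
against the deep-scrub attempts in osd.354's log, e.g. something like the
following (assuming the default /var/log/ceph log path on that host):

  # how many times pg 1.43f re-peered vs. how many deep-scrubs were started
  grep '1\.43f' /var/log/ceph/ceph-osd.354.log | grep -c start_peering_interval
  grep -c '1\.43f deep-scrub starts' /var/log/ceph/ceph-osd.354.log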
2020-04-18 08:38:01.002 7fffd2e2b700 1 osd.354 pg_epoch: 45912
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
pi=[45908,45912)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:40:04.219 7fffd2e2b700 0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
Scrubbing starts again.
2020-04-18 08:56:49.095 7fffd2e2b700 1 osd.354 pg_epoch: 45914
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
pi=[45912,45914)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 08:56:49.095 7fffd2e2b700 1 osd.354 pg_epoch: 45914
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
pi=[45912,45914)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:55.627 7fffd2e2b700 1 osd.354 pg_epoch: 45916
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f
45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916
pi=[45912,45916)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
Same again: you lost osd 694 from the acting set, and the epoch/map change
forces re-peering and aborts the scrub.
You need to identify why osd 694 is flapping, or at least appears to
be flapping to the monitor.
Start by having a careful look at the logs for osd 694. If they
provide no insight, look at the logs on the active monitor for any
reference to osd 694 around the time the epoch changes.
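Something along these lines might help narrow it down (log paths assume the
default /var/log/ceph layout and the file names will differ per host and
monitor id):

  # osd.694's own log: heartbeat failures, or "wrongly marked me down"
  # after a spurious failure report
  grep -Ei 'heartbeat|wrongly marked me down' /var/log/ceph/ceph-osd.694.log

  # active monitor's log: who reported osd.694 failed, and when it was
  # marked down / booted again
  grep 'osd.694' /var/log/ceph/ceph-mon.<mon-id>.log | grep -Ei 'fail|down|boot'

  # osdmap view of osd.694's recent up/down history (up_from/up_thru/down_at)
  ceph osd dump | grep '^osd.694 '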
2020-04-18 08:56:55.627 7fffd2e2b700 1 osd.354 pg_epoch: 45916
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f
45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916
pi=[45912,45916)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:56.867 7fffd2e2b700 0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
2020-04-18 09:13:37.419 7fffd2e2b700 1 osd.354 pg_epoch: 45918
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f
45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918
pi=[45916,45918)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 09:13:37.419 7fffd2e2b700 1 osd.354 pg_epoch: 45918
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f
45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918
pi=[45916,45918)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 09:13:44.959 7fffd2e2b700 1 osd.354 pg_epoch: 45920
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f
45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920
pi=[45916,45920)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
2020-04-18 09:13:44.959 7fffd2e2b700 1 osd.354 pg_epoch: 45920
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f
45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920
pi=[45916,45920)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
I've tried restarting all 3 OSDs in question, but it didn't help. I'm
running the upmap balancer on this cluster (if that matters). Any ideas?
Andras