Re: PG deep-scrub does not finish

Hi Brad,

Indeed - osd.694 kept crashing with a read error (a medium error on the hard drive) and got restarted by systemd, so the net result was that the system spent a week in an infinite loop of deep-scrub attempts on the PG.  Typically when a scrub encounters a read error I get an inconsistent placement group, not an OSD crash and an endless cycle of scrub attempts.  With an inconsistent placement group, a pg repair fixes the read error (by reallocating the sector inside the drive).
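For comparison, the normal read-error path looks roughly like this (just a sketch, using pg 1.43f as a stand-in):

ceph health detail                                      # would normally show "pg 1.43f is active+clean+inconsistent"
rados list-inconsistent-obj 1.43f --format=json-pretty  # identify the object/shard with the read error
ceph pg repair 1.43f                                    # rewrite the bad copy; the drive remaps the sector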

Here is the stack trace of the osd.694 crash on the scrub read error:

Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Sense Key : Medium Error [current] [descriptor]
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 Add. Sense: Unrecovered read error
Apr 19 03:39:17 popeye-oss-3-03 kernel: sd 14:0:27:0: [sdz] tag#1 CDB: Read(10) 28 00 6e a9 7a 30 00 00 80 00
Apr 19 03:39:17 popeye-oss-3-03 kernel: print_req_error: critical medium error, dev sdz, sector 14852804992
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2020-04-19 03:39:17.095 7fffd2e2b700 -1 bluestore(/var/lib/ceph/osd/ceph-694) _do_read bdev-read failed: (61) No data available
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_read(BlueStore::Collection*, BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, uint64_t)' thread 7fffd2e2b700 time 2020-04-19 03:39:17.099677
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/os/bluestore/BlueStore.cc: 9214: FAILED ceph_assert(r == 0)
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x555555a1ea4d]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 2: (()+0x4cac15) [0x555555a1ec15]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 3: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x3512) [0x555555f64132]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 4: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1b8) [0x555555f64778]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 5: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x2c2) [0x555555deb832]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 6: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x663) [0x555555d082c3]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 7: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x8b) [0x555555bbaacb]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 8: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x181c) [0x555555be4fcc]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 9: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0x4bb) [0x555555be61db]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 10: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x12) [0x555555d8c7b2]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x555555b1898f]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x5555560bd056]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5555560bfb70]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 14: (()+0x7e65) [0x7ffff5025e65]
Apr 19 03:39:17 popeye-oss-3-03 ceph-osd: 15: (clone()+0x6d) [0x7ffff3ee988d]
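For completeness, the medium error can also be confirmed from the drive's SMART data before deciding what to do with the disk - something along these lines (a sketch; /dev/sdz is the device named in the kernel messages above):

smartctl -H /dev/sdz        # overall health self-assessment
smartctl -l error /dev/sdz  # error log; should show the unrecovered read errors
smartctl -A /dev/sdz        # attributes such as reallocated/pending sector counts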

I ended up recreating the OSD (and thus overwriting all data) to fix the issue.
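For anyone hitting the same thing, recreating an OSD boils down to something like the following (a rough sketch only; the ID and device path are placeholders and details depend on the deployment):

ceph osd out 694                             # start draining the OSD
# wait for backfill/recovery to finish, then:
systemctl stop ceph-osd@694
ceph osd purge 694 --yes-i-really-mean-it    # remove it from the CRUSH and OSD maps
ceph-volume lvm zap /dev/sdz --destroy       # wipe the old device
ceph-volume lvm create --data /dev/sdz       # build a fresh OSD; it backfills from the other replicas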

Andras


On 4/20/20 9:28 PM, Brad Hubbard wrote:
On Mon, Apr 20, 2020 at 11:01 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
On a cluster running Nautilus (14.2.8), we are getting a complaint about
a PG not being deep-scrubbed on time.  Looking at the primary OSD's
logs, it looks like it tries to deep-scrub the PG every hour or so,
emits some complaints that I don't understand, but the deep scrub does
not finish (either with or without a scrub error).

Here is the PG from pg dump:

1.43f     31794                  0        0         0       0
66930087214           0          0 3004     3004
active+clean+scrubbing+deep 2020-04-20 04:48:13.055481 46286'483734
46286:563439 [354,694,851]        354 [354,694,851]            354
39594'348643 2020-04-10 12:39:16.261088    38482'314644 2020-04-04
10:37:03.121638             0

Here is a section of the primary OSD logs (osd.354):

2020-04-18 08:21:08.322 7fffd2e2b700  0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
Scrubbing starts.

2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
pi=[45908,45910)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
pi=[45908,45910)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
pi=[45908,45912)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
Epoch (and therefore map) change from 45910 to 45912. You lost osd 694
from the acting set, so peering has to restart and the scrub is
aborted.
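You can watch these interval changes from the outside with something like the following (a sketch; the jq paths assume the usual pg query layout):

ceph pg 1.43f query | jq '.info.history.same_interval_since'  # epoch at which the current interval started
ceph pg 1.43f query | jq '.acting, .up'                       # current acting and up sets
ceph osd dump | grep '^osd.694 '                              # current state/weight of the OSD in the map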

2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912
pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
pi=[45908,45912)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:40:04.219 7fffd2e2b700  0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
Scrubbing starts again.

2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
pi=[45912,45914)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
pi=[45912,45914)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:55.627 7fffd2e2b700  1 osd.354 pg_epoch: 45916
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f
45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916
pi=[45912,45916)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
Same again: you lost osd 694 from the acting set, and the epoch and map
change requires re-peering, which aborts the scrub.

You need to identify why osd 694 is flapping, or at least appears to
be flapping to the monitor.

Start by having a careful look at the logs for osd 694. If they
provide no insight, look at the logs on the active monitor for any
reference to osd 694 around the time the epoch changes.
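Roughly, assuming a systemd deployment and default log locations, something like:

journalctl -u ceph-osd@694 --since "2020-04-18"    # daemon output, including any crash backtraces
ceph crash ls                                      # crashes captured by the crash module
grep 'osd.694' /var/log/ceph/ceph-mon.*.log        # monitor's view: failure reports, boots, mark-downs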

2020-04-18 08:56:55.627 7fffd2e2b700  1 osd.354 pg_epoch: 45916
pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f
45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916
pi=[45912,45916)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:56.867 7fffd2e2b700  0 log_channel(cluster) log [DBG] :
1.43f deep-scrub starts
2020-04-18 09:13:37.419 7fffd2e2b700  1 osd.354 pg_epoch: 45918
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f
45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918
pi=[45916,45918)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0
active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
[354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
upacting 4611087854031667199
2020-04-18 09:13:37.419 7fffd2e2b700  1 osd.354 pg_epoch: 45918
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f
45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918
pi=[45916,45918)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary
2020-04-18 09:13:44.959 7fffd2e2b700  1 osd.354 pg_epoch: 45920
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f
45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920
pi=[45916,45920)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0
active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
4611087854031667199
2020-04-18 09:13:44.959 7fffd2e2b700  1 osd.354 pg_epoch: 45920
pg[1.43f( v 45917'449808 (45812'446725,45917'449808]
local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f
45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920
pi=[45916,45920)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown
mbc={}] state<Start>: transitioning to Primary

I've tried restarting all 3 OSDs in question; it didn't help.  I'm
running the upmap balancer on this cluster (if that matters).  Any ideas?

Andras


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



