Re: [Octopus] OSD overloading

Ashley Merrick <singapore@xxxxxxxxxxxxxx> · Thu, 09 Apr 2020 10:51:37 +0800

Are you sure you're not being hit by:

ceph config set osd bluestore_fsck_quick_fix_on_mount false (see https://docs.ceph.com/docs/master/releases/octopus/)

Have all your OSDs successfully completed the fsck?

The reason I ask is that I can see "20 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats"
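
For anyone wanting to run the same check on their own cluster, here is a minimal Python sketch that pulls the legacy-omap count and the OSD totals out of pasted `ceph status` text. The function names and the sample string are made up for illustration (the sample lines are copied from the output quoted below), and the regular expressions only assume the wording seen in this thread. Other Ceph releases may phrase the warning differently.

```python
import re

# A few lines copied from the `ceph status` output quoted in this thread.
SAMPLE_STATUS = """\
 health: HEALTH_WARN
 20 OSD(s) reporting legacy (not per-pool) BlueStore omap
usage stats
 1 osds down
 osd: 37 osds: 26 up (since 9m), 27 in (since 2h); 626 remapped pgs
"""

def legacy_omap_osds(status_text):
    """Return how many OSDs report legacy omap stats (0 if no warning)."""
    m = re.search(r"(\d+) OSD\(s\) reporting legacy", status_text)
    return int(m.group(1)) if m else 0

def osd_counts(status_text):
    """Return (total_osds, up_osds) from the services line, or None."""
    m = re.search(r"osd:\s+(\d+) osds:\s+(\d+) up", status_text)
    return (int(m.group(1)), int(m.group(2))) if m else None

legacy = legacy_omap_osds(SAMPLE_STATUS)
total, up = osd_counts(SAMPLE_STATUS)
print(f"{legacy} of {total} OSDs still report legacy omap stats ({up} up)")
```

If the count is non-zero, some OSDs have not been converted to per-pool omap accounting yet, which fits the suspicion that the fsck has not finished on every OSD.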

---- On Thu, 09 Apr 2020 02:15:02 +0800 Jack <ceph@xxxxxxxxxxxxxx> wrote ----

Just to confirm this does not get better: 

root@backup1:~# ceph status 
 cluster: 
 id:     9cd41f0f-936d-4b59-8e5d-9b679dae9140 
 health: HEALTH_WARN 
 20 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats 
 4/50952060 objects unfound (0.000%) 
 nobackfill,norecover,noscrub,nodeep-scrub flag(s) set 
 1 osds down 
 3 nearfull osd(s) 
 Reduced data availability: 826 pgs inactive, 616 pgs down, 185 pgs peering, 158 pgs stale 
 Low space hindering backfill (add storage if this doesn't resolve itself): 93 pgs backfill_toofull 
 Degraded data redundancy: 13285415/101904120 objects degraded (13.037%), 706 pgs degraded, 696 pgs undersized 
 989 pgs not deep-scrubbed in time 
 378 pgs not scrubbed in time 
 10 pool(s) nearfull 
 2216 slow ops, oldest one blocked for 13905 sec, daemons [osd.1,osd.11,osd.20,osd.24,osd.25,osd.29,osd.31,osd.37,osd.4,osd.5]... have slow ops. 

 services: 
 mon: 1 daemons, quorum backup1 (age 8d) 
 mgr: backup1(active, since 8d) 
 osd: 37 osds: 26 up (since 9m), 27 in (since 2h); 626 remapped pgs 
 flags nobackfill,norecover,noscrub,nodeep-scrub 
 rgw: 1 daemon active (backup1.odiso.net) 

 task status: 

 data: 
 pools:   10 pools, 2785 pgs 
 objects: 50.95M objects, 92 TiB 
 usage:   121 TiB used, 39 TiB / 160 TiB avail 
 pgs:     29.659% pgs not active 
 13285415/101904120 objects degraded (13.037%) 
 433992/101904120 objects misplaced (0.426%) 
 4/50952060 objects unfound (0.000%) 
 840 active+clean+snaptrim_wait 
 536 down 
 490 active+undersized+degraded+remapped+backfilling 
 326 active+clean 
 113 peering 
 88  active+undersized+degraded 
 83  active+undersized+degraded+remapped+backfill_toofull 
 79  stale+down 
 63  stale+peering 
 51  active+clean+snaptrim 
 24  activating 
 22  active+recovering+degraded 
 19  active+remapped+backfilling 
 13  stale+active+undersized+degraded 
 9   remapped+peering 
 9   active+undersized+remapped+backfilling 
 9   active+undersized+degraded+remapped+backfill_wait+backfill_toofull 
 2   stale+active+clean+snaptrim 
 2   active+undersized 
 1   stale+active+clean+snaptrim_wait 
 1   active+remapped+backfill_toofull 
 1   active+clean+snaptrim_wait+laggy 
 1   active+recovering+undersized+remapped 
 1   down+remapped 
 1   activating+undersized+degraded+remapped 
 1   active+recovering+laggy 
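
As a sanity check on the numbers above, the "826 pgs inactive" and "29.659% pgs not active" figures can be reproduced from the per-state PG counts. The sketch below is only an illustration: it assumes a PG counts as inactive when its state string has no `active` token (so `activating` and `peering` count as inactive), which is an assumption that happens to match this output and not a statement of how Ceph computes it internally. The helper names are hypothetical.

```python
# Per-state PG counts copied from the `ceph status` output above.
PG_STATES = """\
840 active+clean+snaptrim_wait
536 down
490 active+undersized+degraded+remapped+backfilling
326 active+clean
113 peering
88 active+undersized+degraded
83 active+undersized+degraded+remapped+backfill_toofull
79 stale+down
63 stale+peering
51 active+clean+snaptrim
24 activating
22 active+recovering+degraded
19 active+remapped+backfilling
13 stale+active+undersized+degraded
9 remapped+peering
9 active+undersized+remapped+backfilling
9 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
2 stale+active+clean+snaptrim
2 active+undersized
1 stale+active+clean+snaptrim_wait
1 active+remapped+backfill_toofull
1 active+clean+snaptrim_wait+laggy
1 active+recovering+undersized+remapped
1 down+remapped
1 activating+undersized+degraded+remapped
1 active+recovering+laggy
"""

def parse_pg_states(text):
    """Map each PG state string to its count."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = counts.get(parts[1], 0) + int(parts[0])
    return counts

def is_active(state):
    """Assumption: a PG is active only if 'active' is one of its '+' tokens."""
    return "active" in state.split("+")

counts = parse_pg_states(PG_STATES)
total = sum(counts.values())
inactive = sum(n for state, n in counts.items() if not is_active(state))
pct_inactive = round(100 * inactive / total, 3)
print(f"{inactive} of {total} PGs inactive ({pct_inactive}%)")
```

The totals agree with the summary lines: 2785 PGs overall, 826 of them inactive, or 29.659%.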

On 4/8/20 3:27 PM, Jack wrote: 
> The CPU is used by userspace, not kernelspace 
> 
> Here is the perf top, see attachment 
> 
> Rocksdb eats everything :/ 
> 
> 
> On 4/8/20 3:14 PM, Paul Emmerich wrote: 
>> What's the CPU busy with while spinning at 100%? 
>> 
>> Check "perf top" for a quick overview 
>> 
>> 
>> Paul 
>> 
> 
> 
> _______________________________________________ 
> ceph-users mailing list -- ceph-users@xxxxxxx 
> To unsubscribe send an email to ceph-users-leave@xxxxxxx 
> 