Re: [Octopus] OSD overloading

Igor Fedotov <ifedotov@xxxxxxx> · Mon, 13 Apr 2020 13:20:11 +0300

Given the symptoms high CPU usage within RocksDB and corresponding 
slowdown were presumably caused by RocksDB fragmentation.

And temporary workaround would be to do manual DB compaction using  
ceph-kvstore-tool's compact command.

Thanks,

Igor

On 4/13/2020 1:01 AM, Jack wrote:
Yep I am

The issue is solved now .. and by solved, brace yourselves, I mean I had
to recreate all OSDs

And this the cluster would not heal itself (because of the original
issue), I had to drop every rados pool, stop all OSDs, destroy &
recreate them ..
Yeah, well, hum

There is definitly an underlying issue there
Those OSDs were created and upgraded since Luminous

I have no more cue on the bug
Sadly, there is only so much downtime I can afford on this cluster

Anyway ..

On 4/9/20 4:51 AM, Ashley Merrick wrote:
Are you sure your not being hit by:

ceph config set osd bluestore_fsck_quick_fix_on_mount false @ https://docs.ceph.com/docs/master/releases/octopus/

Have all your OSD's successfully completed the fsck?

Reasons I say that is I can see "20 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats"

---- On Thu, 09 Apr 2020 02:15:02 +0800 Jack <mailto:ceph@xxxxxxxxxxxxxx> wrote ----

Just to confirm this does not get better:

root@backup1:~# ceph status
  cluster:
  id:     9cd41f0f-936d-4b59-8e5d-9b679dae9140
  health: HEALTH_WARN
  20 OSD(s) reporting legacy (not per-pool) BlueStore omap
usage stats
  4/50952060 objects unfound (0.000%)
  nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
  1 osds down
  3 nearfull osd(s)
  Reduced data availability: 826 pgs inactive, 616 pgs down,
185 pgs peering, 158 pgs stale
  Low space hindering backfill (add storage if this doesn't
resolve itself): 93 pgs backfill_toofull
  Degraded data redundancy: 13285415/101904120 objects
degraded (13.037%), 706 pgs degraded, 696 pgs undersized
  989 pgs not deep-scrubbed in time
  378 pgs not scrubbed in time
  10 pool(s) nearfull
  2216 slow ops, oldest one blocked for 13905 sec, daemons
[osd.1,osd.11,osd.20,osd.24,osd.25,osd.29,osd.31,osd.37,osd.4,osd.5]...
have slow ops.

  services:
  mon: 1 daemons, quorum backup1 (age 8d)
  mgr: backup1(active, since 8d)
  osd: 37 osds: 26 up (since 9m), 27 in (since 2h); 626 remapped pgs
  flags nobackfill,norecover,noscrub,nodeep-scrub
  rgw: 1 daemon active (backup1.odiso.net)

  task status:

  data:
  pools:   10 pools, 2785 pgs
  objects: 50.95M objects, 92 TiB
  usage:   121 TiB used, 39 TiB / 160 TiB avail
  pgs:     29.659% pgs not active
  13285415/101904120 objects degraded (13.037%)
  433992/101904120 objects misplaced (0.426%)
  4/50952060 objects unfound (0.000%)
  840 active+clean+snaptrim_wait
  536 down
  490 active+undersized+degraded+remapped+backfilling
  326 active+clean
  113 peering
  88  active+undersized+degraded
  83  active+undersized+degraded+remapped+backfill_toofull
  79  stale+down
  63  stale+peering
  51  active+clean+snaptrim
  24  activating
  22  active+recovering+degraded
  19  active+remapped+backfilling
  13  stale+active+undersized+degraded
  9   remapped+peering
  9   active+undersized+remapped+backfilling
  9
active+undersized+degraded+remapped+backfill_wait+backfill_toofull
  2   stale+active+clean+snaptrim
  2   active+undersized
  1   stale+active+clean+snaptrim_wait
  1   active+remapped+backfill_toofull
  1   active+clean+snaptrim_wait+laggy
  1   active+recovering+undersized+remapped
  1   down+remapped
  1   activating+undersized+degraded+remapped
  1   active+recovering+laggy

On 4/8/20 3:27 PM, Jack wrote:
The CPU is used by userspace, not kernelspace

Here is the perf top, see attachment

Rocksdb eats everything :/

On 4/8/20 3:14 PM, Paul Emmerich wrote:
What's the CPU busy with while spinning at 100%?

Check "perf top" for a quick overview

Paul

_______________________________________________
ceph-users mailing list -- mailto:ceph-users@xxxxxxx
To unsubscribe send an email to mailto:ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- mailto:ceph-users@xxxxxxx
To unsubscribe send an email to mailto:ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx