Yep, I am.

The issue is solved now .. and by solved, brace yourselves, I mean I had to recreate all OSDs.

As the cluster would not heal itself (because of the original issue), I had to drop every rados pool, stop all OSDs, and destroy & recreate them ..

Yeah, well, hum. There is definitely an underlying issue there. Those OSDs were created and upgraded since Luminous. I have no more clue about the bug. Sadly, there is only so much downtime I can afford on this cluster.

Anyway ..

On 4/9/20 4:51 AM, Ashley Merrick wrote:
> Are you sure you're not being hit by:
>
> ceph config set osd bluestore_fsck_quick_fix_on_mount false @ https://docs.ceph.com/docs/master/releases/octopus/
>
> Have all your OSDs successfully completed the fsck?
>
> Reason I say that is I can see "20 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats"
>
>
> ---- On Thu, 09 Apr 2020 02:15:02 +0800 Jack <ceph@xxxxxxxxxxxxxx> wrote ----
>
> Just to confirm this does not get better:
>
> root@backup1:~# ceph status
>   cluster:
>     id:     9cd41f0f-936d-4b59-8e5d-9b679dae9140
>     health: HEALTH_WARN
>             20 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
>             4/50952060 objects unfound (0.000%)
>             nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>             1 osds down
>             3 nearfull osd(s)
>             Reduced data availability: 826 pgs inactive, 616 pgs down, 185 pgs peering, 158 pgs stale
>             Low space hindering backfill (add storage if this doesn't resolve itself): 93 pgs backfill_toofull
>             Degraded data redundancy: 13285415/101904120 objects degraded (13.037%), 706 pgs degraded, 696 pgs undersized
>             989 pgs not deep-scrubbed in time
>             378 pgs not scrubbed in time
>             10 pool(s) nearfull
>             2216 slow ops, oldest one blocked for 13905 sec, daemons [osd.1,osd.11,osd.20,osd.24,osd.25,osd.29,osd.31,osd.37,osd.4,osd.5]... have slow ops.
>
>   services:
>     mon: 1 daemons, quorum backup1 (age 8d)
>     mgr: backup1(active, since 8d)
>     osd: 37 osds: 26 up (since 9m), 27 in (since 2h); 626 remapped pgs
>          flags nobackfill,norecover,noscrub,nodeep-scrub
>     rgw: 1 daemon active (backup1.odiso.net)
>
>   task status:
>
>   data:
>     pools:   10 pools, 2785 pgs
>     objects: 50.95M objects, 92 TiB
>     usage:   121 TiB used, 39 TiB / 160 TiB avail
>     pgs:     29.659% pgs not active
>              13285415/101904120 objects degraded (13.037%)
>              433992/101904120 objects misplaced (0.426%)
>              4/50952060 objects unfound (0.000%)
>              840 active+clean+snaptrim_wait
>              536 down
>              490 active+undersized+degraded+remapped+backfilling
>              326 active+clean
>              113 peering
>              88  active+undersized+degraded
>              83  active+undersized+degraded+remapped+backfill_toofull
>              79  stale+down
>              63  stale+peering
>              51  active+clean+snaptrim
>              24  activating
>              22  active+recovering+degraded
>              19  active+remapped+backfilling
>              13  stale+active+undersized+degraded
>              9   remapped+peering
>              9   active+undersized+remapped+backfilling
>              9   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>              2   stale+active+clean+snaptrim
>              2   active+undersized
>              1   stale+active+clean+snaptrim_wait
>              1   active+remapped+backfill_toofull
>              1   active+clean+snaptrim_wait+laggy
>              1   active+recovering+undersized+remapped
>              1   down+remapped
>              1   activating+undersized+degraded+remapped
>              1   active+recovering+laggy
>
> On 4/8/20 3:27 PM, Jack wrote:
>> The CPU is used by userspace, not kernelspace
>>
>> Here is the perf top, see attachment
>>
>> Rocksdb eats everything :/
>>
>> On 4/8/20 3:14 PM, Paul Emmerich wrote:
>>> What's the CPU busy with while spinning at 100%?
>>>
>>> Check "perf top" for a quick overview
>>>
>>> Paul
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
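For anyone landing on this thread with the same "legacy (not per-pool) BlueStore omap usage stats" warning, here is a rough sketch of the checks discussed above, assuming a Nautilus/Octopus-era ceph CLI; osd.11 is only an example daemon id and the grep patterns may need adjusting to your exact health output:

# Has bluestore_fsck_quick_fix_on_mount been overridden in the cluster config?
ceph config dump | grep bluestore_fsck_quick_fix_on_mount

# Value actually in effect on a running OSD (osd.11 is just an example)
ceph config show osd.11 | grep bluestore_fsck_quick_fix_on_mount

# Which OSDs still report legacy (not per-pool) omap usage stats
ceph health detail | grep -i omap

# Per Paul's suggestion: a quick system-wide look at what the busy ceph-osd
# processes are spending CPU on (run as root on the OSD host)
perf top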