Hi there, it's 1.2%, not 1200%.

On Wed, Mar 6, 2019 at 4:36 PM Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I'm still seeing this issue during failure testing of a new Mimic 13.2.4
> cluster. To reproduce:
>
> - Working Mimic 13.2.4 cluster
> - Pull a disk
> - Wait for recovery to complete (i.e. back to HEALTH_OK)
> - Remove the OSD with `ceph osd crush remove`
> - See greater than 100% degraded objects while it recovers, as below
>
> It doesn't seem to do any harm; once recovery completes, the cluster
> returns to HEALTH_OK.
> The only tracker bug I can find that seems to cover this behaviour is
> 21803, which is marked as resolved.
>
> Simon
>
>   cluster:
>     id:     MY ID
>     health: HEALTH_WARN
>             709/58572 objects misplaced (1.210%)
>             Degraded data redundancy: 90094/58572 objects degraded (153.818%),
>             49 pgs degraded, 51 pgs undersized
>
>   services:
>     mon: 3 daemons, quorum san2-mon1,san2-mon2,san2-mon3
>     mgr: san2-mon1(active), standbys: san2-mon2, san2-mon3
>     osd: 52 osds: 52 up, 52 in; 84 remapped pgs
>
>   data:
>     pools:   16 pools, 2016 pgs
>     objects: 19.52 k objects, 72 GiB
>     usage:   7.8 TiB used, 473 TiB / 481 TiB avail
>     pgs:     90094/58572 objects degraded (153.818%)
>              709/58572 objects misplaced (1.210%)
>              1932 active+clean
>              47   active+recovery_wait+undersized+degraded+remapped
>              33   active+remapped+backfill_wait
>              2    active+recovering+undersized+remapped
>              1    active+recovery_wait+undersized+degraded
>              1    active+recovering+undersized+degraded+remapped
>
>   io:
>     client:   24 KiB/s wr, 0 op/s rd, 3 op/s wr
>     recovery: 0 B/s, 126 objects/s
>
>
> On 13/10/2017 18:53, David Zafman wrote:
> >
> > I improved the code to compute degraded objects during backfill/recovery.
> > During my testing it wouldn't result in a percentage above 100%. I'll
> > have to look at the code and verify that some subsequent changes didn't
> > break things.
> >
> > David
> >
> >
> > On 10/13/17 9:55 AM, Florian Haas wrote:
> >>>>> Okay, in that case I've no idea. What was the timeline for the
> >>>>> recovery versus the rados bench and cleanup versus the degraded
> >>>>> object counts, then?
> >>>> 1. Jewel deployment with filestore.
> >>>> 2. Upgrade to Luminous (including mgr deployment and "ceph osd
> >>>>    require-osd-release luminous"), still on filestore.
> >>>> 3. rados bench with subsequent cleanup.
> >>>> 4. All OSDs up, all PGs active+clean.
> >>>> 5. Stop one OSD. Remove from CRUSH, auth list, OSD map.
> >>>> 6. Reinitialize OSD with bluestore.
> >>>> 7. Start OSD, commencing backfill.
> >>>> 8. Degraded objects above 100%.
> >>>>
> >>>> Please let me know if that information is useful. Thank you!
> >>>
> >>> Hmm, that does leave me a little perplexed.
> >> Yeah exactly, me too. :)
> >>
> >>> David, do we maybe do something with degraded counts based on the
> >>> number of objects identified in pg logs? Or some other heuristic for
> >>> the number of objects that might be stale? That's the only way I can
> >>> think of to get these weird returning sets.
> >> One thing that just crossed my mind: would it make a difference whether
> >> or not the OSD goes out in the time window between it going down and
> >> being deleted from the crushmap/osdmap? I think it shouldn't (whether
> >> marked out or simply non-existent, it's not eligible for holding any
> >> data either way), but I'm not really sure about the mechanics of the
> >> internals here.
> >>
> >> Cheers,
> >> Florian
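For anyone retracing the above, here is a minimal sketch of the reproduction
sequence, pieced together from Simon's bullet list and steps 5-6 of Florian's
timeline. It is illustrative rather than a verified procedure: the OSD id
(osd.3), host, and systemd unit name are placeholders and will differ per
cluster.

    # Stop the OSD whose disk was pulled (osd.3 is a placeholder id).
    systemctl stop ceph-osd@3

    # Wait until recovery finishes and the cluster reports HEALTH_OK again.
    watch ceph -s

    # Remove the dead OSD from the CRUSH map, the auth database and the
    # OSD map (the removal step after which the >100% degraded count was
    # observed in the thread).
    ceph osd crush remove osd.3
    ceph auth del osd.3
    ceph osd rm 3

    # Watch the degraded/misplaced counters while the cluster rebalances;
    # this is where figures like "90094/58572 objects degraded (153.818%)"
    # show up.
    ceph -s
    ceph pg stat

As a side note on the numbers themselves: 58,572 is exactly 3 x 19,524, which
matches the ~19.52 k objects reported, so the denominator appears to be total
object copies under 3x replication; the oddity the thread is about is the
degraded numerator (90,094) exceeding even that, giving 153.818%.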