Re: Negative number of objects degraded for extended period of time

Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> · Mon, 24 Nov 2014 17:47:19 -0800

To effectively disable RadosGW GC, you could bump rgw_gc_obj_min_wait to something really big. If you set it to a week, you should have a week with no GC. When you return it to normal, it should just need a couple of passes, depending on how much stuff you delete while GC is stalled.

injectargs doesn't appear to work with radosgw (at least, I can't figure out how to do it), so you'll have to edit ceph.conf and restart all of your radosgw daemons.
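As a rough illustration of the week-long setting suggested above, this Python sketch computes the value in seconds and prints a fragment that could go into ceph.conf. The section name client.radosgw.gateway is an assumption for illustration only; use whatever section your radosgw instances already have.

```python
from datetime import timedelta

# rgw_gc_obj_min_wait is expressed in seconds (Fred's current value is 7200).
ONE_WEEK_SECONDS = int(timedelta(weeks=1).total_seconds())


def gc_wait_snippet(section: str, seconds: int) -> str:
    """Return a ceph.conf fragment that sets rgw_gc_obj_min_wait."""
    return f"[{section}]\n    rgw_gc_obj_min_wait = {seconds}\n"


# "client.radosgw.gateway" is a hypothetical section name.
snippet = gc_wait_snippet("client.radosgw.gateway", ONE_WEEK_SECONDS)
print(snippet)
```

After editing ceph.conf, the radosgw daemons still need a restart for the value to take effect, as noted above.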

I think that you should have completed some of those backfills by now.  The OSDs that are currently backfilling, are they doing a lot of IO?  If they're doing practically nothing, I'd restart those ceph-osd daemons.  

You have a large number of PGs for the number of OSDs you have, over 1000 per OSD. An excessive number of PGs can cause memory pressure on the ceph-osd daemons. You're not having any performance problems while this is going on?
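For reference, the "over 1000 each" figure can be checked from the numbers in the status output quoted further down. This is only a sketch: the replica count is not stated anywhere in the thread, so it is approximated here from the ratio of raw space used to logical data, which is an assumption.

```python
# Figures taken from the pgmap lines quoted in this thread.
total_pgs = 44816
num_osds = 36
data_gb = 2308   # logical data
used_gb = 4664   # raw space used

# Assumption: used / data approximates the average replica count across pools.
approx_replicas = used_gb / data_gb

pgs_per_osd = total_pgs / num_osds
pg_copies_per_osd = pgs_per_osd * approx_replicas

print(f"PGs per OSD (ignoring replication): {pgs_per_osd:.0f}")
print(f"Approximate replica count: {approx_replicas:.2f}")
print(f"PG copies per OSD (approximate): {pg_copies_per_osd:.0f}")
```

Even before replication is taken into account, each OSD carries well over 1000 PGs.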

On Mon, Nov 24, 2014 at 8:36 AM, Fred Yang <frederic.yang@xxxxxxxxx> wrote:
Well, after another 2 weeks, the backfilling is still going, although it did drop (or increase?) to -0.284%. If I count going from -0.925% to -0.284% as roughly 70% complete, there is probably one more week to go:

2014-11-24 11:11:56.445007 mon.0 [INF] pgmap v6805197: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 2308 GB data, 4664 GB used, 15445 GB / 20109 GB avail; 96114 B/s rd, 122 MB/s wr, 360 op/s; -5419/1906450 objects degraded (-0.284%)
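The estimate above can be checked with a simple linear extrapolation between the two pgmap samples in this thread. This is a sketch under a strong assumption: that the degraded percentage moves toward zero at a constant rate, which may not hold while garbage collection and client writes keep changing the object count.

```python
from datetime import datetime

# Two samples taken from the pgmap lines quoted in this thread.
t0, pct0 = datetime(2014, 11, 13, 10, 43, 7), -0.925
t1, pct1 = datetime(2014, 11, 24, 11, 11, 56), -0.284

elapsed_days = (t1 - t0).total_seconds() / 86400

# Fraction of the distance from pct0 to zero that has been covered so far.
fraction_done = (pct0 - pct1) / pct0

# Percentage points gained per day (positive means moving toward zero).
rate_per_day = (pct1 - pct0) / elapsed_days

# Assumption: the same rate continues until the value reaches zero.
days_remaining = -pct1 / rate_per_day

print(f"elapsed: {elapsed_days:.1f} days")
print(f"fraction done: {fraction_done:.1%}")
print(f"estimated days remaining: {days_remaining:.1f}")
```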

Yes, I can let it run since it's not a production environment. If this were a production environment, I would have to answer more quickly what the cause is, and how to tune the pace so the cluster returns to a healthy state faster, rather than just crossing my fingers and letting it run.

I suspect it might be caused by the default garbage collection settings being too low to handle a large amount of pending object deletions:

  "rgw_gc_max_objs": "32",
  "rgw_gc_obj_min_wait": "7200",
  "rgw_gc_processor_max_time": "3600",
  "rgw_gc_processor_period": "3600",

However, after increasing rgw_gc_max_objs to 1024, I'm actually seeing objects degraded go from -0.284% to 0.301%, which suggests this is actually garbage collector contention between multiple radosgw servers.

I'm having trouble finding documentation on how radosgw garbage collection works, and on how to disable the garbage collector for some of the radosgw instances to prove that's the issue.

Yehuda Sadeh mentioned back in 2012 that "we may also want to explore doing that as part of a bigger garbage collection scheme that we'll soon be working on" in the thread below:
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7927

I'm hoping he can give some insight into this.

Fred

On Mon, Nov 17, 2014 at 5:36 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
Well, after 4 days, this is probably moot.  Hopefully it's finished backfilling, and your problem is gone.

If not, I believe that if you fix those backfill_toofull, the negative numbers will start approaching zero.  I seem to recall that negative degraded is a special case of degraded, but I don't remember exactly, and can't find any references.  I have seen it before, and it went away when my cluster became healthy.

As long as you still have OSDs completing their backfilling, I'd let it run.  

If you get to the point that all of the backfills are done, and you're left with only wait_backfill+backfill_toofull, then you can bump osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe osd_failsafe_nearfull_ratio.  If you do, be careful, and only bump them just enough to let them start backfilling.  If you set them to 0.99, bad things will happen.
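To illustrate "just enough", here is a small Python sketch that computes a cautious bump and builds (but does not run) an injectargs command for the OSDs. The starting value of 0.85, the 0.02 step, and the 0.92 ceiling are all assumptions chosen for illustration; check the current value on your own cluster first, and confirm the command form against your Ceph version.

```python
def bumped_ratio(current: float, step: float = 0.02, ceiling: float = 0.92) -> float:
    """Raise a full ratio by a small step, never going past a conservative ceiling."""
    if not 0.0 < current < ceiling:
        raise ValueError("current ratio must be between 0 and the ceiling")
    return round(min(current + step, ceiling), 2)


def injectargs_command(option: str, value: float) -> str:
    """Build the command string only; nothing is executed here."""
    return f"ceph tell osd.* injectargs '--{option} {value}'"


# Assumption: osd_backfill_full_ratio is currently 0.85 on this cluster.
new_ratio = bumped_ratio(0.85)
command = injectargs_command("osd_backfill_full_ratio", new_ratio)
print(command)
```

Depending on the shell, osd.* may need quoting or escaping. The ceiling guard is there so a typo cannot produce something like 0.99.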

On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang <frederic.yang@xxxxxxxxx> wrote:
Hi,

A few OSDs in the Ceph cluster we are running approached 95% full a little over a week ago, so I ran a reweight to balance them out and, in the meantime, instructed the application to purge data that was no longer required. But after a large data purge was issued from the application side (usage on all OSDs dropped below 20%), the cluster fell into this weird state for days. The "objects degraded" figure has remained negative for more than 7 days. I'm consistently seeing some IO on the OSDs, but the (negative) number of degraded objects does not change much:

2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s; -13582/1468299 objects degraded (-0.925%)
2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s; -13582/1468303 objects degraded (-0.925%)

Any idea what might be happening here? It seems the active+remapped+wait_backfill+backfill_toofull PGs are stuck?

     osdmap e43029: 36 osds: 36 up, 36 in
      pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects
            3017 GB used, 17092 GB / 20109 GB avail
            -13438/1475773 objects degraded (-0.911%)
               44713 active+clean
                   1 active+backfilling
                  20 active+remapped+wait_backfill
                  27 active+remapped+wait_backfill+backfill_toofull
                  11 active+recovery_wait
                  33 active+remapped+backfilling
                  11 active+wait_backfill+backfill_toofull
  client io 478 B/s rd, 40170 kB/s wr, 80 op/s
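To see how many PGs are actually blocked on backfill_toofull versus making progress, the state counts above can be tallied with a short Python sketch. The counts are copied from the status output; the grouping into "toofull", "backfilling", and "waiting" is just one way to slice them.

```python
# PG state counts copied from the status output above.
pg_states = {
    "active+clean": 44713,
    "active+backfilling": 1,
    "active+remapped+wait_backfill": 20,
    "active+remapped+wait_backfill+backfill_toofull": 27,
    "active+recovery_wait": 11,
    "active+remapped+backfilling": 33,
    "active+wait_backfill+backfill_toofull": 11,
}


def count_with(flag: str, without: str = "") -> int:
    """Sum PGs whose state includes `flag` and, optionally, excludes `without`."""
    total = 0
    for state, count in pg_states.items():
        flags = state.split("+")
        if flag in flags and (not without or without not in flags):
            total += count
    return total


total_pgs = sum(pg_states.values())
toofull = count_with("backfill_toofull")
backfilling = count_with("backfilling")
waiting = count_with("wait_backfill", without="backfill_toofull")
not_clean = total_pgs - pg_states["active+clean"]

print(f"total={total_pgs} not_clean={not_clean} "
      f"toofull={toofull} backfilling={backfilling} waiting={waiting}")
```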

The cluster is running v0.72.2. We are planning to upgrade it to firefly, but I would like to get the cluster into a clean state before the upgrade.

Thanks,
Fred

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
