Re: Degraded objects while OSD is being added/filled

Hi Richard,

Thanks for the explanation, that makes perfect sense. I had missed the difference between ceph osd reweight and ceph osd crush reweight; I need to study that more closely.
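
To keep it straight for myself, the two commands side by side (just a sketch, using osd 245 as an example id):

    ceph osd reweight 245 0            # the temporary in/out weight; this is what "ceph osd out" sets to 0
    ceph osd crush reweight osd.245 0  # the crush weight; the one to zero out before actually removing the osd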

Is there a way to get ceph to prioritise fixing degraded objects over fixing misplaced ones?

-- 
  Eino Tuominen

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Richard Hesketh <richard.hesketh@xxxxxxxxxxxx>
Sent: Tuesday, July 11, 2017 16:36
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Degraded objects while OSD is being added/filled

First of all, your disk removal process needs tuning. "ceph osd out" sets the disk's reweight to 0 but NOT its crush weight; that is why you see misplaced objects after removing the osd - the crush weights change at that point, even though the reweight of 0 meant the disk no longer held any data. Use "ceph osd crush reweight osd.$X 0" to zero the OSD's crush weight first, wait for everything to rebalance, then take it out and down; you shouldn't see any extra data movement or repeering when you remove a disk that way.
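
Roughly, the drain-then-remove sequence looks like this (only a sketch; $X stands for the numeric OSD id, and how you stop the daemon depends on how your OSDs are managed):

    ceph osd crush reweight osd.$X 0   # crush stops mapping PGs to the OSD; data drains off it
    # wait for all PGs to return to active+clean
    ceph osd out $X                    # mark it out; nothing should move at this point
    systemctl stop ceph-osd@$X         # or however the osd daemon is managed on your hosts
    ceph osd crush remove osd.$X       # its crush weight is already 0, so no further rebalancing
    ceph auth del osd.$X
    ceph osd rm $X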

The degraded objects caused when adding/removing disks are probably down to writes taking place on PGs which are not fully peered, right? While not yet fully peered, the primary can't replicate writes to all secondaries, so once they are peered again and agree on state they will recognise that some secondaries have out-of-date objects. Any repeering on an active cluster could be expected to cause a relatively small number of degraded objects due to writes taking place in that window, right?

Rich

On 11/07/17 14:17, Eino Tuominen wrote:
> Hi all,
>
>
> One more example:
>
>
> *osd.109*down out weight 0 up_from 306818 up_thru 397714 down_at 397717 last_clean_interval [306031,306809) 130.232.243.80:6814/4733 192.168.70.113:6814/4733 192.168.70.113:6815/4733 130.232.243.80:6815/4733 exists cabdfaec-eb39-4e5a-8012-9bade04c5e03
>
>
> root@ceph-osd-13:~# ceph status
>
>     cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
>
>      health HEALTH_OK
>
>      monmap e3: 3 mons at {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
>
>             election epoch 356, quorum 0,1,2 0,1,2
>
>      osdmap e397837: 260 osds: 259 up, 241 in
>
>             flags require_jewel_osds
>
>       pgmap v81199361: 25728 pgs, 8 pools, 203 TB data, 89794 kobjects
>
>             613 TB used, 295 TB / 909 TB avail
>
>                25696 active+clean
>
>                   32 active+clean+scrubbing+deep
>
>   client io 587 kB/s rd, 1422 kB/s wr, 357 op/s rd, 88 op/s wr
>
>
>
> Then, I remove the osd that has been evacuated:
>
> root@ceph-osd-13:~# ceph osd crush remove osd.109
>
> removed item id 109 name 'osd.109' from crush map
>
>
> Wait a few seconds to let peering finish:
>
> root@ceph-osd-13:~# ceph status
>
>     cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
>
>      health HEALTH_WARN
>
>             484 pgs backfill_wait
>
>             81 pgs backfilling
>
>             40 pgs degraded
>
>             40 pgs recovery_wait
>
>             499 pgs stuck unclean
>
>             recovery 58391/279059773 objects degraded (0.021%)
>
>             recovery 6402109/279059773 objects misplaced (2.294%)
>
>      monmap e3: 3 mons at {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
>
>             election epoch 356, quorum 0,1,2 0,1,2
>
>      osdmap e397853: 260 osds: 259 up, 241 in; 565 remapped pgs
>
>             flags require_jewel_osds
>
>       pgmap v81199470: 25728 pgs, 8 pools, 203 TB data, 89794 kobjects
>
>             613 TB used, 295 TB / 909 TB avail
>
>             58391/279059773 objects degraded (0.021%)
>
>             6402109/279059773 objects misplaced (2.294%)
>
>                25100 active+clean
>
>                  484 active+remapped+wait_backfill
>
>                   81 active+remapped+backfilling
>
>                   40 active+recovery_wait+degraded
>
>                   23 active+clean+scrubbing+deep
>
> recovery io 2117 MB/s, 0 objects/s
>
>   client io 737 kB/s rd, 6719 kB/s wr, 119 op/s rd, 0 op/s wr
>
>
> --
>   Eino Tuominen
>
> ________________________________________
> *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Eino Tuominen <eino@xxxxxx>
> *Sent:* Monday, July 10, 2017 14:35
> *To:* Gregory Farnum; Andras Pataki; ceph-users
> *Subject:* Re:  Degraded objects while OSD is being added/filled
>
>
> [replying to my post]
>
>
> In fact, I did just this:
>
>
> 1. On a HEALTH_OK cluster, run ceph osd in 245
>
> 2. wait for the cluster to stabilise
>
> 3. witness this:
>
>
>     cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
>
>      health HEALTH_WARN
>
>             385 pgs backfill_wait
>
>             1 pgs backfilling
>
>             33 pgs degraded
>
>             33 pgs recovery_wait
>
>             305 pgs stuck unclean
>
>             recovery 73550/278276590 objects degraded (0.026%)
>
>             recovery 5151479/278276590 objects misplaced (1.851%)
>
>      monmap e3: 3 mons at {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
>
>             election epoch 356, quorum 0,1,2 0,1,2
>
>      osdmap e397402: 260 osds: 260 up, 243 in; 386 remapped pgs
>
>             flags require_jewel_osds
>
>       pgmap v81108208: 25728 pgs, 8 pools, 203 TB data, 89746 kobjects
>
>             614 TB used, 303 TB / 917 TB avail
>
>             73550/278276590 objects degraded (0.026%)
>
>             5151479/278276590 objects misplaced (1.851%)
>
>                25293 active+clean
>
>                  385 active+remapped+wait_backfill
>
>                   33 active+recovery_wait+degraded
>
>                   16 active+clean+scrubbing+deep
>
>                    1 active+remapped+backfilling
>
>
> --
>   Eino Tuominen
>
>
> ________________________________________
> *From:* Eino Tuominen
> *Sent:* Monday, July 10, 2017 14:25
> *To:* Gregory Farnum; Andras Pataki; ceph-users
> *Subject:* Re:  Degraded objects while OSD is being added/filled
>
>
> Hi Greg,
>
>
> I was not clear enough. First I set the weight to 0 (ceph osd out) and waited until the cluster was stable and healthy (all pgs active+clean). Then I removed the now-empty osds, and that was when I saw the degraded objects. I'm about to add some new disks to the cluster soon, so I can reproduce this if you'd like to see what's happening. What would help to debug it - ceph osd dump and ceph pg dump before and after the modifications?
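>
> For example, something along these lines (just a sketch; the file names are arbitrary):
>
>     ceph -s       > status.before
>     ceph osd dump > osd-dump.before
>     ceph pg dump  > pg-dump.before
>     # ... make the change (e.g. remove the evacuated osds) ...
>     ceph -s       > status.after
>     ceph osd dump > osd-dump.after
>     ceph pg dump  > pg-dump.after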
>
>
> --
>
>   Eino Tuominen
>
>
> ________________________________________
> *From:* Gregory Farnum <gfarnum@xxxxxxxxxx>
> *Sent:* Thursday, July 6, 2017 19:20
> *To:* Eino Tuominen; Andras Pataki; ceph-users
> *Subject:* Re:  Degraded objects while OSD is being added/filled
>
>
>
> On Tue, Jul 4, 2017 at 10:47 PM Eino Tuominen <eino@xxxxxx> wrote:
>
>     Hello,
>
>
>     I noticed the same behaviour in our cluster.
>
>
>     ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>
>
>
>         cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
>
>          health HEALTH_WARN
>
>                 1 pgs backfill_toofull
>
>                 4366 pgs backfill_wait
>
>                 11 pgs backfilling
>
>                 45 pgs degraded
>
>                 45 pgs recovery_wait
>
>                 45 pgs stuck degraded
>
>                 4423 pgs stuck unclean
>
>                 recovery 181563/302722835 objects degraded (0.060%)
>
>                 recovery 57192879/302722835 objects misplaced (18.893%)
>
>                 1 near full osd(s)
>
>                 noout,nodeep-scrub flag(s) set
>
>          monmap e3: 3 mons at {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
>
>                 election epoch 356, quorum 0,1,2 0,1,2
>
>          osdmap e388588: 260 osds: 260 up, 242 in; 4378 remapped pgs
>
>                 flags nearfull,noout,nodeep-scrub,require_jewel_osds
>
>           pgmap v80658624: 25728 pgs, 8 pools, 202 TB data, 89212 kobjects
>
>                 612 TB used, 300 TB / 912 TB avail
>
>                 181563/302722835 objects degraded (0.060%)
>
>                 57192879/302722835 objects misplaced (18.893%)
>
>                    21301 active+clean
>
>                     4366 active+remapped+wait_backfill
>
>                       45 active+recovery_wait+degraded
>
>                       11 active+remapped+backfilling
>
>                        4 active+clean+scrubbing
>
>                        1 active+remapped+backfill_toofull
>
>     recovery io 421 MB/s, 155 objects/s
>
>       client io 201 kB/s rd, 2034 B/s wr, 75 op/s rd, 0 op/s wr
>
>
>     I'm currently doing a rolling migration from Puppet on Ubuntu to Ansible on RHEL. I started with a healthy cluster, evacuated some nodes by setting their weight to 0, removed them from the cluster, and re-added them with the ansible playbook.
>
>     Basically I ran
>
>             ceph osd crush remove osd.$num
>
>             ceph osd rm $num
>
>             ceph auth del osd.$num
>
>
>     in a loop for the osds I was replacing, and then let the ansible ceph-osd playbook bring the host back into the cluster. The crushmap is attached.
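>
>     A rough sketch of that loop (assuming $nums holds the ids of the osds being replaced):
>
>             for num in $nums; do
>                 ceph osd crush remove osd.$num
>                 ceph osd rm $num
>                 ceph auth del osd.$num
>             done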
>
>
> This case is different. If you are removing OSDs before they've had the chance to offload themselves, objects are going to be degraded since you're removing a copy! :)
> -Greg
>
>
>
>     --
>       Eino Tuominen
>
>
>     ________________________________________
>     *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Gregory Farnum <gfarnum@xxxxxxxxxx>
>     *Sent:* Friday, June 30, 2017 23:38
>     *To:* Andras Pataki; ceph-users
>     *Subject:* Re:  Degraded objects while OSD is being added/filled
>
>     On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>         Hi cephers,
>
>         I noticed something I don't understand about ceph's behavior when adding an OSD.  When I start with a clean cluster (all PGs active+clean) and add an OSD (via ceph-deploy for example), the crush map gets updated and PGs get reassigned to different OSDs, and the new OSD starts getting filled with data.  As the new OSD gets filled, I start seeing PGs in degraded states.  Here is an example:
>
>                   pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
>                         3164 TB used, 781 TB / 3946 TB avail
>             *            8017/994261437 objects degraded (0.001%)*
>                         2220581/994261437 objects misplaced (0.223%)
>                            42393 active+clean
>                               91 active+remapped+wait_backfill
>                                9 active+clean+scrubbing+deep
>             *                   1 active+recovery_wait+degraded*
>                                1 active+clean+scrubbing
>                                1 active+remapped+backfilling
>
>
>         Any ideas why there would be any persistent degradation in the cluster while the newly added drive is being filled?  It takes perhaps a day or two to fill the drive - and during all this time the cluster seems to be running degraded.  As data is written to the cluster, the number of degraded objects increases over time.  Once the newly added OSD is filled, the cluster comes back to clean again.
>
>         Here is the PG that is degraded in this picture:
>
>         7.87c    1    0    2    0    0    4194304    7    7    active+recovery_wait+degraded    2017-06-20 14:12:44.119921    344610'7    583572:2797    [402,521]    402    [402,521]    402    344610'7    2017-06-16 06:04:55.822503    344610'7    2017-06-16 06:04:55.822503
>
>         The newly added osd here is 521.  Before it got added, this PG had two replicas clean, but one got forgotten somehow?
>
>
>     This sounds a bit concerning at first glance. Can you provide some output of exactly what commands you're invoking, and the "ceph -s" output as it changes in response?
>
>     I really don't see how adding a new OSD can result in it "forgetting" about existing valid copies — it's definitely not supposed to — so I wonder if there's a collision in how it's deciding to remove old locations.
>
>     Are you running with only two copies of your data? It shouldn't matter but there could also be errors resulting in a behavioral difference between two and three copies.
>     -Greg
>
>
>
>         Other remapped PGs have 521 in their "up" set but still have the two existing copies in their "acting" set - and no degradation is shown.  Examples:
>
>         2.f24    14282    0    16    28564    0    51014850801    3102    3102    active+remapped+wait_backfill    2017-06-20 14:12:42.650308    583553'2033479    583573:2033266    [467,521]    467    [467,499]    467    582430'2033337    2017-06-16 09:08:51.055131    582036'2030837    2017-05-31 20:37:54.831178
>         6.2b7d    10499    0    140    20998    0    37242874687    3673    3673    active+remapped+wait_backfill    2017-06-20 14:12:42.070019    583569'165163    583572:342128    [541,37,521]    541    [541,37,532]    541    582430'161890    2017-06-18 09:42:49.148402    582430'161890    2017-06-18 09:42:49.148402
>
>         We are running the latest Jewel patch level everywhere (10.2.7).  Any insights would be appreciated.
>
>         Andras
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



