Re: Degraded objects while OSD is being added/filled

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Thu, 20 Jul 2017 11:13:48 -0400



    Hi Greg,

    
    I have just now added a single drive/osd to a clean cluster, and can
    see the degradation immediately.  We are on ceph 10.2.9 everywhere.

    
    Here is how the cluster looked before the OSD got added:

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
           health HEALTH_WARN
                  noout flag(s) set
           monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                  election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
            fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
           osdmap e681227: 1270 osds: 1270 up, 1270 in
                  flags noout,sortbitwise,require_jewel_osds
            pgmap v54583934: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                  4471 TB used, 3416 TB / 7887 TB avail
                     42491 active+clean
                         5 active+clean+scrubbing+deep
        client io 2193 kB/s rd, 27240 kB/s wr, 85 op/s rd, 47 op/s wr

    
    And this is shortly after it was added (after all the peering was
    done):

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
           health HEALTH_WARN
                  141 pgs backfill_wait
                  117 pgs backfilling
                  20 pgs degraded
                  20 pgs recovery_wait
                  56 pgs stuck unclean
                  recovery 130/1376744346 objects degraded (0.000%)
                  recovery 3827502/1376744346 objects misplaced (0.278%)
                  noout flag(s) set
           monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                  election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
            fsmap e26638: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
           osdmap e681238: 1271 osds: 1271 up, 1271 in; 258 remapped pgs
                  flags noout,sortbitwise,require_jewel_osds
            pgmap v54585141: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                  4471 TB used, 3423 TB / 7895 TB avail
                  130/1376744346 objects degraded (0.000%)
                  3827502/1376744346 objects misplaced (0.278%)
                     42210 active+clean
                       141 active+remapped+wait_backfill
                       117 active+remapped+backfilling
                        20 active+recovery_wait+degraded
                         7 active+clean+scrubbing+deep
                         1 active+clean+scrubbing
      recovery io 17375 MB/s, 5069 objects/s
        client io 12210 kB/s rd, 29887 kB/s wr, 4 op/s rd, 140 op/s wr

    
    Even though there was no failure, we have 20 degraded PGs, and 130
    degraded objects.  My expectation was for some data to move around,
    start filling the added drive, but I would not expect to see
    degraded objects or PGs.

    
    Also, as time passes, the number of degraded objects increases
    steadily.  Here is a snapshot from a little later:

        cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
           health HEALTH_WARN
                  63 pgs backfill_wait
                  4 pgs backfilling
                  67 pgs stuck unclean
                  recovery 706/1377244134 objects degraded (0.000%)
                  recovery 843267/1377244134 objects misplaced (0.061%)
                  noout flag(s) set
           monmap e31: 3 mons at {cephmon00=10.128.128.100:6789/0,cephmon01=10.128.128.101:6789/0,cephmon02=10.128.128.102:6789/0}
                  election epoch 46092, quorum 0,1,2 cephmon00,cephmon01,cephmon02
            fsmap e26640: 1/1/1 up {0=cephmon01=up:active}, 2 up:standby
           osdmap e681569: 1271 osds: 1271 up, 1271 in; 67 remapped pgs
                  flags noout,sortbitwise,require_jewel_osds
            pgmap v54588554: 42496 pgs, 6 pools, 1488 TB data, 437 Mobjects
                  4471 TB used, 3423 TB / 7895 TB avail
                  706/1377244134 objects degraded (0.000%)
                  843267/1377244134 objects misplaced (0.061%)
                     42422 active+clean
                        63 active+remapped+wait_backfill
                         5 active+clean+scrubbing+deep
                         4 active+remapped+backfilling
                         2 active+clean+scrubbing
      recovery io 779 MB/s, 229 objects/s
        client io 306 MB/s rd, 344 MB/s wr, 138 op/s rd, 226 op/s wr

    
    From past experience, the degraded object count keeps going up for
    most of the time the disk is being filled, and decreases only
    towards the end.  Could writing to a pool that is waiting for
    backfill be causing the degraded objects to appear?
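
The change between the two snapshots above can be checked with a short
script.  This is only a minimal Python sketch, assuming the status text
uses the "N/M objects degraded" form shown in the output above; the
function name object_counts is hypothetical and not part of any Ceph
tooling.

```python
import re

# Matches lines such as "130/1376744346 objects degraded (0.000%)".
COUNT_RE = re.compile(r"(\d+)/(\d+) objects (degraded|misplaced)")


def object_counts(status_text):
    """Return {'degraded': (n, total), 'misplaced': (n, total)} found in the text."""
    counts = {}
    for n, total, kind in COUNT_RE.findall(status_text):
        counts[kind] = (int(n), int(total))
    return counts


# The two recovery snapshots quoted in this message.
first = object_counts("""
    130/1376744346 objects degraded (0.000%)
    3827502/1376744346 objects misplaced (0.278%)
""")
later = object_counts("""
    706/1377244134 objects degraded (0.000%)
    843267/1377244134 objects misplaced (0.061%)
""")

for label, snap in (("first", first), ("later", later)):
    for kind, (n, total) in sorted(snap.items()):
        # More decimal places than "ceph -s" prints, so small counts stay visible.
        print(f"{label}: {n} {kind} of {total} ({100.0 * n / total:.5f}%)")
```

In these two snapshots the misplaced count falls while the degraded
count rises, which is the pattern being asked about.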

    
    I took a 'pg dump' before and after the change, as well as an 'osd
    tree' before and after.  All these are available at
http://voms.simonsfoundation.org:50013/m1Maf76sV1kS95spXQpijycmne92yjm/ceph-20170720/

    
    All pools are now replicated with size 3 and min size 2.  Let me
    know if any other info would be helpful.

    
    Andras

    
    On 07/06/2017 02:30 PM, Andras Pataki wrote:

    
      Hi Greg,

      
      At the moment our cluster is all in balance.  We have one failed
      drive that will be replaced in a few days (the OSD has been
      removed from ceph and will be re-added with the replacement
      drive).  I'll document the state of the PGs before the addition of
      the drive and during the recovery process and report back.

      
      We have a few pools.  Most are on 3 replicas now; some holding
      non-critical data that we also have elsewhere are on 2.  But I've
      seen the degradation even on the 3-replica pools (I think my
      original example included such a pool as well).

      
      Andras

      
      On 06/30/2017 04:38 PM, Gregory Farnum wrote:

      
        On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:

          
               Hi cephers,

                
                I noticed something I don't understand about ceph's
                behavior when adding an OSD.  When I start with a clean
                cluster (all PGs active+clean) and add an OSD (via
                ceph-deploy for example), the crush map gets updated and
                PGs get reassigned to different OSDs, and the new OSD
                starts getting filled with data.  As the new OSD gets
                filled, I start seeing PGs in degraded states.  Here is
                an example:

                
                      pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
                            3164 TB used, 781 TB / 3946 TB avail
                            8017/994261437 objects degraded (0.001%)
                            2220581/994261437 objects misplaced (0.223%)
                               42393 active+clean
                                  91 active+remapped+wait_backfill
                                   9 active+clean+scrubbing+deep
                                   1 active+recovery_wait+degraded
                                   1 active+clean+scrubbing
                                   1 active+remapped+backfilling

                  
                Any ideas why there would be any persistent degradation
                in the cluster while the newly added drive is being
                filled?  It takes perhaps a day or two to fill the drive
                - and during all this time the cluster seems to be
                running degraded.  As data is written to the cluster,
                the number of degraded objects increases over time. 
                Once the newly added OSD is filled, the cluster comes
                back to clean again.

                
                Here is the PG that is degraded in this picture:

                
                7.87c    1    0    2    0    0    4194304    7    7    active+recovery_wait+degraded    2017-06-20 14:12:44.119921    344610'7    583572:2797    [402,521]    402    [402,521]    402    344610'7    2017-06-16 06:04:55.822503    344610'7    2017-06-16 06:04:55.822503

                
                The newly added osd here is 521.  Before it got added,
                this PG had two replicas clean, but one got forgotten
                somehow?

              
            This sounds a bit concerning at first glance. Can you
              provide some output of exactly what commands you're
              invoking, and the "ceph -s" output as it changes in
              response?
            

            I really don't see how adding a new OSD can result in
              it "forgetting" about existing valid copies — it's
              definitely not supposed to — so I wonder if there's a
              collision in how it's deciding to remove old locations.
            

            Are you running with only two copies of your data? It
              shouldn't matter but there could also be errors resulting
              in a behavioral difference between two and three copies.
            -Greg
             
            
                Other remapped PGs have 521 in their "up" set but still
                have the two existing copies in their "acting" set - and
                no degradation is shown.  Examples:

                
                2.f24    14282    0    16    28564    0    51014850801    3102    3102    active+remapped+wait_backfill    2017-06-20 14:12:42.650308    583553'2033479    583573:2033266    [467,521]    467    [467,499]    467    582430'2033337    2017-06-16 09:08:51.055131    582036'2030837    2017-05-31 20:37:54.831178

                6.2b7d    10499    0    140    20998    0    37242874687    3673    3673    active+remapped+wait_backfill    2017-06-20 14:12:42.070019    583569'165163    583572:342128    [541,37,521]    541    [541,37,532]    541    582430'161890    2017-06-18 09:42:49.148402    582430'161890    2017-06-18 09:42:49.148402
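
The up/acting comparison described above can be done mechanically.  The
following is a minimal Python sketch, assuming (as in the lines quoted
in this thread) that the first bracketed set on a pg dump line is the
"up" set and the second is the "acting" set.  The function names are
hypothetical, and the sample lines are abbreviated to the state and the
two sets.

```python
import re

# A bracketed, comma-separated list of OSD ids, e.g. "[541,37,521]".
SET_RE = re.compile(r"\[([\d,]+)\]")


def up_and_acting(pg_line):
    """Return (up, acting): the first two bracketed OSD sets on the line."""
    sets = [[int(x) for x in m.split(",")] for m in SET_RE.findall(pg_line)]
    return sets[0], sets[1]


def pending_osds(pg_line):
    """OSDs that are in the up set but not yet in the acting set."""
    up, acting = up_and_acting(pg_line)
    return [osd for osd in up if osd not in acting]


# Abbreviated copies of the three PG lines quoted in this thread.
lines = {
    "7.87c": "active+recovery_wait+degraded [402,521] 402 [402,521] 402",
    "2.f24": "active+remapped+wait_backfill [467,521] 467 [467,499] 467",
    "6.2b7d": "active+remapped+wait_backfill [541,37,521] 541 [541,37,532] 541",
}

for pgid, line in lines.items():
    up, acting = up_and_acting(line)
    print(pgid, "up", up, "acting", acting, "pending", pending_osds(line))
```

For the two remapped PGs the new OSD 521 shows up as pending, while for
7.87c it is already in the acting set, which matches the difference
described above.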

                
                We are running the latest Jewel patch level everywhere
                (10.2.7).  Any insights would be appreciated.

                
                Andras

                
              _______________________________________________

              ceph-users mailing list

              ceph-users@xxxxxxxxxxxxxx

              http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

            