Re: Ceph full cluster

Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Mon, 26 Sep 2016 13:05:18 +0200



    Hi,

    
    On 09/26/2016 12:58 PM, Dmitriy Lock wrote:

    
      Hello all!
        I need some help with my Ceph cluster.
        I've installed a Ceph cluster on two physical servers, each
          with one 40G OSD on /data.
        Here is ceph.conf:
        [global]
            fsid = 377174ff-f11f-48ec-ad8b-ff450d43391c
            mon_initial_members = vm35, vm36
            mon_host = 192.168.1.35,192.168.1.36
            auth_cluster_required = cephx
            auth_service_required = cephx
            auth_client_required = cephx
            osd pool default size = 2  # Write an object 2 times.
            osd pool default min size = 1 # Allow writing one copy in a degraded state.
            osd pool default pg num = 200
            osd pool default pgp num = 200

          
        Right after creation it was HEALTH_OK, and I started
          filling it. I wrote 40G of data to the cluster through the
          RADOS gateway, but the cluster used all available space and
          kept growing after I added two more OSDs, a 10G /data1 on
          each server.
        Here is the tree output:
        # ceph osd tree
            ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
            -1 0.09756 root default
            -2 0.04878     host vm35
             0 0.03899         osd.0       up  1.00000          1.00000
             2 0.00980         osd.2       up  1.00000          1.00000
            -3 0.04878     host vm36
             1 0.03899         osd.1       up  1.00000          1.00000
             3 0.00980         osd.3       up  1.00000          1.00000

          
        and health:
        root@vm35:/etc# ceph health
            HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs
            stuck unclean; 15 pgs undersized; recovery 87176/300483
            objects degraded (29.012%); recovery 62272/300483 objects
            misplaced (20.724%); 1 full osd(s); 2 near full osd(s);
            pool default.rgw.buckets.data has many more objects per pg
            than average (too few pgs?)

        root@vm35:/etc# ceph health detail
            HEALTH_ERR 5 pgs backfill_toofull; 15 pgs degraded; 16 pgs
            stuck unclean; 15 pgs undersized; recovery 87176/300483
            objects degraded (29.012%); recovery 62272/300483 objects
            misplaced (20.724%); 1 full osd(s); 2 near full osd(s);
            pool default.rgw.buckets.data has many more objects per pg
            than average (too few pgs?)
            pg 10.5 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,0]
            pg 9.6 is stuck unclean since forever, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
            pg 10.4 is stuck unclean since forever, current state active+remapped, last acting [3,0,1]
            pg 9.7 is stuck unclean since forever, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
            pg 10.7 is stuck unclean since forever, current state active+undersized+degraded+remapped+backfill_toofull, last acting [0,1]
            pg 9.4 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,0]
            pg 9.1 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,3]
            pg 10.2 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,0]
            pg 9.0 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,2]
            pg 10.3 is stuck unclean since forever, current state active+undersized+degraded, last acting [2,1]
            pg 9.3 is stuck unclean since forever, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
            pg 10.0 is stuck unclean since forever, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,0]
            pg 9.2 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,1]
            pg 10.1 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,1]
            pg 9.5 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,0]
            pg 10.6 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,1]
            pg 9.1 is active+undersized+degraded, acting [0,3]
            pg 10.2 is active+undersized+degraded, acting [1,0]
            pg 9.0 is active+undersized+degraded, acting [1,2]
            pg 10.3 is active+undersized+degraded, acting [2,1]
            pg 9.3 is active+undersized+degraded+remapped+backfill_toofull, acting [1,0]
            pg 10.0 is active+undersized+degraded+remapped+backfill_toofull, acting [1,0]
            pg 9.2 is active+undersized+degraded, acting [0,1]
            pg 10.1 is active+undersized+degraded, acting [0,1]
            pg 9.5 is active+undersized+degraded, acting [1,0]
            pg 10.6 is active+undersized+degraded, acting [0,1]
            pg 9.4 is active+undersized+degraded, acting [1,0]
            pg 10.7 is active+undersized+degraded+remapped+backfill_toofull, acting [0,1]
            pg 9.7 is active+undersized+degraded+remapped+backfill_toofull, acting [1,0]
            pg 9.6 is active+undersized+degraded+remapped+backfill_toofull, acting [1,0]
            pg 10.5 is active+undersized+degraded, acting [1,0]
            recovery 87176/300483 objects degraded (29.012%)
            recovery 62272/300483 objects misplaced (20.724%)
            osd.1 is full at 95%
            osd.2 is near full at 91%
            osd.3 is near full at 91%
            pool default.rgw.buckets.data objects per pg (12438) is more than 17.8451 times cluster average (697)

        In the log I see this:
            2016-09-26 10:37:21.688849 mon.0 192.168.1.35:6789/0 4836 : cluster [INF] pgmap v8364: 144 pgs: 5 active+undersized+degraded+remapped+backfill_toofull, 1 active+remapped, 128 active+clean, 10 active+undersized+degraded; 33090 MB data, 92431 MB used, 9908 MB / 102340 MB avail; 87176/300483 objects degraded (29.012%); 62272/300483 objects misplaced (20.724%)
            2016-09-26 10:37:22.192322 osd.3 192.168.1.36:6804/3840 11 : cluster [WRN] OSD near full (91%)
            2016-09-26 10:37:38.295580 osd.1 192.168.1.36:6800/4014 16 : cluster [WRN] OSD near full (95%)

            
        How can I solve this issue? Why is my cluster using much
            more space than I put into it? (I wrote 40G with two
            replicas, so I expect the cluster to use 80G.)
        What am I doing wrong?
      
    
    You are probably using a pool replication factor of 3: the log
    shows 33090 MB of data against 92431 MB used, a ratio of about 2.8.
    You can check the pool replication factor using
    'ceph osd pool ls detail'; the 'size' value is the replication
    factor.
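
    As a rough sketch of that check: the first helper below redoes the
    used/data arithmetic from the pgmap line, and the second prints the
    'size' of every pool. The function names raw_ratio and
    show_pool_sizes are made up for this example; the ceph subcommands
    are the standard ones, and they need a reachable cluster and an
    admin keyring.

```shell
#!/bin/sh
# Ratio of raw space used to logical data stored.
# Arguments: data in MB, used in MB (hypothetical helper name).
raw_ratio() {
    awk -v d="$1" -v u="$2" 'BEGIN { printf "%.2f\n", u / d }'
}

# Print the replica count ('size') of every pool, one pool per line.
# Hypothetical helper; it only wraps 'ceph osd pool ls' and
# 'ceph osd pool get <pool> size'.
show_pool_sizes() {
    for pool in $(ceph osd pool ls); do
        printf '%s: ' "$pool"
        ceph osd pool get "$pool" size
    done
}

# Numbers taken from the pgmap log line quoted above.
raw_ratio 33090 92431
```

    A result close to 3 rather than 2 points at the pools having been
    created with size 3. It will not be exactly 3 here, presumably
    because the undersized PGs hold fewer copies than configured.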

    
    You can change the replication factor on the fly by changing that
    value, but keep in mind that a replication factor of 2 is not
    recommended for production use. You may also want to adjust the
    min_size value.
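
    A minimal sketch of that change, assuming the pool in question is
    default.rgw.buckets.data (the one named in the health output) and
    that size 2 / min_size 1 is really what you want on this test
    setup. The wrapper name set_pool_replication is made up for this
    example; it only calls 'ceph osd pool set'.

```shell
#!/bin/sh
# Set the replica count and the minimum replica count on one pool.
# Arguments: pool name, size, min_size (hypothetical helper name).
set_pool_replication() {
    ceph osd pool set "$1" size "$2"
    ceph osd pool set "$1" min_size "$3"
}

# Example call (not run here; it needs a live cluster):
#   set_pool_replication default.rgw.buckets.data 2 1
```

    The same call would have to be repeated for every other pool that
    'ceph osd pool ls detail' reports with size 3.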

    
    Regards,

    Burkhard
  

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com