Hi, I'm also new, but I'll try to help. IMHO most of the pros here would
be quite worried about this cluster if it is production:

-A prod ceph cluster should not be run with size=2 min_size=1, because:
--In case of a down'ed osd / host the cluster can have problems
  determining which copy of the data is correct when the osd / host
  comes back up
--If an osd dies, the remaining osds get more io (they have to
  compensate for the lost io capacity plus the rebuilding), which can
  instantly kill another disk that is close to death (not with ceph,
  but with raid I have been there)
--If an osd dies and ANY other osd serving that pool has a well placed
  inconsistency, like bitrot, you'll lose data
  (you can check what your pools are set to right now, see the commands
  sketched below)
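If you want to double-check the current replication settings, something
like this should show them (untested here; the pool name "rbd" is only
an example, use your own pool names):

  ceph osd pool ls detail         # lists size / min_size for every pool
  ceph osd pool get rbd size      # replica count of one pool
  ceph osd pool get rbd min_size  # copies required to keep serving io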
-There are not enough hosts in your setup, or rather the disks are not
 distributed well:
--If an osd / host dies, the cluster tries to repair itself and relocate
  the data onto another host. In your config there is no other host to
  relocate the data to if ANY of the hosts fail (I guess that hdds and
  ssds are separated); see the crush checks sketched below.
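To see how the osds are spread over the hosts and what failure domain
the crush rules actually use, something like this should do (these are
stock commands, nothing exotic):

  ceph osd tree             # osd -> host layout
  ceph osd crush rule dump  # the chooseleaf step should say "type host"
  ceph osd pool ls detail   # shows which crush_rule each pool uses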
-The disks should not be placed in raid arrays if it can be avoided,
 especially raid0:
--You multiply the probability of an unrecoverable disk error, and since
  the data is striped, the other disks' data is lost too
--When an osd dies, the cluster should relocate its data onto another
  osd. With raid0 there is now double the data that needs to be moved,
  which causes 2 problems: recovery time / io, and free space. The
  cluster should have enough free space to relocate data to; in this
  setup you cannot do that if a host dies (see above), but if an osd
  dies, ceph would try to replicate the data onto the other osds in the
  same machine. So you have to have enough free space on >>the same
  host<< in this setup to replicate data to (see the free-space checks
  sketched below).
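To get an idea of how much headroom you actually have, something like
this (the ratio names below are the luminous defaults, older releases
keep them in the config instead of the osdmap):

  ceph osd df   # per-osd utilization and variance
  ceph df       # per-pool usage
  # recovery stalls once osds hit these thresholds:
  ceph osd dump | grep -E 'full_ratio|nearfull_ratio|backfillfull_ratio'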
In your case, I would recommend:
-Introducing (and activating) a fourth osd host
-Setting size=3 min_size=2
-After data migration is done, separating the raid0 arrays one by one:
 (remove, split) -> (zap, init, add) for each array, in such a manner
 that hdds and ssds end up evenly distributed across the servers
-Always keeping enough free space that the cluster can lose a host and
 still have room to repair itself (calculate with the full / nearfull
 ratio settings)

Roughly like the commands sketched below.
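Very roughly, and untested -- the pool name "rbd", osd id 12 and device
/dev/sdb are only placeholders for your own values, and the
purge / ceph-volume commands assume luminous:

  # raise the replication level per pool
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2

  # then, for each raid0-backed osd, one at a time, and only while the
  # cluster is HEALTH_OK:
  ceph osd out 12
  # ...wait for the rebalance to finish (watch ceph -s), then:
  ceph osd purge 12 --yes-i-really-mean-it
  # break up the raid0 array in the controller, then re-add the members
  # as individual osds:
  ceph-volume lvm zap /dev/sdb --destroy
  ceph-volume lvm create --data /dev/sdb

Repeat until every disk is its own osd, and keep an eye on ceph osd df
while doing it so you never run into the full ratio.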
I hope this helps, and please keep in mind that I'm a noob too :)

Denes.