Re: CRUSH straw2 can not handle big weight differences

Peter Linder <peter.linder@xxxxxxxxxxxxxx> · Mon, 29 Jan 2018 20:13:13 +0100



    I realize we're probably pushing it. However, it was the only option
    I could think of that would satisfy these requirements:

    
    Have separate servers for HDD and NVMe storage spread out in 3 data
    centers.

    Always select 1 NVMe and 2 HDD, in separate data centers (make sure
    NVMe is primary)

    If one data center goes down, only lose 1/3 of the NVMes.

    
    I tried making a Ceph rule to first select an NVMe based on class,
    and then select 2 HDDs based on class. However, I couldn't make it
    guarantee that they would be in separate data centers, probably
    because of the two separate chooseleaf statements. Sometimes one of
    the HDDs would end up in the same data center as the NVMe. I did play
    around with this for some time.
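
A minimal sketch of why two separate chooseleaf steps can collide: if each step picks its data centers without knowing what the other step picked, nothing stops an HDD from landing next to the NVMe. Everything here is an assumption for illustration (three equally weighted data centers with the hypothetical names dc1, dc2, dc3, and fully independent, uniform picks). Real CRUSH placement depends on weights and hashing, so the actual collision rate will differ.

```python
from itertools import combinations

# Hypothetical data center names; not taken from the actual crush map.
DATACENTERS = ("dc1", "dc2", "dc3")


def collision_probability(datacenters):
    """Fraction of equally likely outcomes where an HDD shares the NVMe's DC.

    Models step 1 as "pick one DC for the NVMe" and step 2 as
    "pick two distinct DCs for the HDDs", with no knowledge of step 1.
    """
    total = 0
    collisions = 0
    for nvme_dc in datacenters:
        for hdd_dcs in combinations(datacenters, 2):
            total += 1
            if nvme_dc in hdd_dcs:
                collisions += 1
    return collisions / total


p = collision_probability(DATACENTERS)
print(f"chance that an HDD lands in the NVMe's data center: {p:.3f}")
```

Under these assumptions the only way to rule out a collision is to make all three picks in one step, which is what the single-chooseleaf layout does.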

    
    Just selecting 3 separate data centers instead sometimes resulted in
    2 or 3 NVMes, or no NVMes at all. In fact we do have a separate pool
    with 3xNVMe for the high-performance requirements, but that uses a
    traditional "default" tree.

    
    Rearranging the osd map and reducing the rule to a single chooseleaf
    seems to work though and we will manually alter the weights outside
    of the hosts to make life easier for CRUSH :). 

    
    If we want to add more servers we will just add another layer in
    between and make sure the weights there do not differ too much when
    we plan it out. 

    
    /Peter

    
    On 2018-01-29 at 17:52, Gregory Farnum wrote:

    
      CRUSH is a pseudorandom, probabilistic algorithm.
        That can lead to problems with extreme input.
        

        In this case, you've given it a bucket in which one child
          contains ~3.3% of the total weight, and there are only three
          weights. So on only 3% of "draws", as it tries to choose a
          child bucket to descend into, will it choose that small one
          first.
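
A rough simulation of the draw described above, as a sketch only. It assumes the straw2 rule that each child gets a draw of ln(u) / weight and the largest draw wins, and it uses Python's random module in place of CRUSH's hash of the input, the item and the attempt number. The weights 1.0, 14.5 and 14.5 are hypothetical, chosen so that the small child holds 1/30 (about 3.3%) of the total.

```python
import math
import random


def straw2_choose(weights, rng):
    """Return the index of the child with the largest straw2-style draw."""
    best_index, best_draw = -1, -math.inf
    for index, weight in enumerate(weights):
        if weight <= 0:
            continue  # zero-weight children can never win
        u = 1.0 - rng.random()       # uniform in (0, 1], avoids log(0)
        draw = math.log(u) / weight  # <= 0; a larger weight pulls it toward 0
        if draw > best_draw:
            best_index, best_draw = index, draw
    return best_index


def selection_shares(weights, trials=200_000, seed=1):
    """Estimate how often each child is chosen first."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(trials):
        counts[straw2_choose(weights, rng)] += 1
    return [count / trials for count in counts]


weights = [1.0, 14.5, 14.5]  # hypothetical: small child is 1/30 of the total
shares = selection_shares(weights)
for weight, share in zip(weights, shares):
    print(f"weight {weight:5.1f}: chosen first in {share:.2%} of draws")
```

The share of first choices tracks each child's share of the total weight, so the small child is reached on only a few percent of draws.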
        And then you've forced it to select...each of the hosts in
          that data center, for all inputs? How can that even work in
          terms of actual data storage, if some of them are an order of
          magnitude larger than the others?
        

        Anyway, leaving that bit aside since it looks like you're
          mapping each host to multiple DCs, you're giving CRUSH a very
          difficult problem to solve. You can probably "fix" it by
          turning up the choose_retries value (or whatever it is) to a
          high enough level that trying to map a PG eventually actually
          grabs the small host. But I wouldn't be very confident in a
          solution like this; it seems very fragile and subject to input
          error.
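
A back-of-the-envelope sketch of how many tries that could take. It assumes every retry is an independent draw that hits the small host with a fixed probability equal to its weight share, which is a simplification of what CRUSH does on a retry. The tunable in question is probably choose_total_tries; that name is an assumption here, not something stated in this thread.

```python
import math


def tries_needed(share, target):
    """Smallest n such that 1 - (1 - share)**n >= target.

    share:  assumed per-try probability of picking the small host
    target: desired probability of having picked it at least once
    """
    if not 0.0 < share < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("share and target must both be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - share))


share = 0.033  # the ~3.3% weight share mentioned above
for target in (0.90, 0.99, 0.999):
    n = tries_needed(share, target)
    print(f"{target:.1%} chance of reaching the small host: about {n} tries")
```

The count grows quickly as the target approaches certainty, which fits the concern that this kind of fix is fragile.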
        -Greg
      
      
        On Mon, Jan 29, 2018 at 6:45 AM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:

        
          We kind of turned the crushmap inside out a little bit.

          Instead of the traditional "for 1 PG, select OSDs from 3 separate
          data centers" we did "force selection from only one datacenter
          (out of 3) and leave enough options only to make sure precisely
          1 SSD and 2 HDD are selected".

          We then organized these "virtual datacenters" in the hierarchy so
          that one of them in fact contains 3 options that lead to 3
          physically separate servers in different locations.

          Every physical datacenter has both SSDs and HDDs. The idea is
          that if one datacenter is lost, 2/3 of the SSDs still remain (and
          can be mapped to by marking the missing ones "out") so performance
          is maintained.

          
          On 2018-01-29 at 13:35, Niklas wrote:

          > Yes.
          > It is a hybrid solution where a placement group is always located
          > on one NVMe drive and two HDD drives. The advantages are great read
          > performance and cost savings. The disadvantage is low write
          > performance. Still, the write performance is good thanks to RocksDB
          > on Intel Optane disks in the HDD servers.
          >
          > The real world looks more like what I described in a previous
          > question (2018-01-23) here on the ceph-users list, "Ruleset for
          > optimized Ceph hybrid storage". Nobody answered, so I am guessing it
          > is not possible to create the rule I want. Now I am trying to solve
          > it with virtual datacenters in the crush map, which works but is
          > maybe not the most optimal solution.
          >
          >
          > On 2018-01-29 13:21, Wido den Hollander wrote:
          >>
          >>
          >> On 01/29/2018 01:14 PM, Niklas wrote:
          >>> ...
          >>>
          >>
          >> Is it your intention to put all copies of an object in only one DC?
          >>
          >> What is your exact idea behind this rule? What's the purpose?
          >>
          >> Wido
          >>
          >> _______________________________________________
          >> ceph-users mailing list
          >> ceph-users@xxxxxxxxxxxxxx
          >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

          