I understand better now, thanks to Frank's & Paul's messages.
Paul, when min_size=k, is it the same problem as with a replicated pool with size=2 & min_size=1?
On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size
is... unfortunate :(
The reason why min_size = k is bad is just that it means
you are accepting writes without guaranteeing durability while
you are in a degraded state.
A durable storage system should never tell a client "okay, I've written your data" if losing a single disk leads to data loss.
Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems during rebuild (with 1 or 2 disk failures for RAID 5/6 respectively), but that doesn't mean it's a good idea.
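To make that durability argument concrete, here is a small Python sketch (an illustration only, not Ceph code; the function and parameter names are mine) of how much margin a freshly acknowledged write has, given k, m, min_size and the number of shards currently up:

# Illustration only: how many further shard losses newly written data survives.
def write_safety_margin(k, m, min_size, shards_up):
    """Return the number of additional shard losses fresh writes can survive,
    or None if the PG is below min_size and refuses client I/O."""
    assert k <= min_size <= k + m and shards_up <= k + m
    if shards_up < min_size:
        return None           # PG inactive: no client writes accepted
    return shards_up - k      # shards that may still fail before data loss

# 8+2 pool, two shards already lost:
print(write_safety_margin(k=8, m=2, min_size=8, shards_up=8))  # 0 -> next failure loses the new data
print(write_safety_margin(k=8, m=2, min_size=9, shards_up=8))  # None -> writes blocked instead

In other words, min_size=k+1 trades availability (writes stop one failure earlier) for a guaranteed margin of at least one redundant shard for everything the cluster acknowledges.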
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, May 20, 2019 at 7:37 PM Frank Schilder <frans@xxxxxx> wrote:
This is an issue that comes up every now and then (for example:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg50415.html)
and I would consider it a very serious one (I will give an example below).
A statement like "min_size = k is unsafe and should never be set" deserves
a bit more explanation, because Ceph is the only storage system I know of
for which k+m redundancy does *not* mean "you can lose up to m disks and
still have read-write access". If this is really true then, assuming the
same redundancy level, losing service (client access) is significantly
more likely with Ceph than with other storage systems. And this has an
impact on design and storage pricing.
However, some help seems to be on the way, and an, in my opinion, extremely
important feature update seems almost finished:
https://github.com/ceph/ceph/pull/17619. It will implement the following:
- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- the recommended setting is min_size=k+1 (this might be wrong)
This is pretty good and much better than the current behaviour
(see below). This pull request also offers useful further
information.
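If I read the pull request correctly, the old and new gating can be summarised
in a tiny Python model (my own simplification with made-up parameter names,
not actual Ceph code):

# Rough model of when a PG allows recovery and client I/O, before and after
# the change in the pull request (my simplification, not Ceph logic).
def pg_io(k, min_size, shards_up, recovery_needs_only_k):
    recovery_ok = shards_up >= (k if recovery_needs_only_k else min_size)
    client_io_ok = shards_up >= min_size
    return recovery_ok, client_io_ok

# 8+2 pool with min_size=9 and only 8 shards left (the situation described below):
print(pg_io(k=8, min_size=9, shards_up=8, recovery_needs_only_k=False))  # (False, False): stuck
print(pg_io(k=8, min_size=9, shards_up=8, recovery_needs_only_k=True))   # (True, False): recovers, writes blocked

The important difference is the first value: with the old behaviour the cluster
can neither repair itself nor serve clients once fewer than min_size shards remain.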
Apparently, there is some kind of rare issue with erasure coding in Ceph that
makes it problematic to use min_size=k. I couldn't find anything better than
vague explanations. Quote from the thread above: "Recovery on EC pools
requires min_size rather than k shards at this time. There were reasons;
they weren't great."
This is actually a situation I was in. I once lost 2 failure domains
simultaneously on an 8+2 EC pool and was really surprised that recovery
stopped after some time with the worst degraded PGs remaining unfixed. I
discovered that min_size was 9 (instead of 8) and "ceph health detail"
recommended reducing min_size. Before doing so, I searched the web (I mean,
why the default k+1? Come on, there must be a reason.) and found some vague
hints about problems with min_size=k during rebuild.
This is a really bad corner to be in. A lot of PGs were already critically
degraded and the only way forward was to make a bad situation worse, because
reducing min_size would immediately enable client I/O in addition to
recovery I/O.
It looks like the default of min_size=k+1 will stay, because min_size=k does
have some rare issues and these seem unlikely to disappear. (I hope I'm wrong
though.) Hence, if min_size=k remains problematic, the recommendation should
be "never use m=1" instead of "never use min_size=k". In other words, instead
of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like
to have secure write access for n disk losses, then m>=n+1.
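A quick sanity check of that rule of thumb in Python (my own arithmetic,
assuming the recommended min_size=k+1):

# With min_size = k+1, a PG stays writable as long as at most
# (k+m) - (k+1) = m-1 shards are lost.
def writable_losses(k, m):
    min_size = k + 1
    return (k + m) - min_size   # = m - 1

for k, m in [(2, 1), (4, 2), (8, 2)]:
    print(f"{k}+{m}: client I/O survives {writable_losses(k, m)} lost shard(s)")
# 2+1: 0 -> any single loss blocks writes; 4+2 and 8+2: 1 loss tolerated

Which is exactly the m>=n+1 condition above: to keep writing through n disk
losses you need m-1>=n.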
If this issue remains, in my opinion this should be taken up in the best
practices section. In particular, the documentation should not use examples
with m=1, as this gives the wrong impression. Either min_size=k is safe or it
is not. If it is not, it should never be used anywhere in the documentation.
I hope I marked my opinions and hypotheses clearly and that the links are
helpful. If anyone could shed some light on why exactly min_size=k+1 is
important, I would be grateful.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14