On 07/02/15 19:13, Shane Gibson wrote:
Lionel - thanks for the feedback ... inline below ...
Ouch. These spinning disks are probably a bottleneck:
the advice regularly given on this list is to use one DC SSD
for 4 OSDs. You would probably be better off with a dedicated
journal partition at the beginning of each OSD disk or, worse,
a journal file on the OSD's filesystem, but either should still
be better than several journals on one shared spinning disk.
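For what it's worth, the dedicated-partition layout could be set
up roughly like this with ceph-disk (an untested sketch: the
device name /dev/sdb and the 10 GiB journal size are only
placeholders):

    # Journal partition first, filestore on the rest of the disk.
    parted -s /dev/sdb mklabel gpt \
        mkpart journal 1MiB 10GiB \
        mkpart data 10GiB 100%
    # Prepare/activate the OSD with its journal on that first partition.
    ceph-disk prepare /dev/sdb2 /dev/sdb1
    ceph-disk activate /dev/sdb2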
I understand the benefit of journals on SSDs - but if you
don't have them, you don't have them. With that in mind, I'm
completely open to any ideas on the "best structuring" of
7200 rpm disks across journal/OSD roles. I'm open to
playing around with performance testing various scenarios.
Again - we realize this is "less than optimal", but I would
like to explore tweaking and tuning this setup for "the best
possible performance" you can get out of it.
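For that testing, raw write and read numbers for each layout can
be compared with rados bench against a throwaway pool; a rough
sketch (the pool name and PG count are arbitrary):

    ceph osd pool create benchtest 256 256
    rados bench -p benchtest 60 write --no-cleanup
    rados bench -p benchtest 60 seq
    rados bench -p benchtest 60 rand
    ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it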
It's choosing between bad and worse. To keep it simple: you write
roughly as much to the journals as to the filestores. If the
device you use for multiple journals is no better than the ones
you use for filestores, you introduce a bottleneck (each
additional journal divides the available bandwidth). If you want
to remove the bottleneck you either put one device per journal
(twice the bandwidth but half the storage space) or you use
devices with more bandwidth (both sequential and random): SSDs.
If you don't have access to SSDs, you have to reach a compromise
between available space (journal stored on the same disk as the
filestore) and performance (journal on a dedicated disk).
If you put several journals on the same disk in your current
configuration you most probably restrict both performance and
available space. I wouldn't even try to put 2 journals on the
same disk: you end up at the performance level of an OSD with
filestore and journal on the same disk, but you have sacrificed
one third of your storage space.
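To put rough numbers on that (assuming ~150 MB/s of sequential
bandwidth per 7200 rpm spindle, which is only a ballpark figure):

    SPINDLE_MBPS=150   # ballpark sequential bandwidth per spindle
    # Journal and filestore co-located: every write hits the disk twice.
    echo "journal co-located:        ~$((SPINDLE_MBPS / 2)) MB/s per OSD, 1 disk per OSD"
    # One spindle carrying the journals of 2 OSDs: that shared disk
    # caps the combined write traffic of both OSDs at its own bandwidth.
    echo "2 journals on one spindle: ~$((SPINDLE_MBPS / 2)) MB/s per OSD, 1.5 disks per OSD"

Same per-OSD write throughput, but a third of the spindles no
longer store any data.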
Anyway, given that you get to use 720 disks (12 disks on 60
servers), I'd still prefer your setup to mine (24 OSDs): even
with what I consider a bottleneck, your setup probably has far
more bandwidth ;-)
My understanding from reading the Ceph docs was that putting the
journal on the OSD disks was strongly considered a "very bad
idea", due to the IO operations between the journal and the OSD
filestore creating contention on the disk itself.
Yes, this is true. But if you create even more contention
elsewhere, you are going from bad to worse.
Like I said - I'm open to testing this configuration ...
and probably will. We're finalizing our build/deployment
harness right now to be able to modify the architecture of the
OSDs with a fresh build fairly easily.
A reaction to one of your earlier mails:
You said you are going to 8TB drives. The problem isn't so much
the time needed to create new replicas when an OSD fails as the
time needed to fill a freshly installed one. Rebalancing is much
faster when you add 4 x 2TB drives than 1 x 8TB drive.
Why should it matter how long it takes a single drive to
"fill"??
This depends. Let's assume you keep a stack of new drives as
spares. If you use them to replace faulty drives while
rebalancing is going on (i.e. PGs trying to get back to "size"
replicas), which becomes more and more likely as the number of
disks grows (meaning less and less time when the whole cluster
isn't repairing something somewhere), then the bigger the disks
are, the more contention they bring and the longer your cluster
spends repairing. In extreme cases you might fall below min_size
for some PGs, which would block some IOs, or even lose data.
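The relevant settings and state can at least be watched;
something along these lines, using the default rbd pool as an
example:

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # degraded/undersized PGs (and anything blocking IO) show up in:
    ceph health detail
    ceph pg dump_stuck unclean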
You'll have to compute the probabilities yourself given the
likely scenario for your cluster (the risk might very well be
negligible), but larger drives may not be safer.
Another issue is performance: you'll get 4x more IOPS with
4 x 2TB drives than with a single 8TB drive. So if you have a
performance target, your money might be better spent on smaller
drives.
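Back-of-envelope, assuming ~100 random IOPS and ~150 MB/s per
7200 rpm drive (ballpark figures only):

    echo "4 x 2TB: $((4 * 100)) IOPS aggregate"
    echo "1 x 8TB: $((1 * 100)) IOPS aggregate"
    # best-case time to fill one 8TB drive at full sequential speed,
    # ignoring backfill throttling and competing client IO:
    echo "~$((8 * 1000 * 1000 / 150 / 3600)) hours to fill one 8TB drive"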
Please note that I'm very very new to operating Ceph, so am
working to understand these details - and I'm certain my
understanding is still a bit ... simplistic ... :-)
If a drive fails, wouldn't the replica copies on that drive
be replicated across "other OSD" devices when the appropriate
timers/triggers cause those data migration/re-replications to
kick off?
Yes.
Subsequently, you add a new OSD and bring it online.
With 720 disks, "subsequently" might be replaced by
"concurrently" - and then see above. Let's say the average
practical MTBF is 3 years: you will get one failure every day
and a half, with, on occasion, several failures in rapid
succession during the same day. Will you still be able to time
your OSD creation to avoid contention while repair is going on?
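Roughly, the failure arithmetic, plus the usual knobs for keeping
recovery traffic polite (the values are illustrative, not
recommendations):

    # 720 disks with a ~3 year practical MTBF:
    echo "~$((3 * 365 * 24 / 720)) hours between failures on average"
    # throttle backfill/recovery so repair doesn't starve client IO:
    ceph tell osd.* injectargs \
        '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'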
It's now ready to be used - and depending on your CRUSH map
policies, will "start to fill" - yes, this process ... to "fill
an entire 8TB drive" certainly would take a while, but that
shouldn't block or degrade the entire cluster - since we have a
replica copy set of 3 ... there are "two other replica copies"
to service read requests.
In fact not by default: each read always goes to the primary
OSD, so your new disk is a bottleneck (unless you configure it
initially to prevent it from becoming primary).
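For completeness, keeping a freshly added OSD out of the primary
role looks roughly like this (assuming a Firefly-or-later
cluster; osd.42 is a placeholder):

    # older releases need this allowed in ceph.conf on the monitors:
    #   [mon]
    #   mon osd allow primary affinity = true
    ceph osd primary-affinity osd.42 0    # never selected as primary
    # later, once backfill to it has finished:
    ceph osd primary-affinity osd.42 1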
If a replica copy is updated while it is still in flight to that
new OSD as part of rebalancing, yes, I can see where there would
be latency/delays/issues. As the drive is rebalanced, is it
marked "available" for new writes? That would certainly cause
significant latency for a new write request - I'd hope that
during a "rebalance" operation, that OSD disk is not marked
available for new writes.
It is, for every PG already placed on it. I'm not sure what
happens to a PG currently being moved, but I assume writes are
streamed to it after the initial sync (concurrent writes would
probably not make sense).
Which brings me to a question ...
Are there any good documents out there that detail
(preferably via a flow chart/diagram or similar) how the various
failure/recovery scenarios cause "change" or "impact" to the
cluster? I've seen very little in this regard, but I may be
digging in the wrong places?
Thank you for any follow up information that helps illuminate
my understanding (or lack thereof) how Ceph and failure/recovery
situations should impact a cluster...
~~shane