Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

Hi,

On 10/01/2017 at 19:32, Brian Andrus wrote:
[...]


I think the main point I'm trying to address is - as long as the backing OSD isn't egregiously handling large amounts of writes and it has a good journal in front of it (that properly handles O_DSYNC [not D_SYNC as Sebastien's article states]), it is unlikely inconsistencies will occur upon a crash and subsequent restart.

I don't see how you can assume it is "unlikely". If you need SSDs you are probably handling relatively large amounts of I/O (so large amounts of writes aren't unlikely), otherwise you would have used cheap 7200rpm or even slower drives.
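
For what it's worth, a drive's O_DSYNC behaviour is easy to measure. Below is a minimal sketch (assuming Linux, Python 3, and a hypothetical scratch path on the SSD under test) of the kind of test the usual fio-based journal benchmarks boil down to: a DC-class SSD with power-loss protection sustains thousands of these per second, while a consumer model either collapses to a few hundred or posts suspiciously high numbers because it acknowledges writes from volatile cache.

    # Minimal sketch, assuming Linux + Python 3. TEST_PATH is a hypothetical
    # scratch file on the SSD under test -- adjust before running.
    import os, time

    TEST_PATH = "/mnt/ssd-under-test/odsync-test.dat"
    WRITE_SIZE = 4096          # 4 KiB, roughly the granularity of journal writes
    COUNT = 1000

    fd = os.open(TEST_PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    os.ftruncate(fd, COUNT * WRITE_SIZE)   # preallocate so mostly data syncs are timed
    buf = b"\0" * WRITE_SIZE

    start = time.monotonic()
    for i in range(COUNT):
        os.pwrite(fd, buf, i * WRITE_SIZE)  # each call returns only after the sync
    elapsed = time.monotonic() - start
    os.close(fd)

    print("%.0f O_DSYNC writes/s, %.2f ms average latency"
          % (COUNT / elapsed, 1000 * elapsed / COUNT))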

Remember that in the default configuration, if any 3 OSDs fail at the same time you risk losing data. For <30 OSDs and size=3 this is highly probable, as there are only a few thousand possible combinations of 3 OSDs (and you typically have a thousand or two PGs picking OSDs in a more or less random pattern).
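
To put rough numbers on this (a crude model assuming 30 OSDs, 2048 PGs, and treating PG placements as independent uniform picks, which CRUSH is not exactly, so take it as an order-of-magnitude illustration only):

    # Back-of-the-envelope check: how likely is it that a random set of 3
    # failed OSDs covers all 3 copies of at least one PG?
    from math import comb

    osds, pgs = 30, 2048
    triples = comb(osds, 3)                 # 4060 possible 3-OSD failure sets
    p_no_pg_hit = (1 - 1 / triples) ** pgs  # chance the failed triple maps to no PG
    print("%d possible triples; P(some PG loses all 3 copies) ~ %.0f%%"
          % (triples, 100 * (1 - p_no_pg_hit)))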

With SSDs that don't handle write barriers properly, I wouldn't bet on recovering the filesystems of all OSDs after a cluster-wide power loss that shuts down all the SSDs at the same time... In fact, since the hardware lies about which data is actually stored, the filesystem might not even detect the crash properly and might replay its own journal on top of outdated data, leading to unexpected results.
So losing data is a real possibility, and testing for it is almost impossible (you'd have to reproduce all the different access patterns your Ceph cluster could experience at the time of a power loss and trigger a power loss in each case).
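
The kind of test meant here is the classic acknowledged-write check, in the spirit of tools like diskchecker.pl: log every record only after its synchronous write returns, pull the plug, then verify the log against the disk. A rough sketch with hypothetical paths, just to show the shape of it; note that each run still only covers one access pattern, which is exactly the problem.

    # Assumptions: DATA_PATH is a scratch file on the SSD under test, LOG_PATH
    # lives on a *different*, trusted machine or disk that keeps power.
    # Run writer(), cut power to the SSD host mid-run, reboot, run verifier():
    # every logged record was acknowledged as durable by an O_DSYNC write and
    # must still be intact, otherwise the drive discards "committed" data.
    import os, struct, zlib

    DATA_PATH = "/mnt/ssd-under-test/powerloss.dat"
    LOG_PATH = "/mnt/trusted/acknowledged.log"
    RECORD = 4096

    def writer(count=1000000):
        fd = os.open(DATA_PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
        log = open(LOG_PATH, "a", buffering=1)      # line-buffered ack log
        for seq in range(count):
            payload = struct.pack("<Q", seq) * (RECORD // 8)
            os.pwrite(fd, payload, seq * RECORD)    # returns only once "durable"
            log.write("%d %08x\n" % (seq, zlib.crc32(payload)))

    def verifier():
        fd = os.open(DATA_PATH, os.O_RDONLY)
        for line in open(LOG_PATH):
            seq, crc = line.split()
            block = os.pread(fd, RECORD, int(seq) * RECORD)
            if zlib.crc32(block) != int(crc, 16):
                print("LOST acknowledged record", seq)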


Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain the consistency of the backing filesystem. They *might* restrict the access patterns to the filesystems in such a way that running fsck on them after a "let's throw away committed data" crash has better chances of restoring enough data, but if so it's only a happy coincidence (and you would have to run these fscks *manually*, as the filesystem can't detect the inconsistencies by itself).

that is what they are there for.

No. They are there for Ceph's internal consistency, not the consistency of the filesystem backing the filestore. Ceph relies on both the journal and a filesystem that maintains its own internal consistency and supports syncfs; if either the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same time on a size=3 pool, you enter "probable data loss" territory.
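
To make the division of labour concrete, here is a toy sketch (my own illustration, not Ceph code) of the ordering the journal buys you, and of why it assumes the underlying devices keep their promises:

    # A transaction is made durable in the journal (O_DSYNC) before it is
    # acknowledged, then applied to the backing filesystem, whose own
    # durability is deferred to a periodic sync. If either the journal device
    # or the filesystem silently drops writes it claimed to have committed,
    # replay starts from a wrong state and the OSD is damaged.
    import json, os

    class ToyFilestore:
        def __init__(self, journal_path, data_dir):
            self.journal_fd = os.open(journal_path,
                                      os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC,
                                      0o600)
            self.data_dir = data_dir

        def submit(self, obj, offset, data):
            # 1. Journal first: durable before the write is acknowledged.
            entry = json.dumps({"obj": obj, "off": offset, "data": data.hex()})
            os.write(self.journal_fd, (entry + "\n").encode())
            # 2. Apply to the backing filesystem; durability deferred until sync().
            path = os.path.join(self.data_dir, obj)
            mode = "r+b" if os.path.exists(path) else "wb"
            with open(path, mode) as f:
                f.seek(offset)
                f.write(data)

        def sync(self):
            # 3. Flush the filestore filesystem (os.sync() as a stand-in for
            #    syncfs()); only then could journal entries be trimmed.
            os.sync()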

There is a situation where "consumer-grade" SSDs could be used as OSDs. While not ideal, it can and has been done before, and may be preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the system about write completions, they usually don't handle large amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

Unless a model not rated for sync writes has survived dozens of power losses on clusters under heavy load, without any problem detected in the following week (think deep-scrub), using it is playing Russian roulette with your data.

AFAIK there have only been reports of data loss and/or heavy maintenance down the line when people tried to use consumer SSDs (admittedly mainly for journals). I've yet to see a long-running, robust cluster built with consumer SSDs.

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
