Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
- 2 x E5-2660 8-core Xeons
- 64GB DDR3-1600 RAM
- 10Gb ceph-internal network (SFP+)
- LSI 9210-8i controller (IT mode)
- 4 x 8TB OSD HDDs, a mix of two models:
  - Seagate ST8000DM002
  - HGST HDN728080ALE604
- Mount options = xfs (rw,noatime,attr2,inode64,noquota) (example fstab entry below)
- 1 x Intel 200GB DC S3700 SSD journal
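For reference, one of those data-disk mounts would look roughly like the line below in /etc/fstab. The device name and mount point are hypothetical (ceph-disk normally mounts OSD data partitions by UUID); this is only to show the mount options in context:

  /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,attr2,inode64,noquota  0  2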
Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a replication level of 2. We’re using rados bench to shotgun a lot of objects into our test pools, specifically following these two steps:
ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
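For completeness, a minimal sketch of how the object count can be logged while the bench runs; the log file name and one-minute interval are arbitrary, and any per-pool stats command would do:

  while true; do
      # rados df prints one row per pool, including its object count
      echo "$(date +%s) $(rados df | grep poolofhopes)" >> poolofhopes-objects.log
      sleep 60
  done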
We leave the bench running for days at a time and watch the "objects in cluster" count. We see performance that starts off decent and degrades over time. There’s a very brief initial surge in write performance, after which things settle into the downward-trending pattern:
- 1st hour: 2 million objects/hour
- 20th hour: 1.9 million objects/hour
- 40th hour: 1.7 million objects/hour
This performance is not encouraging for us. We need to be writing 40 million objects per day (20 million files at 2x replication), which works out to roughly 1.7 million objects per hour. The rates we’re seeing at the 40th hour of our bench would be sufficient to achieve that. Those write rates are still falling, though, and we’re only at a fraction of the number of objects in cluster that we need to handle. So the trend in performance suggests we shouldn’t count on having the write performance we need for much longer.
If we repeat the process of creating a new pool and running the bench, the same pattern holds: good initial performance that gradually degrades.
[caption: 90 million objects written to a brand-new, pre-split pool (poolofhopes). There are already 330 million objects on the cluster in other pools.]
Our working theory is that the degradation over time may be related to inode or dentry lookups that miss cache and lead to additional disk reads and seek activity. There’s a suggestion that filestore directory splitting may exacerbate that problem, as additional/longer disk seeks occur depending on which XFS allocation group things land in. We have found pre-split pools useful in one major way: they avoid the periods of near-zero write performance that we have put down to the active splitting of directories (the "thundering herd" effect). The overall downward curve seems to remain the same whether we pre-split or not.
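As far as we understand it, the filestore settings below are what govern splitting: a subdirectory is split once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files, and a negative merge threshold disables merging so a pool created with expected_num_objects stays pre-split. The values here are illustrative, not a recommendation:

  [osd]
  # illustrative values: split at 2 * 10 * 16 = 320 files per subdirectory,
  # and never merge subdirectories back together
  filestore merge threshold = -10
  filestore split multiple = 2

A quick way to sanity-check the cache-miss side of the theory on an OSD node (standard Linux tools, nothing Ceph-specific):

  slabtop -o | grep -E 'dentry|xfs_inode'   # how large the dentry/inode slabs are
  sysctl vm.vfs_cache_pressure              # values >100 reclaim dentries/inodes more aggressively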
The thundering herd seems to be kept in check by an appropriate pre-split. Bluestore may or may not be a solution, but given our fairly tight timeline, uncertainty about its stability doesn't recommend it to us. Right now our big question is “how can we avoid the gradual degradation in write performance over time?”.
Thank you, Patrick