Re: Preconditioning an RBD image

Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> · Thu, 23 Mar 2017 22:44:45 +0000



    Hi Nick,

      
      I didn't test with a colocated journal. I figure ceph knows what
      it's doing with the journal device, and it has no filesystem, so
      there's no xfs journal, file metadata, etc. generating small
      random sync writes that would be worth caching.

      
      I tested three layouts: bcache and journals both on some SAS SSDs
      (rados bench was ok, but real clients got really low bandwidth),
      journals on NVMe (P3700) with bcache on some SAS SSDs, and both on
      the NVMe. I think the performance is slightly better with it all
      on the NVMe (hdds being the bottleneck... tests in VMs show the
      same, but rados bench looks a tiny bit better). The bcache
      partition is shared by the osds, and the journals are separate
      partitions.

      
      I'm not sure it's really triple overhead. bcache doesn't write all
      your data to the writeback cache... just the small sync writes,
      and only as long as the cache doesn't fill up or get too busy
      (based on await). And the bcache device flushes very slowly to the
      hdd, not overloading it (unless the cache is full). And when I
      make it flush faster, it seems to finish more quickly than the
      same writes would without bcache (like it does it more
      sequentially, or without sync; but I didn't really measure... I
      just looked at, e.g., 400MB of dirty data, and then it flushes in
      20 seconds). And if you overwrite the same data a few times (like
      a filesystem journal, or some fs metadata), you'd think it
      wouldn't have to write it more than once to the hdd in the end.
      Maybe that means something small like leveldb isn't written often
      to the hdd.
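
The flush observation above (about 400MB of dirty data gone in 20 seconds) works out to roughly 20 MB/s. A minimal sketch of that arithmetic follows. It assumes the dirty amount is read as a human-readable string such as "400.0M" with 1024-based suffixes, which is how bcache's sysfs `dirty_data` file is usually shown; the function names are hypothetical and only for illustration.

```python
# Sketch only: estimate how fast bcache flushed dirty data between two samples.
# Assumption: sizes look like "400.0M" / "1.2G" / "0" with 1024-based suffixes.
UNITS = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a human-readable size string to bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(float(text))


def flush_rate_mib_s(dirty_before, dirty_after, seconds):
    """Average flush rate in MiB/s over the sampling interval."""
    flushed = parse_size(dirty_before) - parse_size(dirty_after)
    return flushed / seconds / 1024**2


if __name__ == "__main__":
    # The figures from the message: 400MB dirty, flushed in 20 seconds.
    print(flush_rate_mib_s("400.0M", "0", 20))
```

On a live system the two samples would come from reading the `dirty_data` file twice, some seconds apart; that part is left out here because device names differ per host.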

      
      And it's not just a write cache. The default is 10% writeback,
      which means the rest is read cache. And it keeps read stats so it
      knows which data is the most popular. My nodes right now show
      33-44% cache hits (cache is too small I think). And bcache
      reorders writes on the cache device so they are sequential, and
      can write to both at the same time so it can actually go faster
      than a pure ssd in specific situations (mixed sequential and
      random, only until the cache fills).
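
For reference, a hit percentage like the 33-44% quoted above can be derived from bcache's hit and miss counters. The sketch below assumes the usual sysfs layout (`/sys/block/<dev>/bcache/stats_total/cache_hits` and `cache_misses`) and a device called `bcache0`; both the device name and the helper names are assumptions, not taken from the message.

```python
# Sketch only: compute a bcache read-cache hit percentage from its counters.
from pathlib import Path


def hit_ratio(hits, misses):
    """Return cache hits as a percentage of all cache lookups."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def read_bcache_counters(dev="bcache0", window="stats_total"):
    """Read (hits, misses) from sysfs; dev and window are assumed defaults."""
    base = Path("/sys/block") / dev / "bcache" / window
    hits = int((base / "cache_hits").read_text())
    misses = int((base / "cache_misses").read_text())
    return hits, misses


if __name__ == "__main__":
    stats_dir = Path("/sys/block/bcache0/bcache/stats_total")
    if stats_dir.is_dir():
        print(hit_ratio(*read_bcache_counters()))
    else:
        print("no bcache0 device found on this host")
```

Swapping `window` for `stats_five_minute`, `stats_hour`, or `stats_day` would give a more recent view than the lifetime total.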

      
      I think I owe you another graph later, when I put all my VMs on
      there (I've probably finally fixed my rbd snapshot hanging VM
      issue ...worked around it by disabling
      exclusive-lock,object-map,fast-diff). The bandwidth hungry ones
      (which hung the most often) were moved shortly after the bcache
      change, so it's hard to explain how it affects the graphs. It's
      easier to see with iostat while changing it, with a mix of cached
      and uncached devices, than with ganglia afterwards.

      
      Peter

      
      On 03/23/17 21:18, Nick Fisk wrote:

    
        Hi Peter,

        Interesting graph. Out of interest, when you use bcache, do you
        then just leave the journal collocated on the combined bcache
        device and rely on the writeback to provide journal performance,
        or do you still create a separate partition on whatever SSD/NVMe
        you use, effectively giving triple write overhead?
         
        Nick
         
        
              From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Peter Maloney
              Sent: 22 March 2017 10:06
              To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>
              Subject: Re: Preconditioning an RBD image
            
          
            Does iostat (e.g. iostat -xmy 1
              /dev/sd[a-z]) show high %util or await during these
              problems?

              
              Ceph filestore requires lots of metadata writing
              (directory splitting for example), xattrs, leveldb, etc.
              These are small sync writes that HDDs are bad at (100-300
              iops) and SSDs are good at (a cheapo one would be 6k iops,
              and a not so crazy DC/NVMe one would be 20-200k iops and
              more). So in theory, these things are mitigated by using
              an SSD, for example with bcache on your osd device. You
              could also try something like that, at least to test.
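
To put the IOPS figures above in perspective, here is a back-of-the-envelope sketch. The device classes and IOPS values are the rough numbers from the paragraph above; the burst size of 60,000 writes is an arbitrary illustrative assumption, and the calculation ignores queueing and write merging.

```python
# Sketch only: time to absorb a burst of small sync writes at a given IOPS.
# IOPS values are the rough figures quoted in the message, not measurements.
DEVICE_IOPS = {
    "hdd (low)": 100,
    "hdd (high)": 300,
    "cheap ssd": 6_000,
    "dc/nvme (low)": 20_000,
    "dc/nvme (high)": 200_000,
}


def seconds_for_writes(n_writes, iops):
    """Seconds needed to complete n_writes if the device sustains iops."""
    return n_writes / iops


if __name__ == "__main__":
    burst = 60_000  # hypothetical number of small sync writes
    for name, iops in DEVICE_IOPS.items():
        print(f"{name}: {seconds_for_writes(burst, iops):.1f} s")
```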

              
              I have tested with bcache in writeback mode and found
              hugely obvious differences in iostat. For example, here's
              my before and after (the heavier load around week 49-50 or
              so is due to converting, and the highest spikes are the
              scrub infinite loop bug in 10.2.3):

              
              http://www.brockmann-consult.de/ganglia/graph.php?cs=10%2F25%2F2016+10%3A27&ce=03%2F09%2F2017+17%3A26&z=xlarge&hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&x=100

              
              But when you share a cache device, you get a single point
              of failure (and bcache, like all software, can be assumed
              to have bugs too). And I recommend vanilla kernel 4.9 or
              later which has many bcache fixes, or Ubuntu's 4.4 kernel
              which has the specific fixes I checked for.

              
              On 03/21/17 23:22, Alex Gorbachev wrote:
          
          
            I wanted to share a recent
              experience, in which a few RBD volumes, formatted as XFS
              and exported via Ubuntu nfs-kernel-server, performed
              poorly and even generated "out of space" warnings on a
              nearly empty filesystem.  I tried a variety of hacks and
              fixes to no effect, until things started magically working
              just after some dd write testing.
            
               
              The only explanation I can come up
                with is that preconditioning, or thickening, the images
                with this benchmarking is what caused the improvement.
            
            
              Ceph is Hammer 0.94.7 running on
                Ubuntu 14.04, kernel 4.10 on OSD nodes and 4.4 on NFS
                nodes.
            
            
              Regards,
            
            
              Alex
            
            
              Storcium
            
            
              
            _______________________________________________
            ceph-users mailing list
            ceph-users@xxxxxxxxxxxxxx
            http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
          
           
          -- 
           
          --------------------------------------------
          Peter Maloney
          Brockmann Consult
          Max-Planck-Str. 2
          21502 Geesthacht
          Germany
          Tel: +49 4152 889 300
          Fax: +49 4152 889 333
          E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
          Internet: http://www.brockmann-consult.de
          --------------------------------------------
        
      