A general rule of thumb for separate
journal devices is to use 1 SSD for every 4 OSDs. Since SSDs have
no seek penalty, 4 partitions are fine. Going much above the 1:4
ratio can saturate the SSD.
On your SAS journal device, splitting it into 9 partitions forces a
head seek for every journal write (assuming all 9 OSDs are writing).
Try using the SAS device as a single partition holding all 9
journals instead. That gives you a chance to get sequential IO.
For an anecdote of this effect, check out http://thedailywtf.com/Articles/The-Certified-DBA.aspx.
Even then, I suspect you'll saturate the RAID0'ed SAS devices, as
they generally have lower sequential throughput than SSDs.
I assume that you're aware that by using RAID0 for the journals, a
single SAS disk failure will take down all 9 OSDs.
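To sketch what I mean by a single partition with 9 journals (the
device name, mount point and journal size below are placeholders,
not your actual setup): format the RAID0 volume with one
filesystem, mount it once, and point each OSD's journal at a file
on it, e.g.

    # one filesystem on the RAID0'ed SAS volume (placeholder device)
    mkfs.xfs /dev/md0
    mount /dev/md0 /srv/journals

    # then in ceph.conf, per OSD (repeat for osd.1 .. osd.8):
    [osd.0]
        osd journal = /srv/journals/osd.0.journal
        # journal size in MB
        osd journal size = 10240

That way the journal writes at least have a chance to stay
sequential instead of seeking between 9 fixed partition offsets.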
On 11/29/13 05:58, nicolasc wrote:
Hi James,
Unfortunately, SSDs are out of budget. Currently there are 2 SAS
disks in RAID0 on each node, split into 9 partitions: one for each
OSD journal on the node. I benchmarked the RAID0 volumes at around
500MB/s in sequential sustained write, so that's not bad — maybe
access latency is also an issue?
This journal problem is a bit of wizardry to me. I even had weird
intermittent issues with OSDs not starting because the journal was
not found, so please do not hesitate to suggest a better journal
setup.
I will try to look into this issue of device cache flush. Do you
have a tracker link for the bug?
Last question (for everyone): which of the journal config and the
striping config has, in your opinion, the most influence on my
"performance decreases with small blocks" problem?
Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
On 11/29/2013 02:06 PM, James Pearce wrote:
Did you try moving the journals to
separate SSDs?
It was recently discovered that, due to a kernel bug/design,
journal writes are translated into device cache flush commands.
With that in mind, I also wonder whether implementing the
workaround would improve performance in the case where the journal
and OSD are on the same physical drive, since the system is
presumably hitting spindle latency for every write at the moment?
On 2013-11-29 12:46, nicolasc wrote:
Hi everyone,
I am currently testing a use case with large rbd images (several
TB), each containing an XFS filesystem, which I mount on local
clients. I have been testing the throughput of writes to a single
file in the XFS mount, using "dd oflag=direct", for various block
sizes.
With a default config, the "XFS writes with dd" show very good
performance for 1GB blocks, but it drops down to average HDD
performance for 4MB blocks, and to only a few MB/s for 4kB blocks.
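To be concrete, the tests look roughly like this, writing 1GiB in
total at each block size (the mount point and file name below are
placeholders):

    dd if=/dev/zero of=/mnt/rbd/testfile bs=1G count=1      oflag=direct
    dd if=/dev/zero of=/mnt/rbd/testfile bs=4M count=256    oflag=direct
    dd if=/dev/zero of=/mnt/rbd/testfile bs=4k count=262144 oflag=direct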
Changing the XFS block size did not help (the maximum block size
is 256kB in XFS anyway), so I tried fancy striping.
First, using 4kB rados objects to store the 4kB stripes was
awful,
because rados does not like small objects. Then, I used fancy
striping
to store several 4kB stripes into a single 4MB object, but it
hardly
improved the performance with 4kB blocks, while drastically
degrading
the performance for large blocks.
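For reference, this kind of striping layout has to be set when the
image is created; it looks roughly like the following (the image
name, size and exact stripe parameters here are only an
illustration, not necessarily the values I used):

    # 4MB objects (order 22), 4kB stripe unit, striped across 16 objects
    rbd create stripetest --size 10240 --image-format 2 \
        --order 22 --stripe-unit 4096 --stripe-count 16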
Given my use-case, the block size of writes cannot exceed 4MB. I
do not know a lot of applications that write to disk in 1GB
blocks.
Currently, on a 6-node, 54-OSD cluster, with journals on dedicated
SAS disks and a dedicated 10GbE uplink, I am getting performance
equivalent to a basic local disk.
So I am wondering: is it possible to have good performance with
XFS on rbd images, using a reasonable block size? In case you think
the answer is "yes", I would greatly appreciate it if you could
give me a clue about the striping magic involved.
Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com