Re: Impact of fancy striping

nicolasc <nicolas.canceill@xxxxxxxxxxx> · Thu, 12 Dec 2013 17:23:53 +0100

    Hi James, Robert, Craig,

    Thank you for those informative answers! You all pointed out
    interesting issues.

    I know losing 1 SAS disk in RAID0 means losing all journals, but
    this is for testing so I do not care.

    I do not think sequential write speed to the RAID0 array is the
    bottleneck (I benchmarked it at more than 500MB/s). However, I
    failed to realize that the synchronous writes of several OSDs would
    become random instead of sequential, thank you for explaining that.

    I want to try this setup with several journals on a single partition
    (to mitigate seek time), and I also want to try replacing my 9 OSDs
    (per node) by a big RAID0 array of 9 disks — leaving replication to
    Ceph. But first I wanted to get an idea of SSD performance, so I
    created a 1GB RAMdisk for every OSD journal.

    Shockingly, even with every journal on a dedicated RAMdisk, I still
    witnessed less than 100MB/s sequential writes with 4MB blocks. This
    is writing to an RBD image, independently of the format, the size,
    the striping pattern, or whether the image is mounted (with XFS on
    it) or directly accessed.
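A quick way to read the 100MB/s figure: if the test keeps only one write in flight at a time (as "dd oflag=direct" from the original post does), throughput is just block size divided by per-write latency. The sketch below is back-of-envelope arithmetic, not a measurement; the function names are made up for illustration, and it assumes decimal megabytes as dd reports them.

```python
# With a single outstanding write, throughput = block size / latency.
MB = 1000 * 1000  # decimal megabytes (assumption: figures come from dd)


def implied_latency_ms(block_bytes: int, throughput_mb_s: float) -> float:
    """Per-write latency implied by a queue-depth-1 throughput figure."""
    return block_bytes / (throughput_mb_s * MB) * 1000.0


def throughput_mb_s(block_bytes: int, latency_ms: float) -> float:
    """Throughput reachable at queue depth 1 for a given per-write latency."""
    return block_bytes / (latency_ms / 1000.0) / MB


if __name__ == "__main__":
    block = 4 * 1024 * 1024  # 4MB blocks, as in the test above
    print(f"{implied_latency_ms(block, 100.0):.1f} ms per 4MB write")
```

Under those assumptions, 100MB/s with 4MB blocks corresponds to about 42 ms per write, so anything below 100MB/s means each 4MB write takes longer than that end to end, whatever the journal device is.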

    So my journal setup may well be unsatisfactory, but the bottleneck
    seems to be somewhere else. Any ideas about striping? Or maybe
    the pool/PG config? (I blindly followed the PG ratios indicated in
    the docs.)
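For reference, the PG rule of thumb from the Ceph placement-group documentation of that period is roughly (number of OSDs x 100) / replica count, rounded up to the next power of two. The sketch below applies it to the 54-OSD cluster from this thread. The replica count is not stated anywhere in the thread, so both values used here are assumptions, and the function name is hypothetical.

```python
def suggested_pg_count(num_osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb total PG count, rounded up to a power of two."""
    raw = num_osds * pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    # 54 OSDs as in this thread; replica counts are assumed, not from the thread.
    for replicas in (2, 3):
        print(replicas, "replicas ->", suggested_pg_count(54, replicas), "PGs")
```

This only gives a starting point for the total across pools; it says nothing about whether PG count is the bottleneck here.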

    Thank you all for your help. Best regards,

    Nicolas Canceill

    Scalable Storage Systems

    SURFsara (Amsterdam, NL)

    On 12/06/2013 07:31 PM, Robert van Leeuwen wrote:

      If I understand correctly, you have one SAS disk as a journal for multiple OSDs.
If you do small synchronous writes, it will become an IO bottleneck pretty quickly:
with multiple journals on the same disk, the workload is no longer sequential writes to one journal but 4k writes to x journals, which makes it fully random.
I would expect a performance of 100 to 200 IOPS max.
Doing an iostat -x or atop should show this bottleneck immediately.
This is also the reason to go with SSDs: they have reasonable random IO performance.
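The 100 to 200 IOPS estimate can be reproduced with simple arithmetic: each random write on a spinning disk pays an average seek plus, on average, half a rotation. The sketch below is only an illustration; the seek time and spindle speed are hypothetical values for a fast SAS drive, not figures from this thread, and the function names are made up.

```python
def random_iops(avg_seek_ms: float, rpm: int) -> float:
    """Approximate random IOPS of one spinning disk (seek + half a rotation)."""
    rotational_ms = 0.5 * 60_000.0 / rpm  # average wait is half a revolution
    return 1000.0 / (avg_seek_ms + rotational_ms)


def throughput_mb_s(iops: float, block_bytes: int) -> float:
    """Throughput in decimal MB/s for a given IOPS rate and block size."""
    return iops * block_bytes / 1e6


if __name__ == "__main__":
    # Hypothetical 15k RPM drive with a 4 ms average seek.
    iops = random_iops(avg_seek_ms=4.0, rpm=15000)
    print(f"{iops:.0f} IOPS, {throughput_mb_s(iops, 4096):.2f} MB/s at 4k writes")
```

With those assumed numbers the result is roughly 167 IOPS, or well under 1 MB/s of 4k writes per spindle, which is why a disk that streams hundreds of MB/s sequentially can still be a bottleneck for interleaved journal writes.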

Cheers,
Robert van Leeuwen

        On 6 dec. 2013, at 17:05, "nicolasc" <nicolas.canceill@xxxxxxxxxxx> wrote:

Hi James,

Thank you for this clarification. I am quite aware of that, which is why the journals are on SAS disks in RAID0 (SSDs out of scope).

I still have trouble believing that fast-but-not-super-fast journals are the main reason for the poor performance observed. Maybe I am mistaken?

Best regards,

Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)

On 12/03/2013 03:01 PM, James Pearce wrote:

            I would really appreciate it if someone could:
- explain why the journal setup is way more important than striping settings;

          I'm not sure if it's what you're asking, but any write must be physically written to the journal before the operation is acknowledged.  So the overall cluster performance (or rather write latency) is always governed by the speed of those journals.  Data is then gathered up into (hopefully) larger blocks and committed to OSDs later.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

    On 12/11/2013 12:51 AM, Craig Lewis wrote:

      A general rule of thumb for separate
        journal devices is to use 1 SSD for every 4 OSDs.  Since SSDs
        have no seek penalty, 4 partitions are fine.  Going much above
        the 1:4 ratio can saturate the SSD.

        On your SAS journal device, by creating 9 partitions, you're
        forcing head seeks for every journal write (assuming all 9 OSDs
        are writing).  Try using the SAS device with a single partition
        and 9 journals.  That gives you a chance to get sequential IO.
        For an anecdote of this effect, check out http://thedailywtf.com/Articles/The-Certified-DBA.aspx.

        Even then, I suspect you'll saturate the RAID0'ed SAS devices as
        they generally have less sequential IO than SSDs.

        I assume that you're aware that by using RAID0 for the journals,
        a single SAS disk failure will take down all 9 OSDs.

                Craig Lewis

                 Senior Systems Engineer

                  Office +1.714.602.1309

                  Email clewis@xxxxxxxxxxxxxxxxxx

        On 11/29/13 05:58 , nicolasc wrote:

        Hi James,

        Unfortunately, SSDs are out of budget. Currently there are 2 SAS
        disks in RAID0 on each node, split into 9 partitions: one for
        each OSD journal on the node. I benchmarked the RAID0 volumes at
        around 500MB/s in sequential sustained write, so that's not bad
        — maybe access latency is also an issue? 

        This journal problem is a bit of wizardry to me, I even had
        weird intermittent issues with OSDs not starting because the
        journal was not found, so please do not hesitate to suggest a
        better journal setup. 

        I will try to look into this issue of device cache flush. Do you
        have a tracker link for the bug? 

        Last question (for everyone): which of the journal config and
        the striping config has, in your opinion, the most influence
        on my "performance decreases with small blocks" problem?

        Best regards, 

        Nicolas Canceill 

        Scalable Storage Systems 

        SURFsara (Amsterdam, NL) 

        On 11/29/2013 02:06 PM, James Pearce wrote: 

        Did you try moving the journals to
          separate SSDs? 

          It was recently discovered that, due to a kernel bug/design,
          journal writes are translated into device cache flush
          commands. Thinking about that, I also wonder whether
          implementing the workaround would improve performance when
          journal and OSD are on the same physical drive, since
          currently the system is presumably hitting spindle latency
          for every write?

          On 2013-11-29 12:46, nicolasc wrote: 

          Hi everyone,

            I am currently testing a use case with large RBD images
            (several TB), each containing an XFS filesystem, which I
            mount on local clients. I have been testing the throughput
            of writes to a single file in the XFS mount, using
            "dd oflag=direct", for various block sizes.
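For readers who want to repeat this kind of block-size sweep without dd, here is a minimal Python sketch. It is not the method used in this thread: it opens the file with O_DSYNC (where the platform provides it) instead of O_DIRECT, so the page cache is still involved and absolute numbers will differ from "dd oflag=direct". The function name, block sizes, and total size are all illustrative choices.

```python
import os
import tempfile
import time


def write_throughput(path: str, block_size: int, total_bytes: int) -> float:
    """Write total_bytes to path in block_size chunks; return decimal MB/s."""
    # O_DSYNC makes each write wait for stable storage; falls back to 0 if absent.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_DSYNC", 0)
    buf = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    written = 0
    start = time.perf_counter()
    try:
        while written < total_bytes:
            written += os.write(fd, buf)
        os.fsync(fd)  # flush anything still cached before stopping the clock
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bench.bin")
        for block_size in (4 * 1024, 64 * 1024, 1024 * 1024):
            rate = write_throughput(target, block_size, 1024 * 1024)
            print(f"{block_size:>8} B blocks: {rate:8.1f} MB/s")
```

To test an RBD-backed XFS mount, point the path at a file on that mount and raise the total size well beyond any cache; the tiny defaults above only exercise the code.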

            With a default config, the "XFS writes with dd" show very
            good performance for 1GB blocks, but it drops to average
            HDD performance for 4MB blocks, and to only a few MB/s for
            4kB blocks. Changing the XFS block size did not help (max
            block size is 256kB in XFS anyway), so I tried fancy
            striping.

            First, using 4kB RADOS objects to store the 4kB stripes
            was awful, because RADOS does not like small objects. Then
            I used fancy striping to store several 4kB stripes in a
            single 4MB object, but it hardly improved the performance
            with 4kB blocks, while drastically degrading the
            performance for large blocks.
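To make the striping layout concrete, the sketch below maps a byte offset in an image to an object number and an offset inside that object, following the stripe unit / stripe count / object size scheme described in the Ceph striping documentation. It is an illustration only: the function name is made up, and the stripe count of 4 in the example is an assumption, since the post does not say which value was used.

```python
def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Return (object_number, offset_in_object) for a byte offset in the image."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # index of the stripe unit
    stripe_no = block_no // stripe_count      # which row of stripe units
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    object_off = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, object_off


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    # 4kB stripe units in 4MB objects; stripe_count=4 is an assumed value.
    for off in (0, 4 * KB, 16 * KB, 16 * MB):
        print(off, "->", locate(off, 4 * KB, 4, 4 * MB))
```

With these example parameters, one sequential 4MB write is cut into 1024 pieces of 4kB spread over 4 objects, which is at least consistent with the drop in large-block performance reported above.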

            Given my use case, the block size of writes cannot exceed
            4MB. I do not know a lot of applications that write to
            disk in 1GB blocks.

            Currently, on a 6-node, 54-OSD cluster, with journals on
            dedicated SAS disks and a dedicated 10GbE uplink, I am
            getting performance equivalent to a basic local disk.

            So I am wondering: is it possible to get good performance
            with XFS on RBD images, using a reasonable block size?

            In case you think the answer is "yes", I would greatly
            appreciate it if you could give me a clue about the
            striping magic involved.

            Best regards, 

            Nicolas Canceill 

            Scalable Storage Systems 

            SURFsara (Amsterdam, NL) 
