Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

I use 2U servers with 9x 3.5" spinning disks in each. This has scaled well for me, in both performance and budget.

I may add 3 more spinning disks to each server later if I need to maximize storage, or 3 SSDs for journals/a cache tier if I need better performance.

Another consideration is failure domain. If a server crashes, how much of your cluster goes down? Some good advice I've read on this forum is that no single OSD server should be more than 10% of the cluster.
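
A quick back-of-the-envelope check of that rule of thumb (hypothetical layouts, just for illustration):

    # Rough failure-domain check: what fraction of the cluster sits on one node?
    # The node/OSD counts below are made-up examples; plug in your own layout.

    def node_fraction(osds_per_node, num_nodes):
        """Fraction of the cluster's OSDs lost if one node goes down."""
        total_osds = osds_per_node * num_nodes
        return osds_per_node / float(total_osds)   # == 1 / num_nodes with uniform nodes

    for nodes in (4, 10, 12):
        pct = node_fraction(9, nodes) * 100
        flag = "OK" if pct <= 10 else "too big"
        print("%d nodes x 9 OSDs: one node = %.1f%% of the cluster (%s)" % (nodes, pct, flag))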

I had taken a week off when one of my 12 OSD servers had its OS SD card fail, which took down the server. No one even noticed it went down: none of the VM clients had any performance issues and no data was lost (3x replication). I have the recovery settings turned down as low as possible, and even so it only took about 6 hours to rebuild.
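
For reference, "recovery settings turned down" means throttling options along these lines in ceph.conf (a hedged sketch; the values shown are illustrative, not necessarily the exact ones I run):

    [osd]
    # limit concurrent backfills per OSD
    osd max backfills = 1
    # limit concurrent recovery ops per OSD
    osd recovery max active = 1
    # deprioritise recovery relative to client I/O
    osd recovery op priority = 1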

Speaking of rebuilding, do your performance measurements during a rebuild. That is when the cluster is most stressed and when performance matters most.

There's a lot to think about. Read through the archives of this mailing list; there is a lot of useful advice there!

Jake


On Sat, Jan 7, 2017 at 1:38 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:


Adding more nodes is best if you have an unlimited budget :)
You should add more OSDs per node until you start hitting CPU or network bottlenecks. Use a performance tool like atop or sysstat to see when that happens.
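
As a very rough illustration of where the network ceiling kicks in (back-of-the-envelope numbers, not a benchmark):

    # Back-of-the-envelope: how many OSDs before a 10GbE NIC becomes the bottleneck?
    # The per-OSD figure is an assumption; measure the real value with atop/sysstat.

    nic_usable_mb_s = 10000 / 8.0 * 0.9   # ~1125 MB/s usable on a 10GbE link
    per_osd_mb_s = 80.0                   # assumed sustained write rate of one journaled spinner

    print("NIC saturates at roughly %d OSDs per node" % (nic_usable_mb_s / per_osd_mb_s))
    # Replication traffic shares the same link (unless you run a separate
    # cluster network), so the practical limit is lower.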




-------- Original message --------
From: kevin parrikar <kevin.parker092@xxxxxxxxx>
Date: 07/01/2017 19:56 (GMT+02:00)
To: Lionel Bouton <lionel-subscription@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

Wow, that's a lot of good information. I wish I had known all of this before investing in these devices. Since I don't have any other option, I will get better SSDs and faster HDDs.
I have one more generic question about Ceph: to increase the throughput of a cluster, what is the standard practice, more OSDs per node or more OSD nodes?

Thanks a lot for all your help. I learned so many new things, thanks again.

Kevin

On Sat, Jan 7, 2017 at 7:33 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:





On 07/01/2017 at 14:11, kevin parrikar wrote:
Thanks for your valuable input.

We were using these SSDs in our NAS box (Synology), where they gave 13k IOPS for our file server in RAID 1. We had a few spare disks, which we added to our Ceph nodes hoping they would perform as well as they did in the NAS box. (I am not comparing the NAS with Ceph, just explaining why we decided to use these SSDs.)

We don't have the S3520 or S3610 at the moment but can order one of them to see how it performs in Ceph. We have 4x S3500 80GB handy.

If I create a 2-node cluster with 2x S3500 each and a replica count of 2, do you think it can deliver 24MB/s of 4k writes?




Probably not. See
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

According to the page above, the DC S3500 reaches 39MB/s. The capacity tested isn't specified; yours are only 80GB, which is the lowest capacity I'm aware of, and for all the DC models I know of the speed goes down with the capacity, so you will probably get less than that.

If you put both data and journal on the same device you cut your bandwidth in half, so this gives you an average of <20MB/s per OSD (with occasional peaks above that if you don't have a sustained 20MB/s). With 4 OSDs and size=2, your total write bandwidth is <40MB/s. A single stream of data will only get <20MB/s though (it won't benefit from parallel writes to the 4 OSDs and will only write to 2 at a time).
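
To make that arithmetic explicit (a small sketch that just restates the numbers above; the 39MB/s figure is the one from the linked page):

    # Estimate for 4x DC S3500 80GB with journal + data on the same SSD and size=2.

    ssd_write_mb_s = 39.0      # journal-style write rate reported on the linked page
    journal_penalty = 2.0      # journal + data on the same device: every byte written twice
    osds = 4
    replication = 2            # size=2

    per_osd_mb_s = ssd_write_mb_s / journal_penalty        # < 20 MB/s per OSD
    cluster_mb_s = per_osd_mb_s * osds / replication        # < 40 MB/s aggregate
    single_stream_mb_s = per_osd_mb_s                       # one stream hits one primary (+ its replica) at a time

    print(per_osd_mb_s, cluster_mb_s, single_stream_mb_s)   # 19.5 39.0 19.5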



Note that, by comparison, the 250GB 840 EVO only reaches 1.9MB/s.



But even if you reach the 40MB/s, these models are not designed for heavy writes; you will probably kill them long before their warranty expires (IIRC they are rated for ~24GB of writes per day over the warranty period). In your configuration you only have to write 24GB to the cluster each day to be in this situation (as you have 4 of them, write both data and journal to each, and use size=2); that is an average of only 0.28 MB/s, compared to your 24 MB/s target.
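
The 0.28 MB/s figure works out like this (a quick check of the arithmetic, using the ~24GB/day warranty rating mentioned above):

    # 24 GB/day of client writes, size=2, journal + data on the same 4 SSDs.

    client_gb_per_day = 24.0
    replication = 2
    journal_penalty = 2
    drives = 4

    per_drive_gb_per_day = client_gb_per_day * replication * journal_penalty / drives
    avg_mb_s = client_gb_per_day * 1024 / 86400   # spread evenly over 24 hours

    print(per_drive_gb_per_day)    # 24.0 -> each SSD already hits its ~24GB/day rating
    print(round(avg_mb_s, 2))      # 0.28 MB/s of average client writes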




We bought the S3500 because the last time we tried Ceph, people were suggesting this model :) :)





The 3500 series might be enough with the higher capacities in some rare cases, but the 80GB model is almost useless.



You have to do the math considering (a rough worked example follows below):

- how much you will write to the cluster (guess high if you have to guess),

- whether you will use the SSDs for both journals and data (which means writing everything twice on them),

- your replication level (which means the same data is written multiple times),

- when you expect to replace the hardware,

- the amount of writes per day they support under warranty (if the manufacturer doesn't present this number prominently, they are probably trying to sell you a fast car headed for a brick wall).
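
Here is the kind of worked example that list implies (a hedged sketch; every input below is a placeholder to replace with your own workload and the vendor's endurance rating):

    # Estimate the total write endurance each SSD must survive.
    # All inputs are placeholders; substitute your own workload and warranty figures.

    client_writes_gb_per_day = 100.0   # guess high
    journal_and_data_on_ssd = True     # doubles device writes if True
    replication = 3                    # size=3 writes the same data three times
    num_ssds = 12
    replacement_horizon_years = 5

    journal_factor = 2 if journal_and_data_on_ssd else 1
    per_ssd_gb_per_day = client_writes_gb_per_day * replication * journal_factor / num_ssds
    required_tbw = per_ssd_gb_per_day * 365 * replacement_horizon_years / 1000.0

    print("each SSD sees ~%.0f GB/day" % per_ssd_gb_per_day)
    print("it needs a rating of at least ~%.0f TB written over %d years"
          % (required_tbw, replacement_horizon_years))
    # Compare against the drive's TBW/DWPD rating, then add a large safety margin
    # for rebalancing and SSD-internal write amplification.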



If your hardware can't handle the amount of writes you expect to put through it, then you are screwed. There have been reports of new Ceph users who were not aware of this and used cheap SSDs that all failed at the same time, within a matter of months. You definitely don't want to be in their position.

In fact, because problems happen (a hardware failure leading to cluster rebalancing, for example), you should probably get a system able to handle 10x the amount of writes you expect it to handle, then monitor the SSDs' SMART attributes so you are alerted long before they die and can replace them before problems happen. You definitely want a controller that gives you access to this information. If you can't get it, you will have to monitor the writes and guess the wear level, which is risky because write amplification inside SSDs is not easy to estimate...
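
A minimal monitoring sketch along those lines, assuming smartmontools is installed and the drive exposes a wear attribute such as Intel's Media_Wearout_Indicator (attribute names vary by vendor, so treat this as an illustration rather than a drop-in script):

    # Alert when an SSD's normalised wear indicator drops below a threshold.
    import subprocess
    import sys

    DEVICE = "/dev/sda"   # placeholder device name
    THRESHOLD = 20        # alert when the normalised value (100 = new) falls below this

    out = subprocess.check_output(["smartctl", "-A", DEVICE], universal_newlines=True)
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:    # Intel DC-series attribute 233
            value = int(line.split()[3])         # normalised VALUE column
            print("%s: wear indicator = %d" % (DEVICE, value))
            if value < THRESHOLD:
                sys.exit("%s is nearly worn out, replace it soon" % DEVICE)
            break
    else:
        print("%s: no wear attribute found, monitor host-side writes instead" % DEVICE)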



Lionel





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

