Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

I use 2U servers with 9x 3.5" spinning disks in each. This has scaled well for me, in both performance and  budget. 

I may add 3 more spinning disks to each server at a later time if we need to maximize storage, or I may add 3 SSDs for journals/a cache tier if we need better performance.

Another consideration is failure domain. If a server crashes, how much of your cluster goes down? Some good advice I've read on this forum is that no single OSD server should be more than 10% of the cluster.

I had taken a week off and one of my 12 OSD servers had an OS SD card fail, which took down the server. No one even noticed it went down. None of the VM clients had any performance issues and no data was lost (3x replication). I have the recovery settings turned down as low as possible, and even so it only took about 6 hours to rebuild.  

Speaking of rebuilding, do your performance measurements during a rebuild. That is when the cluster is most stressed and when performance matters most.

There's a lot to think about. Read through the archives of this mailing list; there is a lot of useful advice!

Jake


On Sat, Jan 7, 2017 at 1:38 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:


Adding more nodes is best if you have unlimited budget :)
You should add more OSDs per node until you start hitting CPU or network bottlenecks. Use a perf tool like atop/sysstat to know when this happens.
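
If you want to script that check rather than watch atop interactively, here is a minimal Python sketch (assuming the psutil package and a hypothetical NIC name "eth0") that samples CPU and NIC utilisation while a benchmark runs:

# Rough bottleneck check to run on an OSD node during a benchmark.
# Assumes the psutil package and a NIC called "eth0" -- adjust both.
import psutil

NIC = "eth0"          # hypothetical interface name
LINK_BITS = 10e9      # 10GbE
INTERVAL = 5          # seconds

prev = psutil.net_io_counters(pernic=True)[NIC]
while True:
    cpu = psutil.cpu_percent(interval=INTERVAL)   # blocks for INTERVAL seconds
    cur = psutil.net_io_counters(pernic=True)[NIC]
    bits = (cur.bytes_sent - prev.bytes_sent + cur.bytes_recv - prev.bytes_recv) * 8
    prev = cur
    print("cpu %5.1f%%   nic %5.1f%% of 10GbE" % (cpu, 100.0 * bits / INTERVAL / LINK_BITS))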




-------- Original message --------
From: kevin parrikar <kevin.parker092@xxxxxxxxx>
Date: 07/01/2017 19:56 (GMT+02:00)
To: Lionel Bouton <lionel-subscription@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

Wow, that's a lot of good information. I wish I had known all this before investing in these devices. Since I don't have any other option, I will get better SSDs and faster HDDs.
I have one more generic question about Ceph.
To increase the throughput of a cluster, what is the standard practice: more OSDs per node, or more OSD nodes?

Thanks a lot for all your help. I learned so many new things, thanks again.

Kevin

On Sat, Jan 7, 2017 at 7:33 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:





On 07/01/2017 at 14:11, kevin parrikar wrote:



Thanks for your valuable input.

We were using these SSDs in our NAS box (Synology) and they were giving 13k IOPS for our fileserver in RAID1. We had a few spare disks which we added to our Ceph nodes, hoping they would give good performance like the NAS box. (I am not comparing the NAS with Ceph, just explaining why we decided to use these SSDs.)



We don't have an S3520 or S3610 at the moment but can order one of these to see how it performs in Ceph. We have 4x S3500 80GB handy.

If I create a 2-node cluster with 2x S3500 each and a replica count of 2, do you think it can deliver 24MB/s of 4k writes?





Probably not. See
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
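
The essential point of that test is measuring small synchronous (O_DSYNC) writes, because that is what the journal does. The post uses dd/fio for this; a rough Python sketch of the same idea, assuming a scratch file on the SSD under test (the path below is hypothetical), would be:

# Quick-and-dirty synchronous 4k write test -- a sketch only, the linked
# post does this more rigorously with dd/fio.
# WARNING: make sure the path is safe to (over)write.
import os, time

PATH = "/mnt/ssd/testfile"       # hypothetical path on the SSD under test
BLOCK = os.urandom(4096)         # random data, so the SSD can't compress it
COUNT = 10000                    # ~40MB of 4k synchronous writes

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
start = time.time()
for _ in range(COUNT):
    os.write(fd, BLOCK)
os.close(fd)
elapsed = time.time() - start
print("%.1f MB/s, %.0f IOPS" % (COUNT * 4096 / elapsed / 1e6, COUNT / elapsed))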



According to the page above, the DC S3500 reaches 39MB/s. Its capacity isn't specified; yours are only 80GB, which is the lowest capacity I'm aware of, and for all DC models I know of the speed goes down with the capacity, so you will probably get less than that.

If you put both data and journal on the same device you cut your bandwidth in half: so this would give you an average of <20MB/s per OSD (with occasional peaks above that if you don't have a sustained 20MB/s). With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a single stream of data you will only get <20MB/s though (you won't benefit from parallel writes to the 4 OSDs and will only write to 2 at a time).
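
To make that arithmetic explicit, a back-of-the-envelope sketch (the 39MB/s figure is from the post above and is for a larger S3500 model):

# Back-of-the-envelope write bandwidth: 4x S3500, journal and data
# colocated on each SSD, pool size=2.
ssd_sync_write_mb_s = 39.0                  # benchmark figure from the post
per_osd = ssd_sync_write_mb_s / 2           # journal + data on the same SSD
osds, size = 4, 2
cluster_total = per_osd * osds / size       # every client write lands on 2 OSDs
print("%.1f MB/s per OSD, %.1f MB/s cluster-wide" % (per_osd, cluster_total))
# A single stream only writes to 2 OSDs at a time (primary + replica),
# so it is limited to roughly the per-OSD figure (~20 MB/s).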



Note that by comparison the 250GB 840 EVO only reaches 1.9MB/s.



But even if you reach the 40MB/s, these models are not designed for heavy writes; you will probably kill them long before their warranty expires (IIRC these are rated for ~24GB of writes per day over the warranty period). In your configuration you only have to write 24GB each day to the cluster to be in this situation (as you have 4 of them, write both data and journal to them, and use size=2), which is an average of only 0.28 MB/s compared to your 24 MB/s target.
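
The same back-of-the-envelope check for that endurance figure (the ~24GB/day rating is from memory, as noted above):

# How little client traffic it takes to reach ~24GB/day of device writes
# on each SSD: 4 SSDs, size=2, journal + data colocated (2x write amplification).
rated_gb_per_day_per_ssd = 24.0     # rough warranty figure quoted above (IIRC)
ssds, size, journal_factor = 4, 2, 2
max_client_gb_per_day = rated_gb_per_day_per_ssd * ssds / (size * journal_factor)
avg_mb_s = max_client_gb_per_day * 1024 / 86400
print("%.0f GB/day of client writes, i.e. ~%.2f MB/s average" % (max_client_gb_per_day, avg_mb_s))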




We bought the S3500 because the last time we tried Ceph, people were suggesting this model :) :)





The 3500 series might be enough with the higher capacities in some
rare cases but the 80GB model is almost useless.



You have to do the math considering (a rough sketch of this calculation follows the list):

- how much you will write to the cluster (guess high if you have to guess),

- whether you will use the SSDs for both journals and data (which means writing twice to them),

- your replication level (which means you will write the same data multiple times),

- when you expect to replace the hardware,

- the amount of writes per day they support under warranty (if the manufacturer doesn't present this number prominently, they are probably trying to sell you a fast car headed for a brick wall).
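
A rough sketch of that calculation (every number below is a placeholder assumption; substitute your own workload estimate and the vendor's rated endurance):

# SSD endurance budget from the checklist above -- all inputs are examples.
client_writes_gb_per_day = 100.0    # how much you write to the cluster (guess high)
journal_factor = 2                  # SSDs hold both journal and data -> writes counted twice
replication = 2                     # pool size=2
lifetime_days = 3 * 365             # when you plan to replace the hardware
num_ssds = 4

per_ssd_tb_written = (client_writes_gb_per_day * journal_factor * replication
                      * lifetime_days / num_ssds / 1000.0)
rated_tbw = 45.0                    # e.g. the 80GB DC S3500 is rated around 45 TBW
print("need ~%.0f TB written per SSD over its life, rated for ~%.0f TBW"
      % (per_ssd_tb_written, rated_tbw))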



If your hardware can't handle the amount of writes you expect to put through it, then you are screwed. There have been reports of new Ceph users who were not aware of this and used cheap SSDs that failed in a matter of months, all at the same time. You definitely don't want to be in their position.

In fact, since problems happen (hardware failure leading to cluster storage rebalancing, for example), you should probably get a system able to handle 10x the amount of writes you expect it to handle, then monitor the SSD SMART attributes to be alerted long before the drives die, and replace them before problems happen. You definitely want a controller allowing access to this information. If you can't get one, you will have to monitor the writes and guess this value, which is risky, as write amplification inside SSDs is not easy to guess...
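
As an illustration of the kind of SMART check meant here, a hedged sketch that shells out to smartctl (requires smartmontools and root; attribute names vary by vendor: Intel DC drives expose Media_Wearout_Indicator, Samsung drives Wear_Leveling_Count, both normalised values that start at 100 and count down):

# Sketch of a wear-level check via smartctl; wire the output into your
# monitoring system so you get an alert long before the SSD wears out.
import subprocess

DEVICE = "/dev/sda"      # hypothetical device name
WARN_BELOW = 30          # alert threshold, well before end of life

out = subprocess.check_output(["smartctl", "-A", DEVICE]).decode()
for line in out.splitlines():
    if "Media_Wearout_Indicator" in line or "Wear_Leveling_Count" in line:
        name, value = line.split()[1], int(line.split()[3])   # normalised VALUE column
        status = "OK" if value > WARN_BELOW else "REPLACE SOON"
        print("%s %s = %d (%s)" % (DEVICE, name, value, status))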



Lionel






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
