Re: Performance tuning for SAN SSD config

Hi

  • What is your CPU utilization like? Are any cores close to saturation?
  • If you use fio to test a raw FC LUN from your host (i.e. prior to adding it as an OSD) using random 4k blocks and a high queue depth (32 or more), do you get high IOPS? What are the disk and CPU utilization? (See the sketch after this list.)
  • If you repeat the above test but, instead of testing one LUN, run concurrent fio jobs on all 5 LUNs on the host, does the aggregate IOPS scale roughly 5x? Any resource issues?
  • Does increasing /sys/block/sdX/queue/nr_requests help?
  • Can you use active/active multipath?
  • If the above gives good performance/resource utilization, would you get better performance with more than 20 OSDs/LUNs in total, for example 40 or 60? That should not cost you anything.
  • I still think you can use a replica count of 1 in Ceph since your SAN already has redundancy; using both may be overkill. I am not trying to save space on the SAN but rather to reduce write latency on the Ceph side.
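
For reference, the kind of test I have in mind is roughly the following (a minimal sketch only; the device names and the nr_requests value are placeholders to adapt to your hosts, and the reads are non-destructive so they can run against the raw LUNs):

    # 4k random reads at high queue depth against one raw FC LUN
    fio --name=onelun --filename=/dev/sdb --direct=1 --rw=randread \
        --bs=4k --iodepth=64 --numjobs=4 --time_based --runtime=60 --group_reporting

    # the same test concurrently on all 5 LUNs, to see if aggregate IOPS scales ~5x
    for d in sdb sdc sdd sde sdf; do
        fio --name=$d --filename=/dev/$d --direct=1 --rw=randread \
            --bs=4k --iodepth=64 --time_based --runtime=60 --group_reporting &
    done; wait

    # watch disk and CPU utilization while the tests run
    iostat -xm 2

    # block-layer queue depth and multipath mode
    cat /sys/block/sdb/queue/nr_requests
    echo 1024 > /sys/block/sdb/queue/nr_requests
    multipath -ll

If you do experiment with a replica count of 1, do it on a pool you can afford to lose while testing, e.g. ceph osd pool set <pool> size 1 followed by ceph osd pool set <pool> min_size 1.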

Maged

On 2018-07-06 20:19, Matthew Stroud wrote:

Good to note about the replica count; we will stick with 3. We aren't really concerned about the storage overhead, but rather the additional IO that occurs during writes because of the extra copies.

To be clear, we aren't using Ceph in place of FC, nor the other way around. We have found that SAN storage is cheaper (this one was surprising to me) and performs better than direct-attached storage (DAS) at the small scale we are building (20 TB to about 100 TB). I'm sure that would flip if we were much larger, but for now SAN is better. In summary, we are using the SAN pretty much as DAS, and Ceph uses those SAN disks for its OSDs.

The biggest issue we see is slow requests during rebuilds or node/OSD failures, even though the disks and network aren't being used to their fullest. That leads me to believe there are some host and/or OSD-process bottlenecks going on. Other than that, simply increasing the performance of our Ceph cluster would be a plus, and that is what I'm exploring.
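
(As a rough way to see whether the latency is on the OSD side rather than on the disks, something along these lines can help; the OSD id below is a placeholder:)

    ceph health detail                      # which OSDs are reporting slow requests
    ceph osd perf                           # commit/apply latency per OSD
    ceph daemon osd.<id> dump_historic_ops  # run on that OSD's host: where slow ops spent their time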

As for test numbers, I can't run those right now because the systems we have are in production and I don't want to impact them with IO testing. However, we do have a new cluster coming online shortly; I could do some benchmarking there and get the results back to you.

However, as memory serves, we were only getting about 90-100k IOPS and about 15-50 ms latency with 10 servers running fio with a 50/50 mix of random and sequential workloads. With a single VM, we were getting about 14k IOPS with about 10-30 ms of latency.
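
(Something along these lines would reproduce that kind of mix; the 4k block size and the device path are assumptions/placeholders, and since it writes it should only be pointed at a scratch volume:)

    # random half of the mixed workload, against an RBD-backed scratch volume
    fio --name=rand --filename=/dev/vdb --direct=1 --rw=randrw --rwmixread=50 \
        --bs=4k --iodepth=32 --time_based --runtime=120 --group_reporting &
    # sequential half, run concurrently on the same volume
    fio --name=seq --filename=/dev/vdb --direct=1 --rw=rw --rwmixread=50 \
        --bs=4k --iodepth=32 --time_based --runtime=120 --group_reporting &
    wait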

Thanks,
Matthew Stroud

On 7/6/18, 11:12 AM, "Vasu Kulkarni" <vakulkar@xxxxxxxxxx> wrote:

    On Fri, Jul 6, 2018 at 8:38 AM, Matthew Stroud <mattstroud@xxxxxxxxxxxxx> wrote:
    >
    > Thanks for the reply.
    >
    >
    >
    > Actually, we are using Fibre Channel (it's much more performant than iSCSI in our tests) as the primary storage, and it is serving RBD traffic for OpenStack, so this isn't for backups.
    >
    >
    >
    > Our biggest bottleneck is trying to utilize the host and/or OSD processes correctly. The disks are running at sub-millisecond latency, with about 90% of the IO being served from the array's cache (i.e. not even hitting the disks). According to the host, we never get north of 20% disk utilization unless a deep scrub is going on.
    >
    >
    >
    > We have debated setting the replica size to 2 instead of 3. However, that isn't much of a win for the Pure Storage array, which dedupes on the backend, so extra copies of data are relatively free on that unit. A replica count of 1 wouldn't work because this is hosting a production workload.

    It is a mistake to use a replica count of 2 in production: when one of
    the copies is corrupted, it is hard to fix things. If you are concerned
    about storage overhead, there is the option of using EC pools in
    Luminous. To get back to your original question: comparing the
    network/disk utilization with FC numbers is the wrong comparison. They
    are two different storage systems with different purposes. Ceph is a
    scale-out object storage system, unlike FC systems: you can use
    commodity hardware and grow as you need, and you generally don't need
    HBA/FC-enclosed disks, though nothing stops you from using your
    existing system. You also generally don't need any RAID mirroring
    configuration on the backend, since Ceph handles the redundancy for
    you. Scale-out systems have more work to do than traditional FC
    systems. There are minimal configuration options for BlueStore. What
    kind of disk/network utilization slowdown are you seeing? Can you
    publish your numbers and test data?
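
    For reference, an EC pool in Luminous can be created roughly like this
    (a sketch only; the profile name, k/m values, pg counts and pool/image
    names are placeholders, and RBD on an EC data pool additionally needs
    BlueStore and overwrites enabled):

        ceph osd erasure-code-profile set ec-profile k=2 m=1
        ceph osd pool create ecpool 64 64 erasure ec-profile
        ceph osd pool set ecpool allow_ec_overwrites true
        # RBD keeps image metadata in a replicated pool and puts data here
        rbd create --size 100G --data-pool ecpool rbd/testimage
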
    >
    > Thanks,
    >
    > Matthew Stroud
    >
    >
    > From: Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
    > Date: Friday, July 6, 2018 at 7:01 AM
    > To: Matthew Stroud <mattstroud@xxxxxxxxxxxxx>
    > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
    > Subject: Re: [ceph-users] Performance tuning for SAN SSD config
    >
    >
    > On 2018-06-29 18:30, Matthew Stroud wrote:
    >
    > We back some of our Ceph clusters with SAN SSD disks, particularly VSP G/F and Pure Storage. I'm curious what settings we should look into modifying to take advantage of our SAN arrays. We had to manually set the device class for the LUNs to SSD, which was a big improvement. However, we still see situations where we get slow requests while the underlying disks and network are underutilized.
    >
    >
    > More info about our setup: we are running CentOS 7 with Luminous as our Ceph release. We have 4 OSD nodes with 5x2TB disks each, set up as BlueStore. Our ceph.conf is attached, with some information removed for security reasons.
    >
    >
    > Thanks ahead of time.
    >
    > Thanks,
    > Matthew Stroud
    >
    >
    >
    >
    > If I understand correctly, you are using LUNs (via iSCSI) from your external SAN as OSDs, have created a separate pool with these OSDs with device class SSD, and are using this pool for backup.
    >
    > Some comments:
    >
    > Using external disks as OSDs is probably not that common. It may be better to keep the SAN and the Ceph cluster separate and have your backup tool access both; it would also be safer, since in case of a disaster to the cluster your backup would be on a separate system.
    > What backup tool/script are you using? It is better if that tool uses a high queue depth, large block sizes and the memory/page cache to increase performance during copies.
    > To try to pin down where your current bottleneck is, I would run benchmarks (e.g. fio) using the block sizes used by your backup tool, on the raw LUNs before they are added as OSDs (as pure iSCSI disks) as well as on both the main and backup pools. Have a resource tool (e.g. atop/sysstat/collectl) running during these tests to check resources: disk %busy, core %busy, io_wait.
    > You can probably use a replica count of 1 for the SAN OSDs since they include their own RAID redundancy.
    >
    > Maged
    >
    >
    >




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
