Hi, thanks for all the replies.

I've found the issue: the Samsung NVMe SSD has poor performance with sync=1. It reaches only 4-5k IOPS with randwrite operations. Using Intel DC S3700 SSDs I'm able to saturate the CPU.

I'm using Hammer v0.94.5 on Ubuntu 14.04 with the 3.19.0-31 kernel.

What do you think about the Intel 750 series?
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html

I plan to use it for the cache layer (one per host - is that a problem?). Behind the cache layer I plan to use mechanical HDDs with the journal on SSD drives. What do you think about it?

Thanks

Regards,
Matteo

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Monday, October 26, 2015 17:45
To: Christian Balzer <chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: BAD nvme SSD performance

Another point: as Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of an SSD before choosing it for Ceph. Run fio with direct=1 and sync=1 against the raw SSD drive (see the fio sketch further down in this thread).

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@xxxxxxxxxxxxxx
Subject: Re: BAD nvme SSD performance

One thing: *don't* trust iostat's disk util% in the case of SSDs. 100% does not mean you are saturating the SSDs there; I have seen a large performance delta even when iostat reports 100% disk util in both cases.

Also, the ceph.conf file you are using is not optimal. Try adding these:

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your CPU. Assuming you have a powerful CPU complex for the SSDs, tweak these to a high number of shards; it also depends on the number of OSDs per box:

osd_op_num_threads_per_shard
osd_op_num_shards

You don't need to change the following:

osd_disk_threads
osd_op_threads

Instead, try increasing:

filestore_op_threads

Use the following in the global section:

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following:

filestore_max_sync_interval = 1 (or even lower; you need to lower filestore_min_sync_interval as well)

I am assuming you are using Hammer or newer.

Thanks & Regards
Somnath

Try increasing the following to very big numbers:

> > filestore_queue_max_ops = 2000
> > filestore_queue_max_bytes = 536870912
> > filestore_queue_committing_max_ops = 500
> > filestore_queue_committing_max_bytes = 268435456

Use the following:

osd_enable_op_tracker = false
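Pulled together, the additions suggested above could look roughly like this in ceph.conf. This is only a sketch: the shard, thread, and queue values below are illustrative assumptions to be tuned to the CPU and OSD count of the box, not values given in this thread, and the debug_* lines listed above are omitted for brevity.

[global]
# ...plus the debug_* = 0/0 settings listed above...
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

[osd]
# assumed example values for a many-core box with a few NVMe OSDs
osd_op_num_shards = 10
osd_op_num_threads_per_shard = 2
osd_enable_op_tracker = false
filestore_op_threads = 8
filestore_max_sync_interval = 1
filestore_min_sync_interval = 0.01    # keep this below the max; 0.01 is the default
# "very big numbers" for the filestore queues; the exact values are assumptions
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1073741824
filestore_queue_committing_max_ops = 5000
filestore_queue_committing_max_bytes = 1073741824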
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: BAD nvme SSD performance

Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> > With iostat I noticed that the disk is fully utilized and writes per
> > second from iostat match the fio output.
>
> Ceph isn't fully optimized to get the maximum potential out of NVMe
> SSDs yet.

Indeed. Don't expect Ceph to be near raw SSD performance.

However, he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:
a) utilization of the SSD is 100%, and
b) the CPU is not the bottleneck.

My guess would be that these particular NVMe SSDs might just suffer from the same direct sync I/O deficiencies as other Samsung SSDs. This feeling is reaffirmed by seeing Samsung list them as client SSDs, not data center ones:
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian
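A quick way to check this hypothesis, and Somnath's earlier suggestion, is to measure synchronous direct 4k writes against the raw device, outside Ceph. A minimal fio sketch is below; the device path /dev/nvme0n1 is only an assumption for the SM951, and running this test will overwrite data on that device.

# 4k random sync writes straight to the raw device (no filesystem, no Ceph).
# WARNING: this destroys data on the target device.
fio --name=sync-write-test \
    --filename=/dev/nvme0n1 \
    --ioengine=libaio \
    --direct=1 \
    --sync=1 \
    --rw=randwrite \
    --bs=4k \
    --numjobs=1 \
    --iodepth=1 \
    --runtime=60 \
    --time_based \
    --group_reporting

SSDs that handle Ceph journals well, such as the Intel DC S3700 mentioned above, sustain high IOPS in this test, while many client-class SSDs collapse to a few thousand, which matches the 4-5k Matteo reports for the SM951 with sync=1.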
> For example, NVMe SSDs work best with very high queue depths and
> parallel IOPS.
>
> Also, be aware that Ceph adds multiple layers to the whole I/O
> subsystem and that there will be a performance impact when Ceph is
> used in between.
>
> Wido
>
> > Matteo
> >
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Monday, October 26, 2015 13:06
> > To: Matteo Dacrema <mdacrema@xxxxxxxx>; ceph-users@xxxxxxxx
> > Subject: RE: BAD nvme SSD performance
> >
> > Hi Matteo,
> >
> > Ceph introduces latency into the write path, so what you are seeing
> > is typical. If you increase the iodepth of the fio test you should
> > get higher results though, until you start maxing out your CPU
> > (see the sketch at the end of this message).
> >
> > Nick
> >
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Matteo Dacrema
> > Sent: 26 October 2015 11:20
> > To: ceph-users@xxxxxxxx
> > Subject: BAD nvme SSD performance
> >
> > Hi all,
> >
> > I've recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built
> > a 2-OSD Ceph cluster with min_size = 1.
> >
> > I've tested them with fio and obtained two very different results in
> > these two situations.
> >
> > This is the command: fio --ioengine=libaio --direct=1 --name=test
> > --filename=test --bs=4k --size=100M --readwrite=randwrite
> > --numjobs=200 --group_reporting
> >
> > On the OSD host I obtained this result:
> > bw=575493KB/s, iops=143873
> >
> > On the client host, with a mounted volume, I obtained this result:
> > bw=9288.1KB/s, iops=2322
> >
> > I obtained these results with journal and data on the same disk and
> > also with the journal on a separate SSD.
> >
> > I have two OSD hosts with 64GB of RAM and 2x Intel Xeon E5-2620 @
> > 2.00GHz, and one MON host with 128GB of RAM and 2x Intel Xeon E5-2620
> > @ 2.00GHz.
> >
> > I'm using 10G Mellanox NICs and a switch with jumbo frames.
> >
> > I also did another test with this configuration (see the attached
> > Excel workbook). Hardware configuration for each of the two OSD nodes:
> >
> > 3x 100GB Intel SSD DC S3700, with 3x 30GB partitions per SSD
> > 9x 1TB Seagate HDD
> >
> > Results: about 12k IOPS with 4k bs and the same fio test.
> >
> > I can't understand where the problem with the NVMe SSDs is.
> > Can anyone help me?
> >
> > Here is the ceph.conf:
> >
> > [global]
> > fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
> > mon_initial_members = mon1
> > mon_host = 192.168.1.3
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > osd_pool_default_size = 2
> > mon_client_hung_interval = 1.0
> > mon_client_ping_interval = 5.0
> > public_network = 192.168.1.0/24
> > cluster_network = 192.168.1.0/24
> > mon_osd_full_ratio = .90
> > mon_osd_nearfull_ratio = .85
> >
> > [mon]
> > mon_warn_on_legacy_crush_tunables = false
> >
> > [mon.1]
> > host = mon1
> > mon_addr = 192.168.1.3:6789
> >
> > [osd]
> > osd_journal_size = 30000
> > journal_dio = true
> > journal_aio = true
> > osd_op_threads = 24
> > osd_op_thread_timeout = 60
> > osd_disk_threads = 8
> > osd_recovery_threads = 2
> > osd_recovery_max_active = 1
> > osd_max_backfills = 2
> > osd_mkfs_type = xfs
> > osd_mkfs_options_xfs = "-f -i size=2048"
> > osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
> > filestore_xattr_use_omap = false
> > filestore_max_inline_xattr_size = 512
> > filestore_max_sync_interval = 10
> > filestore_merge_threshold = 40
> > filestore_split_multiple = 8
> > filestore_flusher = false
> > filestore_queue_max_ops = 2000
> > filestore_queue_max_bytes = 536870912
> > filestore_queue_committing_max_ops = 500
> > filestore_queue_committing_max_bytes = 268435456
> > filestore_op_threads = 2
> >
> > Best regards,
> > Matteo

--
Christian Balzer    Network/Systems Engineer
chibi@xxxxxxx       Global OnLine Japan/Fusion Communications
http://www.gol.com/
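As a footnote to Nick's point about queue depth, a variant of Matteo's client-side test that keeps more requests in flight could look like the sketch below. The mount point /mnt/rbd, the file name, and the iodepth/numjobs values are assumptions chosen for illustration, not settings given in the thread.

# 4k random writes to a file on the RBD-backed mount, with a deeper queue so
# that more requests are in flight and per-request Ceph latency is overlapped.
fio --name=rbd-qd-test \
    --filename=/mnt/rbd/fio-test \
    --ioengine=libaio \
    --direct=1 \
    --rw=randwrite \
    --bs=4k \
    --size=1G \
    --numjobs=4 \
    --iodepth=32 \
    --runtime=60 \
    --time_based \
    --group_reporting

Aggregate IOPS should scale with queue depth until the OSD hosts run out of CPU, which is the behaviour Nick describes.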
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com