Re: BAD nvme SSD performance

On 26-10-15 14:29, Matteo Dacrema wrote:
> Hi Nick,
> 
>  
> 
> I also tried to increase the iodepth, but nothing changed.
> 
>  
> 
> With iostat I noticed that the disk is fully utilized and that the writes
> per second reported by iostat match the fio output.
> 
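
Just to double-check what the device itself is doing, the extended iostat output is a quick sanity check (only a sketch; nvme0n1 is a placeholder device name and the column names vary a bit between sysstat versions):

    iostat -x 1 nvme0n1

The w/s, avgqu-sz and %util columns show the write rate and the average queue size the NVMe device actually sustains.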

Ceph isn't yet fully optimized to get the maximum potential out of NVMe
SSDs.

For example, NVMe SSDs perform best with very high queue depths and many
parallel I/Os.
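
As a rough illustration (only a sketch; the pool and image names are
placeholders, and it assumes your fio build has the rbd engine compiled
in), driving an RBD image with a deep queue looks something like this:

    fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg --name=rbdtest --bs=4k --rw=randwrite --iodepth=64 --time_based --runtime=60 --group_reporting

The important part is --iodepth=64: with a queue depth of 1 per job you
are mostly measuring round-trip latency, not what the NVMe device can
deliver.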

Also, be aware that Ceph adds multiple layers to the whole I/O path, so
there will be a performance impact when Ceph sits in between.

Wido

>  
> 
> Matteo
> 
>  
> 
> *From:*Nick Fisk [mailto:nick@xxxxxxxxxx]
> *Sent:* lunedì 26 ottobre 2015 13:06
> *To:* Matteo Dacrema <mdacrema@xxxxxxxx>; ceph-users@xxxxxxxx
> *Subject:* RE: BAD nvme SSD performance
> 
>  
> 
> Hi Matteo,
> 
>  
> 
> Ceph introduces latency into the write path, so what you are seeing is
> typical. If you increase the iodepth of the fio test you should get
> higher results, though, until you start maxing out your CPU.
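> 
> For example (just a sketch, not a tuned benchmark; adjust the numbers to
> your hardware and CPU count), the same test with fewer jobs but a deeper
> queue per job would look like:
> 
> fio --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k --size=100M --readwrite=randwrite --iodepth=32 --numjobs=8 --group_reporting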
> 
>  
> 
> Nick
> 
>  
> 
> *From:*ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf
> Of *Matteo Dacrema
> *Sent:* 26 October 2015 11:20
> *To:* ceph-users@xxxxxxxx <mailto:ceph-users@xxxxxxxx>
> *Subject:*  BAD nvme SSD performance
> 
>  
> 
> Hi all,
> 
>  
> 
> I’ve recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built a
> 2-OSD Ceph cluster with min_size = 1.
> 
> I’ve tested them with fio and obtained two very different results in the
> following two situations.
> 
> This is the command: *fio --ioengine=libaio --direct=1 --name=test
> --filename=test --bs=4k --size=100M --readwrite=randwrite
> --numjobs=200 --group_reporting*
> 
>  
> 
> On the OSD host I’ve obtained this result:
> 
> *bw=575493KB/s, iops=143873*
> 
> On the client host, with fio run against a mounted volume, I’ve obtained
> this result:
> 
> *bw=9288.1KB/s, iops=2322*
> 
> I’ve obtained these results both with the journal and data on the same
> disk and with the journal on a separate SSD.
> 
> I have two OSD hosts, each with 64GB of RAM and 2x Intel Xeon E5-2620 @
> 2.00GHz, and one MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @
> 2.00GHz.
> 
> I’m using 10G Mellanox NICs and a switch with jumbo frames.
> 
>  
> 
> I also did other tests with this configuration (see the attached Excel
> workbook).
> 
> Hardware configuration for each of the two OSD nodes:
> 
>                 3x 100GB Intel SSD DC S3700, with 3 x 30 GB partitions
> on every SSD
> 
>                 9x 1TB Seagate HDD
> 
> Results: about *12k* IOPS with the same fio test at 4k block size.
> 
>  
> 
> I can’t understand where the problem with the NVMe SSDs lies.
> 
> Can anyone help me?
> 
>  
> 
> Here is the *ceph.conf*:
> 
> [global]
> fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
> mon_initial_members = mon1
> mon_host = 192.168.1.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> osd_pool_default_size = 2
> mon_client_hung_interval = 1.0
> mon_client_ping_interval = 5.0
> public_network = 192.168.1.0/24
> cluster_network = 192.168.1.0/24
> mon_osd_full_ratio = .90
> mon_osd_nearfull_ratio = .85
> 
> [mon]
> mon_warn_on_legacy_crush_tunables = false
> 
> [mon.1]
> host = mon1
> mon_addr = 192.168.1.3:6789
> 
> [osd]
> osd_journal_size = 30000
> journal_dio = true
> journal_aio = true
> osd_op_threads = 24
> osd_op_thread_timeout = 60
> osd_disk_threads = 8
> osd_recovery_threads = 2
> osd_recovery_max_active = 1
> osd_max_backfills = 2
> osd_mkfs_type = xfs
> osd_mkfs_options_xfs = "-f -i size=2048"
> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
> filestore_xattr_use_omap = false
> filestore_max_inline_xattr_size = 512
> filestore_max_sync_interval = 10
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> filestore_flusher = false
> filestore_queue_max_ops = 2000
> filestore_queue_max_bytes = 536870912
> filestore_queue_committing_max_ops = 500
> filestore_queue_committing_max_bytes = 268435456
> filestore_op_threads = 2
> 
>  
> 
> Best regards,
> 
> Matteo
> 
>  
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



