Hi Andrei,
If there is one thing I've come to understand by now, it is that ceph configs, performance, hw and - well - everything seem to vary on an almost per-person basis.
# uptime
16:24:57 up 611 days, 4:03, 1 user, load average: 1.18, 1.55, 1.72
# iostat -x
[ ... ]
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.16 4.87 22.62 344.18 458.65 58.41 0.05 1.92 0.45 2.24 0.76 2.10
sdd 0.00 0.12 4.37 20.02 317.98 437.95 61.98 0.05 1.90 0.44 2.21 0.78 1.91
sde 0.00 0.12 4.17 19.33 302.45 403.02 60.02 0.04 1.87 0.43 2.18 0.77 1.80
sdf 0.00 0.12 4.51 20.84 322.84 439.70 60.17 0.05 1.84 0.43 2.15 0.76 1.93
[ ... ]
Granted, we do not have very high usage on this cluster on a per-ssd basis, and that might change as we put more load on it, but we will deal with that then. I think ~2ms access time is neither good nor bad.
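For anyone comparing such figures across hosts, here is a minimal Python sketch that maps one `iostat -x` row onto the header shown above. The function name `parse_iostat_row` is made up for illustration, and the column order is assumed to match the header in this output; other sysstat versions print different columns.

```python
# Column names copied from the iostat -x header quoted above.
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s "
          "avgrq-sz avgqu-sz await r_await w_await svctm %util").split()

def parse_iostat_row(line):
    """Return a dict mapping column name -> value for one device row."""
    fields = line.split()
    if len(fields) != len(HEADER):
        raise ValueError("unexpected column count: %d" % len(fields))
    row = {"device": fields[0]}
    for name, value in zip(HEADER[1:], fields[1:]):
        row[name] = float(value)
    return row

# The sdc row from the output above.
sdc = parse_iostat_row(
    "sdc 0.00 0.16 4.87 22.62 344.18 458.65 58.41 0.05 1.92 0.45 2.24 0.76 2.10")
print(sdc["device"], "await:", sdc["await"], "ms, w_await:", sdc["w_await"], "ms")
```

The `await` column is the ~2ms access time mentioned above; `w_await` is the write-only part of it.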
This is from another cluster we operate - this one has an intel DC S3700 800gb ssd (sdb)
# uptime
09:37:26 up 654 days, 8:40, 1 user, load average: 0.33, 0.40, 0.54
# iostat -x
[ ... ]
sdb 0.01 1.49 39.76 86.79 1252.80 2096.98 52.94 0.02 0.76 1.22 0.54 0.41 5.21
[ ... ]
The comparison is misleading, though, as the latter has just 3 disks plus a hardware RAID controller with a 1GB backed cache, whereas the first is a 'cheap', dumb 12-disk JBOD, IT-based setup.
All the SSDs in both clusters have 3 partitions - 1 ceph-data and 2 journal partitions (1 journal for the ssd itself and 1 journal for 1 platter disk).
The intel ssd is very sturdy though - it has had a 2.1MB/sec avg. write over 654 days - that is somewhere around 120TB so far.
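As a rough check of that figure, the arithmetic can be sketched in Python from the sdb line and the uptime above. This assumes that the first iostat report averages since boot and that iostat's kB means 1024 bytes; both are assumptions, not something stated in the output.

```python
# Rough check of the ~120TB figure from the iostat and uptime output above.
# Assumptions: the averages cover the whole uptime, and 1 kB = 1024 bytes.
WKB_PER_S = 2096.98                            # wkB/s for sdb
UPTIME_S = 654 * 86400 + 8 * 3600 + 40 * 60    # 654 days, 8:40

total_bytes = WKB_PER_S * 1024 * UPTIME_S
total_tb = total_bytes / 1e12                  # decimal terabytes
print("approx. %.0f TB written" % total_tb)
```

With 1 kB taken as 1000 bytes instead, the result is a few TB lower, so "somewhere around 120TB" holds either way.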
But ultimately it boils down to what you need - in our use case the latter cluster has to be rock stable and performing - and we chose the Intel ones based on that. For the first one we don't really care if we lose a node or two, and we replace disks every month or whenever it fits into our going-to-datacenter schedule - we wanted an ok'ish performing cluster and focused more on total space / price than on high-performing hardware. The fantastic thing is that we are not locked into any specific hardware, and we can replace any of it if we need to and/or find it is suddenly starting to have issues.
Cheers,
Martin
On Sat, Feb 28, 2015 at 2:55 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
Martin,
I have been using Samsung 840 Pro for journals for about 2 years now and have just replaced all my Samsung drives with Intel. We have found a lot of performance issues with the 840 Pro (we are using the 128GB model). In particular, we saw very strange behaviour when using 4 partitions (with 50% underprovisioning left as empty unpartitioned space on the drive), where the drive would grind to almost a halt after a few weeks of use. I was getting 100% utilisation on the drives doing just 3-4MB/s writes. This was not the case when I installed the new drives. Manual trimming helps for a few weeks until the same happens again.
This has been happening with all 840 Pro ssds that we have and contacting Samsung Support has proven to be utterly useless. They do not want to speak with you until you install windows and run their monkey utility ((.
Also, I've noticed the latencies of the Samsung 840 Pro drives to be about 15-20 times higher than those of consumer-grade Intel drives, like the Intel 520. According to ceph osd perf, I would consistently get higher figures on the osds with a Samsung journal drive compared with the Intel drive on the same server. Something like 2-3ms for Intel vs 40-50ms for Samsungs.
At some point we had enough with Samsungs and scrapped them.
Andrei

From: "Martin B Nielsen" <martin@xxxxxxxxxxx>
To: "Philippe Schwarz" <phil@xxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Saturday, 28 February, 2015 11:51:57 AM
Subject: Re: Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel

Hi,

I cannot recognize that picture; we've been using Samsung 840 Pro in production for almost 2 years now - and have had 1 fail.

We run an 8-node mixed ssd/platter cluster with 4x Samsung 840 Pro (500gb) in each, so that is 32x ssd.

They've written ~25TB data in avg each.

Using the dd you had inside an existing semi-busy mysql-guest I get:

102400000 bytes (102 MB) copied, 5.58218 s, 18.3 MB/s

Which is still not a lot, but I think it is more a limitation of our setup/load.

We are using dumpling.

All that aside, I would prob. go with something tried and tested if I was to redo it today - we haven't had any issues, but it is still nice to use something you know should have a baseline performance and can compare to that.

Cheers,
Martin

On Sat, Feb 28, 2015 at 12:32 PM, Philippe Schwarz <phil@xxxxxxxxxxxxxx> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi, I'm new to ceph, so don't consider my words as holy truth.

On 28/02/2015 12:19, mad Engineer wrote:
> Hello All,
>
> I am trying ceph-firefly 0.80.8
> (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSDs, all Samsung
> SSD 850 EVO, on 3 servers with 24 GB RAM and 16 cores @ 2.27 GHz, running
> Ubuntu 14.04 LTS with the 3.16-3 kernel. All are connected to 10G ports
> with maximum MTU. There are no extra disks for journaling, and there
> is no separate network for replication and data transfer. All 3
> nodes also host the monitoring process. The operating system runs on
> a SATA disk.
>
> When doing a sequential benchmark using "dd" on RBD, mounted on the
> client as ext4, it is taking 110s to write 100MB of data at an average
> speed of 926 kB/s.
>
> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> 25000+0 records in
> 25000+0 records out
> 102400000 bytes (102 MB) copied, 110.582 s, 926 kB/s
>
> real 1m50.585s
> user 0m0.106s
> sys 0m2.233s
>
> While doing this directly on ssd mount point shows:
>
> time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct
> 25000+0 records in
> 25000+0 records out
> 102400000 bytes (102 MB) copied, 1.38567 s, 73.9 MB/s
>
> OSDs are in XFS with these extra arguments :
>
> rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
>
> ceph.conf
>
> [global]
> fsid = 7d889081-7826-439c-9fe5-d4e57480d9be
> mon_initial_members = ceph1, ceph2, ceph3
> mon_host = 10.99.10.118,10.99.10.119,10.99.10.120
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 2
> osd_pool_default_pg_num = 450
> osd_pool_default_pgp_num = 450
> max_open_files = 131072
>
> [osd]
> osd_mkfs_type = xfs
> osd_op_threads = 8
> osd_disk_threads = 4
> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
>
>
> On our traditional storage with full SAS disks, the same "dd" completes
> in 16s with an average write speed of 6 MB/s.
>
> Rados bench:
>
> rados bench -p rbd 10 write
> Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_2977
>   sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0        0        0         0         0         0         -         0
>     1       16       94        78   311.821       312  0.041228  0.140132
>     2       16      192       176   351.866       392  0.106294  0.175055
>     3       16      275       259   345.216       332  0.076795  0.166036
>     4       16      302       286   285.912       108  0.043888  0.196419
>     5       16      395       379    303.11       372  0.126033  0.207488
>     6       16      501       485   323.242       424  0.125972  0.194559
>     7       16      621       605   345.621       480  0.194155  0.183123
>     8       16      730       714   356.903       436  0.086678  0.176099
>     9       16      814       798   354.572       336  0.081567  0.174786
>    10       16      832       816   326.313        72  0.037431  0.182355
>    11       16      833       817   297.013         4  0.533326  0.182784
> Total time run:         11.489068
> Total writes made:      833
> Write size:             4194304
> Bandwidth (MB/sec):     290.015
>
> Stddev Bandwidth:       175.723
> Max bandwidth (MB/sec): 480
> Min bandwidth (MB/sec): 0
> Average Latency:        0.220582
> Stddev Latency:         0.343697
> Max latency:            2.85104
> Min latency:            0.035381
>
> Our ultimate aim is to replace our existing SAN with ceph, but for that
> it should deliver a minimum of 8000 IOPS. Can anyone help me with this?
> The OSDs are SSDs, the CPUs have a good clock speed and the backend
> network is good, but still we are not able to extract the full
> capability of the SSDs.
>
>
>
> Thanks,
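
The rados bench summary quoted above can be cross-checked with a little arithmetic. The following Python sketch only reuses the numbers from that output; the Little's law step (operations in flight divided by average latency) is an added sanity check, not something rados bench reports itself.

```python
# Figures copied from the rados bench summary quoted above.
writes = 833
write_size = 4194304          # bytes per object (4 MiB)
total_time = 11.489068        # seconds
concurrency = 16
avg_latency = 0.220582        # seconds

# Bandwidth as total data over wall time; the reported 290.015 matches
# when the result is expressed in MiB per second.
bandwidth = writes * write_size / total_time / 2**20

# Little's law estimate: ops in flight / average latency = ops per second.
ops_per_s = concurrency / avg_latency
estimate = ops_per_s * write_size / 2**20

print("bandwidth: %.1f MB/s, latency-based estimate: %.1f MB/s" % (bandwidth, estimate))
```

Both numbers agree, so the large-object throughput is bounded by the ~0.22 s average latency at 16 concurrent writes rather than by a reporting error.
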
It seems that Samsung 840 (so i assume 850) are crappy for ceph :
MTBF :
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
Bandwidth :
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045247.html
And according to a confirmed user of Ceph/ProxmoX, Samsung SSDs should
be avoided if possible in ceph storage.
Apart from that, it seems there was a limitation in ceph on the use
of the complete bandwidth available in SSDs; but I think with less
than 1MB/s you haven't hit this limit.
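
To put the dd figures from this thread next to the 8000 IOPS requirement, here is a small Python sketch. It assumes that dd with bs=4k and oflag=direct issues its 25000 writes one at a time, so count divided by elapsed time approximates IOPS at queue depth 1; the labels in `runs` are just descriptive names.

```python
# dd with bs=4k count=25000 oflag=direct: 25000 writes, assumed to be
# issued one after another (queue depth 1).
COUNT = 25000
TARGET_IOPS = 8000   # requirement stated in the original post

def iops(elapsed_s):
    """Writes per second for one dd run of COUNT blocks."""
    return COUNT / elapsed_s

# Elapsed times copied from the dd outputs in this thread.
runs = {
    "RBD (ext4 on client)": 110.582,
    "SSD mount point": 1.38567,
    "mysql guest on 840 Pro cluster": 5.58218,
}
for name, elapsed in runs.items():
    rate = iops(elapsed)
    print("%-32s %8.0f IOPS, %.2f ms per write" % (name, rate, 1000.0 / rate))
```

At queue depth 1 the result is dominated by per-write latency, so a single dd run says little about what the cluster can deliver with many parallel clients.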
I remind you that i'm not a ceph-guru (far from that, indeed), so feel
free to disagree; i'm on the way to improve my knowledge.
Best regards.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iEYEARECAAYFAlTxp0UACgkQlhqCFkbqHRb5+wCgrXCM3VsnVE6PCbbpOmQXCXbr
8u0An2BUgZWismSK0PxbwVDOD5+/UWik
=0o0v
-----END PGP SIGNATURE-----
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com