First, on your comment:
"we found that during times where the cache pool flushed to
the storage pool client IO took a severe hit"
We found the same thing: http://blog.wadeit.io/ceph-cache-tier-performance-random-writes/
I don't claim it's a great write-up, and it may not be what a lot of folks are interested in, but it covers what I was after.
Great work on your fio test. However, take a look at the response time: naturally it will increase beyond 4-5 concurrent writes, which is of course what you were saying, and that is correct. That said, I think we can generally accept a slightly higher response time, so iodepth > 1 is a more real-world test. Just my thoughts. You did the right thing and tested well.
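For example, a run along these lines (just a sketch, not from your test; the device path and queue depth are placeholders) shows the IOPS/latency trade-off at a deeper queue:

# Sketch: same 4k write pattern, but queued through libaio at iodepth=32
# instead of one synchronous write at a time. Adjust the device path.
fio --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=32 --ioengine=libaio --runtime=60 \
    --time_based --group_reporting --name=journal-test-qd32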
Some might not like it, but I like Sebastien's journal size calculation and it has served me well.
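For reference, the formula his sizing advice builds on (the one from the Ceph docs, quoted here from memory rather than from this thread) is journal size = 2 * (expected throughput * filestore max sync interval), where expected throughput is the smaller of the disk and network throughput. A quick back-of-the-envelope check with assumed example numbers:

# Rough journal sizing sketch, assumed example values only.
expected_throughput_mb=150   # min(disk, network) throughput in MB/s
sync_interval_s=5            # filestore max sync interval (default 5 s)
echo "$((2 * expected_throughput_mb * sync_interval_s)) MB"   # -> 1500 MB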
Cheers
Wade
On Thu, Feb 4, 2016 at 7:24 AM Sascha Vogt <sascha.vogt@xxxxxxxxx> wrote:
Hi,
On 04.02.2016 at 12:59, Wade Holler wrote:
> You referenced parallel writes for journal and data, which is the default
> for btrfs but not XFS. Now you are mentioning multiple parallel writes
> to the drive, which of course will occur.
Ah, that is good to know. So if I want to create more "parallelism" I
should use btrfs then. Thanks a lot, that's a very critical bit of
information :)
> Also, our Dell 400 GB NVMe drives do not top out at around 5-7 sequential
> writes as you mentioned. That would be 5-7 random writes from the drive's
> perspective, and the NVMe drives can do many times that.
Hm, I used the following fio bench from [1]:
fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test
Our disks showed the following bandwidths (#<no> is the numjobs parameter):
#1: write: io=1992.2MB, bw=33997KB/s, iops=8499
#2: write: io=5621.6MB, bw=95940KB/s, iops=23984
#3: write: io=8062.8MB, bw=137602KB/s, iops=34400
#4: write: io=9114.1MB, bw=155545KB/s, iops=38886
#5: write: io=8860.7MB, bw=151169KB/s, iops=37792
Also, for more jobs (I tried up to 8) bandwidth stayed at around 150 MB/s
and around 37k IOPS. So I figured that around 5 should be the sweet spot
in terms of journals on a single disk.
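A small loop along these lines (purely a sketch reusing the options above) makes that numjobs sweep easy to repeat:

for jobs in 1 2 3 4 5 6 7 8; do
    # Same 4k sync-write test as above, run once per numjobs value.
    fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=$jobs --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test-$jobs
done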
> I would park it at 5-6 partitions per NVMe, journal on the same disk.
> Frequently I want more concurrent operations, rather than all-out
> throughput.
For journals on the same disk, should I limit the journal size? If yes,
what should the limit be? Rather large or rather small?
Greetings
-Sascha-
[1]http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com