Re: multiple journals on SSD

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Zoltan Arnold Nagy
> Sent: 08 July 2016 08:51
> To: Christian Balzer <chibi@xxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>; nick@xxxxxxxxxx
> Subject: Re:  multiple journals on SSD
> 
> Hi Christian,
> 
> 
> On 08 Jul 2016, at 02:22, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> > Hello,
> > 
> > On Thu, 7 Jul 2016 23:19:35 +0200 Zoltan Arnold Nagy wrote:
> > 
> > > Hi Nick,
> > > 
> > > How large NVMe drives are you running per 12 disks?
> > > 
> > > In my current setup I have 4xP3700 per 36 disks but I feel like I could
> > > get by with 2… Just looking for community experience :-)
> > 
> > This is funny, because you ask Nick about the size and don't mention it
> > yourself. ^o^
> 
> You are absolutely right, my bad. We are using the 400GB models.
> 
> 
> > As I speculated in my reply, it's the 400GB model and Nick didn't dispute
> > that.
> > And I shall assume the same for you.
> > 
> > You could get by with 2 of the 400GB ones, but that depends on a number of
> > things.
> > 
> > 1. What's your use case, typical usage pattern?
> > Are you doing a lot of large sequential writes or is it mostly smallish
> > I/Os?
> > HDD OSDs will clock in at about 100MB/s with OSD bench, but realistically
> > not see more than 50-60MB/s, so with 18 of them per one 400GB P3700 you're
> > about on par.
> 
> Our usage varies so much that it’s hard to put a finger on it.
> Some days it’s this, some days it’s that. Internal cloud with a bunch of researchers.
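
To put rough numbers on that "about on par" point, here is a quick back-of-envelope. The ~1GB/s sequential write figure for the 400GB P3700 is an assumption from the spec sheet, so substitute your own measured numbers:

# Aggregate HDD write throughput vs. one 400GB P3700 journal.
# nvme_write_mbs is an assumed spec-sheet figure (~1GB/s sequential write);
# benchmark your own drives for real numbers.
hdd_mbs = (50, 60)          # realistic MB/s per HDD OSD
hdds_per_nvme = 18          # 36 disks split across 2 NVMes
nvme_write_mbs = 1000       # assumed ~1GB/s sequential write

low = hdds_per_nvme * hdd_mbs[0]     # 900 MB/s
high = hdds_per_nvme * hdd_mbs[1]    # 1080 MB/s
print("HDDs behind one journal: %d-%d MB/s vs NVMe ~%d MB/s"
      % (low, high, nvme_write_mbs))
# i.e. with 2 journals per 36 disks the spinners and the journal are roughly matched.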

What I have seen is that where something like a SAS/SATA SSD has an almost linear response of latency against load, NVMes start off with a shallower curve. You probably want to look at how hard your current journals are getting hit. If they are much above 25-50% utilisation I would hesitate to put much more load on them for latency reasons, unless you are just going for big buffered write performance. You could probably drop down to maybe using 3 for every 12 disks though?
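
If you want to put a number on how hard the journals are currently being hit, a minimal sketch along these lines works. It samples the io_ticks counter in /proc/diskstats, which is what iostat's %util is derived from; the device name is just an example:

#!/usr/bin/env python
# Rough journal-device utilisation check: samples the "time spent doing I/Os"
# counter (io_ticks, in ms) from /proc/diskstats twice and reports the busy
# percentage over the interval, i.e. the figure iostat shows as %util.
import time

def io_ticks(dev):
    # /proc/diskstats: major minor name then 11+ counters; counter 10 is
    # milliseconds spent doing I/Os.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12])
    raise ValueError("device %s not found" % dev)

def utilisation(dev, interval=5.0):
    before = io_ticks(dev)
    time.sleep(interval)
    after = io_ticks(dev)
    return 100.0 * (after - before) / (interval * 1000.0)

if __name__ == "__main__":
    dev = "nvme0n1"  # example journal device name, adjust to your setup
    print("%s: %.1f%% busy" % (dev, utilisation(dev)))

On NVMe the %util figure understates the real headroom a bit because of the deep parallel queues, but it is a reasonable first-order check against that 25-50% band.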

This set of slides was very interesting when I was planning my latest nodes.

https://indico.cern.ch/event/320819/contributions/742938/attachments/618990/851639/SSD_Benchmarking_at_CERN__HEPiX_Fall_2014.pdf


> 
> 
> 
> > 2. What's your network setup? If you have more than 20Gb/s to that node,
> > your journals will likely become the (write) bottleneck.
> > But that's only the case with backfills or again largish sequential writes
> > of course.
> Currently it’s bonded (LACP) 2x10Gbit for both the front and backend, but soon going to
> upgrade to 4x10Gbit front and 2x100Gbit back. (Already have a test cluster with this setup).
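
On the 20Gb/s remark, the arithmetic behind it is roughly this (again treating ~1GB/s of sequential write per 400GB P3700 as an assumed figure):

# Incoming client bandwidth vs. journal write bandwidth per node.
# nvme_write_gbs is an assumption (~1GB/s sequential write per 400GB P3700).
nvme_write_gbs = 1.0
journals = 2                      # if you drop to 2 journals per node

frontend_gbps = 2 * 10            # current bonded 2x10Gbit front-end
frontend_gbytes = frontend_gbps / 8.0        # ~2.5 GB/s of possible client writes
journal_gbytes = journals * nvme_write_gbs   # ~2.0 GB/s of journal bandwidth

print("front-end ~%.1f GB/s vs journals ~%.1f GB/s" % (frontend_gbytes, journal_gbytes))
# So once sustained write ingest goes much past 20Gbit/s, two journals become
# the limit, which is the point Christian makes above.
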
> 
> 
> > 3. A repeat of sorts of the previous 2 points, this time with the focus on
> > endurance. How much data are you writing per day to an average OSD?
> > With 18 OSDs per 400GB P3700 NVMe you will want that to be less than
> > 223GB/day/OSD.
> 
> We’re growing at around 100TB/month spread over ~130 OSDs at the moment, which gives me ~25GB/OSD/day
> (I wish it were that uniformly distributed :))
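
For anyone checking the endurance maths: the 223GB/day/OSD budget falls out of the 400GB P3700's endurance rating (about 10 drive writes per day, assumed from the spec sheet) split across 18 OSDs, and the growth rate above sits comfortably inside it:

# Endurance sanity check. dwpd is an assumed spec-sheet rating for the
# 400GB P3700; check your exact model before trusting the numbers.
drive_gb = 400
dwpd = 10                       # drive writes per day (assumed)
osds_per_nvme = 18

budget = drive_gb * dwpd / float(osds_per_nvme)
print("journal write budget: ~%.0f GB/day/OSD" % budget)       # ~222

# Zoltan's figures: ~100TB/month across ~130 OSDs
actual = 100e3 / 30.0 / 130
print("current write rate:   ~%.0f GB/day/OSD" % actual)       # ~26
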
> 
> 
> > 4. As usual, failure domains. In the case of an NVMe failure you'll lose
> > twice the amount of OSDs.
> Right, but having a lot of nodes (20+) mitigates this somewhat.
> 
> 
> > That all being said, at 36 OSDs I'd venture you'll run out of CPU steam
> > (with small write IOPS) before your journals become the bottleneck.
> I agree, but that has not been the case so far.
> 
> 
> > Christian
> 
> 
> Cheers,
> Zoltan
> [snip]
> 
> 
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



