PCI-E SSD Journal for SSD-OSD Disks

Hello,

On Thu, 15 May 2014 11:19:04 -0700 Tyler Wilson wrote:

> Hey All,
> 
> Thanks for the quick responses! I have chosen the micron pci-e card due
> to its benchmark results on
> http://www.storagereview.com/micron_realssd_p320h_enterprise_pcie_review .
> Per the vendor the card has
> a 25PB life expectancy so I'm not terribly worried about it failing on me
> too soon :)
> 

Ah yes. 
If I were the maker I'd put that information on my homepage in big
friendly letters. ^o^

> Christian Balzer <chibi at ...> writes:
> 
> >
> > On Wed, 14 May 2014 19:28:17 -0500 Mark Nelson wrote:
> >
> > > On 05/14/2014 06:36 PM, Tyler Wilson wrote:
> > > > Hey All,
> > >
> > > Hi!
> > >
> > > >
> > > > I am setting up a new storage cluster that absolutely must have the
> > > > best read/write sequential speed @ 128k and the highest IOPS at 4k
> > > > read/write as possible.
> > >
> > > I assume random?
> > >
> > > >
> > > > My current specs for each storage node are currently;
> > > > CPU: 2x E5-2670V2
> > > > Motherboard: SM X9DRD-EF
> > > > OSD Disks: 20-30 Samsung 840 1TB
> > > > OSD Journal(s): 1-2 Micron RealSSD P320h
> > > > Network: 4x 10gb, Bridged
> > I assume you mean 2x10Gb bonded for public and 2x10Gb for cluster
> > network?
> >
> > The SSDs you specified would read at about 500MB/s, meaning that only
> > 4 of them would already saturate your network uplink.
> > For writes (assuming journal on SSDs, see below) you reach that point
> > with just 8 SSDs.
> >
> 
> The 4x 10Gb will carry Ceph storage traffic only, with public and management
> on the on-board interfaces.
> This is expandable to 80Gbps if needed.
> 
So how much bandwidth does that leave for client traffic then?
With a single Micron card for journals and its maximum write speed of 1.9GB/s,
that 40Gb/s cluster network is already more than twice as fast as any
replication (really recovery/backfill) traffic you will ever manage to
generate.
Those 40Gb/s for the cluster network only start to make sense with 2 of those
cards.
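
Here is that back-of-the-envelope math as a quick Python sketch; the 1.9GB/s
journal write figure is the one from this thread, the rest is just unit
conversion:

# Cluster network vs. what a journal-limited node can actually absorb.
cluster_net_gbps = 40                        # 4x 10Gb, in Gbit/s
cluster_net_GBps = cluster_net_gbps / 8.0    # ~5 GB/s ceiling, before overhead
micron_write_GBps = 1.9                      # max write speed of one P320h

for cards in (1, 2):
    node_write_GBps = cards * micron_write_GBps
    print(f"{cards} card(s): node absorbs ~{node_write_GBps:.1f} GB/s, "
          f"network delivers ~{cluster_net_GBps:.1f} GB/s "
          f"({cluster_net_GBps / node_write_GBps:.1f}x headroom)")

# 1 card:  ~1.9 GB/s vs ~5.0 GB/s -> the network is ~2.6x faster than the node
# 2 cards: ~3.8 GB/s vs ~5.0 GB/s -> the 40Gb/s starts to earn its keep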

> 
> > > > Memory: 32-96GB depending on need
> > RAM is pretty cheap these days and a large pagecache on the storage
> > nodes is always quite helpful.
> >
> 
> Noted, I wasn't sure how Ceph used the linux memory cache or if it would
> benefit us.
> 
> > > >
> >
> > How many of these nodes are you planning to deploy initially?
> > As always and especially when going for performance, more and smaller
> > nodes tend to be better, also less impact if one goes down.
> > And in your case it is easier to balance storage and network bandwidth,
> > see above.
> >
> 
> 2 storage nodes per location at start; these are serving OpenStack VMs, so
> we'll add more whenever utilization warrants it.
> 
Well, given the guesstimated price tag of these nodes I'm not surprised. ^o^

However, consider a node going down. Read performance will be halved, and
while write performance in this case won't be affected much, the backfilling
when the node comes back will have an impact.
If you can live with that, go ahead.

And while the endurance of the Micron cards is adequate, a single 800GB
Intel DC S3700 has an endurance of nearly 15PB (compared to the 73TB of the
Samsungs). Of course they're likely to be 3 times as expensive, too.
Maybe a DC S3500 would be an acceptable compromise between endurance and
price for you.
At a replication factor of 2 you really need to make sure that your device
quality is top notch and that a failed device is replaced and re-replicated
quickly, since losing the second copy before recovery finishes means data
loss.
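
To put those endurance figures side by side, a quick sketch; the per-device
endurance numbers are the ones quoted in this thread, and the 5-year service
life is simply my assumption for the comparison:

# Rough writes-per-day each device can sustain, assuming a 5-year life.
SERVICE_DAYS = 5 * 365

endurance_tb = {
    "Samsung 840 1TB":      73,        # ~73TB total writes
    "Intel DC S3700 800GB": 15_000,    # ~15PB
    "Micron P320h 350GB":   25_000,    # ~25PB per the vendor
}

for name, tb in endurance_tb.items():
    print(f"{name}: ~{tb * 1000 / SERVICE_DAYS:,.0f} GB of writes per day")

# Samsung 840:   ~40 GB/day (where the 40GB/day warning below comes from)
# DC S3700:      ~8,200 GB/day
# Micron P320h:  ~13,700 GB/day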

In the end you need to crunch the numbers for your use case yourself.
Aside from your budget you will want to look at the total number of
storage SSDs you're deploying and sum up their write capacity (IOPS and
sequential), then divide that by 2 (journal and data sharing the device) and
compare it to your journal device(s) capacity.
24 DC S3700s with journals on the same device will be slightly faster in
IOPS than the Micron card and 3 times as fast when it comes to sequential
writes.
And of course this significantly reduces the impact of a single device
failure (a shared journal on the Micron takes every OSD behind it with it).
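
A sketch of that exercise; the per-device numbers below are ballpark
datasheet-style figures I'm plugging in purely for illustration, not
measurements from your hardware:

# Aggregate write capacity of the storage SSDs, halved because journal and
# data share each device, compared against a single dedicated journal card.
num_osd_ssds          = 24
ssd_seq_write_MBps    = 460      # DC S3700 800GB class, sequential write
ssd_rand_write_iops   = 36_000   # DC S3700 800GB class, 4k random write
micron_seq_write_MBps = 1_900    # 1.9GB/s, as discussed in this thread

aggregate_seq_MBps = num_osd_ssds * ssd_seq_write_MBps / 2
aggregate_iops     = num_osd_ssds * ssd_rand_write_iops / 2

print(f"24 SSDs, shared journals: ~{aggregate_seq_MBps / 1000:.1f} GB/s "
      f"sequential, ~{aggregate_iops:,.0f} 4k write IOPS")
print(f"Single Micron journal:    ~{micron_seq_write_MBps / 1000:.1f} GB/s "
      f"sequential ceiling")
print(f"Ratio: ~{aggregate_seq_MBps / micron_seq_write_MBps:.1f}x in favour "
      f"of shared journals")

# ~5.5 GB/s vs ~1.9 GB/s, i.e. about 3x the sequential write capacity, and no
# single device whose failure takes a dozen OSDs with it.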

Because if you were to have 2 Microns per node and one of them died, 25%
of your cluster would be "down and out", meaning that Ceph would try to
replicate that data onto the surviving half of the OSDs on that node.
This in turn means that unless your storage utilization was below 45% or so,
your cluster would become full and lock up.
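
The arithmetic behind that 45% figure, roughly; this assumes the failure
domain is the host, an even data distribution, and Ceph's default full ratio
of 0.95:

# One of two journal cards dies, taking half of the node's OSDs with it.
# With 2 nodes and size 2, the lost replicas can only be rebuilt on the
# surviving OSDs of the *same* node, so those roughly double their data.
full_ratio    = 0.95   # default mon_osd_full_ratio
growth_factor = 2.0    # surviving half of the node absorbs the dead half's data

max_safe_utilization = full_ratio / growth_factor
print(f"Utilization must stay below ~{max_safe_utilization:.0%} or the "
      f"surviving OSDs hit the full ratio and the cluster locks up")

# ~47% in the ideal case; allow for per-OSD imbalance and you're at 45% or so.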


> > > > Does anyone see any potential bottlenecks in the above specs? What
> > > > kind of improvements or configurations can we make on the OSD config
> > > > side? We are looking to run this with 2 replication.
> > >
> > > Likely you'll run into latency due to context switching and lock
> > > contention in the OSDs and maybe even some kernel slowness. Potentially
> > > you could end up CPU limited too, even with E5-2670s given how fast
> > > all of those SSDs are.  I'd suggest considering a chassis without an
> > > expander backplane and using multiple controllers with the drives
> > > directly attached.
> > >
> >
> > Indeed, I'd be worried about that as well, same with the
> > chassis/controller bit.
> >
> 
> 
> Thanks for the advice on the controller card, we will look into different
> chassis options with the LSI cards recommended in the Inktank docs.
> Would running a different distribution affect this at all? Our target was
> CentOS 6; however, if a more recent kernel would make a difference we could
> switch.
> 
> > > There's work going into improving things on the Ceph side but I don't
> > > know how much of it has even hit our wip branches in github yet.  So for
> > > now ymmv, but there's a lot of work going on in this area as it's
> > > something that lots of folks are interested in.
> > >
> > If you look at the current "Slow IOPS on RBD compared to journal and
> > backing devices" thread and the Inktank document referenced in it
> >
> > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> >
> > you should probably assume no more than 800 random write IOPS and 4000
> > random read IOPS per OSD (4KB block size).
> > That latter number I can also reproduce with my cluster.
> >
> > Now I expect those numbers to go up as Ceph is improved, but for the
> > time being those limits might influence your choice of hardware.
> >
> > > I'd also suggest testing whether or not putting all of the journals
> > > on the RealSSD cards actually helps you that much over just putting
> > > your journals on the other SSDs.  The advantage here is that by
> > > putting journals on the 2.5" SSDs, you don't lose a pile of OSDs if
> > > one of those PCIe cards fails.
> > >
> > More than seconded, I could only find READ values on the Micron site
> > which makes me very suspicious, as the journal's main role is to be
> > able to WRITE as fast as possible. Also all journals combined ought to
> > be faster than your final storage.
> > Lastly there was no endurance data on the Micron site either and with
> > ALL your writes having to go through those devices I'd be dead scared to
> > deploy them.
> >
> > I'd spend that money on the case and controllers as mentioned above and
> > better storage SSDs.
> >
> > I was going to pipe up about the Samsungs, but Mark Kirkwood did beat
> > me to it.
> > Unless you can be 100% certain that your workload per storage SSD
> > doesn't exceed 40GB/day I'd stay very clear of them.
> >
> > Christian
> >
> 
> Would it be possible to have redundant journals in this case? Per
> http://www.storagereview.com/micron_realssd_p320h_enterprise_pcie_review
> the 350GB model has a 25PB life expectancy. On a purely IOPS level, from
> benchmarking with 4k writes the Micron is 25x faster than the Samsung 840s
> we tested with, hence the move to PCI-e journals.
>
The only way I can think of to make the journals redundant would be to
RAID1 them.
Aside from melting your bus and CPUs it would of course also halve their
combined write speed, making this rather moot.
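
A quick illustration of why, using the 1.9GB/s figure from this thread and
treating RAID1 simply as "every journal write lands on both cards":

# Two independent cards, each journaling half the OSDs, vs. the same pair
# mirrored: the mirror only ever delivers the write bandwidth of one card.
micron_write_GBps = 1.9

independent_pair_GBps = 2 * micron_write_GBps
raid1_pair_GBps       = 1 * micron_write_GBps

print(f"Two independent journal cards: ~{independent_pair_GBps:.1f} GB/s")
print(f"Same two cards in RAID1:       ~{raid1_pair_GBps:.1f} GB/s (half)")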
 

Christian

> 
> > > The only other thing I would be careful about is making sure that
> > > your SSDs are good about dealing with power failure during writes.
> > > Not all SSDs behave as you would expect.
> > >
> > > >
> > > > Thanks for your guys assistance with this.
> > >
> > > np, good luck!
> > >
> 
> Thanks again for the responses!


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

