PCI-E SSD Journal for SSD-OSD Disks

On Wed, 14 May 2014 19:28:17 -0500 Mark Nelson wrote:

> On 05/14/2014 06:36 PM, Tyler Wilson wrote:
> > Hey All,
> 
> Hi!
> 
> >
> > I am setting up a new storage cluster that absolutely must have the
> > best read/write sequential speed @ 128k and the highest IOps at 4k
> > read/write as possible.
> 
> I assume random?
> 
> >
> > My current specs for each storage node are currently;
> > CPU: 2x E5-2670V2
> > Motherboard: SM X9DRD-EF
> > OSD Disks: 20-30 Samsung 840 1TB
> > OSD Journal(s): 1-2 Micron RealSSD P320h
> > Network: 4x 10gb, Bridged
I assume you mean 2x10Gb bonded for public and 2x10Gb for cluster network?

The SSDs you specified read at about 500MB/s each, meaning that just 4 of
them would already saturate your network uplink.
For writes (assuming the journals live on the same SSDs, see below) you
reach that point with 8 SSDs, since every write hits each device twice.
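
Back-of-the-envelope, in plain Python (the ~500MB/s per SSD is the estimate
above; the ~2000MB/s of usable bandwidth on a bonded 2x10Gb public link is my
assumption after protocol overhead, as is the write speed):

# Assumed numbers, not measurements.
ssd_read_mb_s = 500        # sequential read per 2.5" SSD (estimate above)
public_usable_mb_s = 2000  # 2x10Gb bonded public link, minus protocol overhead

print(public_usable_mb_s / ssd_read_mb_s)        # -> 4.0 SSDs saturate reads

# With journals co-located on the data SSDs, every client write is written
# twice (journal + data), so effective write bandwidth is roughly halved.
ssd_write_mb_s = 500       # assumed roughly equal to read here
effective_write_mb_s = ssd_write_mb_s / 2.0
print(public_usable_mb_s / effective_write_mb_s) # -> 8.0 SSDs saturate writes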

> > Memory: 32-96GB depending on need
RAM is pretty cheap these days and a large pagecache on the storage nodes
is always quite helpful.

> >

How many of these nodes are you planning to deploy initially?
As always, and especially when going for performance, more and smaller
nodes tend to be better; there is also less impact if one of them goes down.
In your case that also makes it easier to balance storage and network
bandwidth, see above.

> > Does anyone see any potential bottlenecks in the above specs? What kind
> > of improvements or configurations can we make on the OSD config side?
> > We are looking to run this with 2 replication.
> 
> Likely you'll run into latency due to context switching and lock 
> contention in the OSDs and maybe even some kernel slowness.  Potentially 
> you could end up CPU limited too, even with E5-2670s given how fast all 
> of those SSDs are.  I'd suggest considering a chassis without an 
> expander backplane and using multiple controllers with the drives 
> directly attached.
> 

Indeed, I'd be worried about that as well, same with the
chassis/controller bit.

> There's work going into improving things on the Ceph side but I don't 
> know how much of it has even hit our wip branches in github yet.  So for 
> now ymmv, but there's a lot of work going on in this area as it's 
> something that lots of folks are interested in.
> 
If you look at the current "Slow IOPS on RBD compared to journal and
backing devices" thread and the Inktank document referenced in it

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf 

you should probably assume no more than 800 random write IOPS and 4000
random read IOPS per OSD (4KB block size). 
That latter number I can also reproduce with my cluster.

Now I expect those numbers to go up as Ceph is improved, but for the time
being those limits might influence your choice of hardware.
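
To translate that into per-node numbers for the hardware above, a rough
sketch (the per-OSD limits are the ones from the document, 25 OSDs is the
middle of your 20-30 range, replication 2 as you stated, and this ignores
CPU and journal overhead):

osds_per_node  = 25      # middle of the 20-30 range
write_iops_osd = 800     # random 4KB writes per OSD (from the document)
read_iops_osd  = 4000    # random 4KB reads per OSD
replication    = 2       # each client write becomes 2 OSD writes

print(osds_per_node * read_iops_osd)                 # -> 100000 4KB reads/s
print(osds_per_node * write_iops_osd / replication)  # -> 10000.0 4KB writes/s

Treat those as optimistic ceilings for client-visible IOPS; replica writes
are of course spread across the whole cluster, but the aggregate math works
out the same.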

> I'd also suggest testing whether or not putting all of the journals on 
> the RealSSD cards actually helps you that much over just putting your 
> journals on the other SSDs.  The advantage here is that by putting 
> journals on the 2.5" SSDs, you don't lose a pile of OSDs if one of those 
> PCIE cards fails.
> 
More than seconded. I could only find READ values on the Micron site, which
makes me very suspicious, as a journal's main job is to be able to WRITE as
fast as possible. Also, all journals combined ought to be faster than your
final storage.
Lastly, there was no endurance data on the Micron site either, and with ALL
your writes having to go through those devices I'd be dead scared to deploy
them.
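
To put a number on the "all journals combined" point (again assumed figures:
~400MB/s sustained writes per data SSD and the 1-2 cards you listed):

data_ssds      = 25     # middle of the 20-30 range
ssd_write_mb_s = 400    # assumed sustained write per data SSD
journal_cards  = 2

# Sustained write rate each PCIe card would need just to keep up:
print(data_ssds * ssd_write_mb_s / journal_cards)  # -> 5000.0 MB/s per card

# Failure domain if one of those cards dies:
print(data_ssds / journal_cards)                   # -> 12.5, half the node's OSDs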

I'd spend that money on the chassis and controllers mentioned above and on
better storage SSDs.

I was going to pipe up about the Samsungs, but Mark Kirkwood beat me to it.
Unless you can be 100% certain that your workload per storage SSD doesn't
exceed 40GB/day, I'd steer well clear of them.
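
To make that 40GB/day figure concrete (my assumptions: a 3-year service life
and journals co-located on the same SSDs, which doubles the device writes):

gb_per_day = 40                  # endurance budget per storage SSD
years      = 3                   # assumed service life
journal_amplification = 2        # journal + data on the same device

print(gb_per_day * 365 * years / 1000.0)   # -> 43.8 TB of host writes total
print(gb_per_day / journal_amplification)  # -> ~20 GB/day of actual client data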

Christian

> The only other thing I would be careful about is making sure that your 
> SSDs are good about dealing with power failure during writes.  Not all 
> SSDs behave as you would expect.
> 
> >
> > Thanks for your guys assistance with this.
> 
> np, good luck!


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

