Re: PCIe journal benefit for SSD OSDs

Hello,

On Thu, 7 Sep 2017 08:03:31 +0200 Stefan Priebe - Profihost AG wrote:

> Hello,
> Am 07.09.2017 um 03:53 schrieb Christian Balzer:
> > 
> > Hello,
> > 
> > On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
> >   
> >> We are planning a Jewel filestore based cluster for a performance
> >> sensitive healthcare client, and the conservative OSD choice is
> >> Samsung SM863A.
> >>  
> > 
> > While I totally see where you're coming from, and despite having stated
> > that I'll give Luminous and Bluestore some time to mature, I'd also be
> > looking into that if I were in the planning phase now, with like 3 months
> > before deployment.
> > The inherent performance increase with Bluestore (and having something
> > that hopefully won't need touching/upgrading for a while) shouldn't be
> > ignored.   
> 
> Yes, and that's the point where I'm at currently as well. Thinking about
> how to design a new cluster based on bluestore.
> 
> > The SSDs are fine, I've been starting to use those recently (though not
> > with Ceph yet) as Intel DC S36xx or 37xx are impossible to get.
> > They're a bit slower in the write IOPS department, but good enough for me.  
> 
> I've never used the Intel DC ones but always the Samsung. Are the Intel
> really faster?
I don't have any setup right now where I could directly compare them
(different HW, controllers, kernel and fio versions), but at least on
paper a 200GB DC S3700 (unobtainium) with 32K random 4K IOPS (confirmed
in tests) looks a lot better than the 240GB SM863A with 10K IOPS.

I dug into my archives and for a wheezy (3.16 kernel) system I found
results that had the above DC S3700 at 32K IOPS as per the specs and an
845DC EVO 960GB at 12K write IOPS, also as expected from the specs.

On newer HW with recent kernels (4.9) and a recent fio (I suspect the latter)
things have changed to the point that the same fio command line as in the
old tests now gives me results of over 70K IOPS for both 400GB DC S3710s
and 960GB SM863A, both way higher than the specs.
It seems to be basically CPU/IRQ bound at that point, leading me to
believe that "--direct=1" no longer means the same thing.
Adding "--sync=1" to the fio command things become more sane, but are still
odd and partially higher than expected.

Make of that what you will.
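
For reference, the kind of fio run I'm talking about is a single-threaded
4K sync write test along these lines (illustrative only, not necessarily
the exact parameters I used back then):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

"--direct=1" alone only bypasses the page cache, while "--sync=1" opens
the device with O_SYNC and forces each write to be stable before the next
one is issued, which is much closer to how a filestore journal behaves.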

> Have you disabled the FLUSH command for the Samsung ones?
How would one do that?
And since they supposedly have full power loss protection, why wouldn't that
be the default?

> They don't skip the command automatically like the Intel ones do. Sadly
> the Samsung SM863 got more expensive over the last months. They were a
> lot cheaper in the first months of 2016. Maybe the 2.5" Optane Intel
> SSDs will change the game.
> 
The Optane offerings right now leave me rather unimpressed at 65K write
IOPS and 290MB/s write speed for their best (32GB) model.
Not a fit for filestore journals given the write speed and not much of
one for the DB part of Bluestore either.

> >> but was wondering if anyone has seen a positive
> >> impact from also using PCIe journals (e.g. Intel P3700 or even the
> >> older 910 series) in front of such SSDs?
> >>  
> > NVMe journals (or WAL and DB space for Bluestore) are nice and can
> > certainly help, especially if Ceph is tuned accordingly.
> > Avoid non DC NVMes, I doubt you can still get 910s, they are officially
> > EOL.
> > You want to match capabilities and endurances, a DC P3700 800GB would be
> > an OK match for 3-4 SM863a 960GB for example.   
> 
> That's a good point but makes the cluster more expensive. Currently,
> while using filestore, I use one SSD for journal and data, which works
> fine.
> 
Inline is fine if it fits your use case and the reduction in endurance
is factored in and/or compensated for.
I do the same with DC S3610s (very similar to the SM863As) on my
cache-tier nodes. 
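
To put rough numbers on the endurance matching mentioned above (spec
sheet figures from memory, so please verify against the data sheets):
a DC P3700 800GB is rated at roughly 10 DWPD (~14.6 PBW over 5 years),
a SM863a 960GB at roughly 3.6 DWPD (~6.2 PBW). With filestore a shared
journal device sees the combined write volume of all the OSDs behind it,
so for example:

  3 x 6.2 PBW (OSD SSDs) ~= 18.6 PB  vs.  14.6 PBW (journal NVMe)

i.e. one P3700 800GB in front of 3-4 SM863a wears somewhat faster than
the OSDs themselves, which is about where the sensible limit sits.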

> With bluestore we have block, DB and WAL, so we need 3 block devices
> per OSD. If we need one PCIe or NVMe device per 3-4 OSDs it gets much
> more expensive per host - currently running 10 OSDs / SSDs per node.
> 
Well, the OP was asking for performance, so price obviously goes up.
If you're running SSD OSDs you can put all 3 on the same device and should
be no worse off than before with filestore. 
Keep in mind that small writes also get "journaled" on the DB part, so
the double writes (and thus endurance) may not improve depending on your
write patterns.
Something really fast for the WAL would likely help, but I have zero
experience and very few written reports here to base that on.
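
For completeness, with Luminous' ceph-disk the two layouts being
discussed would look roughly like this (flags as I remember them from
the Luminous docs, so double-check before use):

  # block, DB and WAL all on the one SSD
  ceph-disk prepare --bluestore /dev/sdb

  # data on the SSD, DB and WAL carved out of a shared NVMe
  ceph-disk prepare --bluestore --block.db /dev/nvme0n1 \
      --block.wal /dev/nvme0n1 /dev/sdb

As far as I know the DB/WAL partition sizes are taken from
bluestore_block_db_size and bluestore_block_wal_size in ceph.conf, so
you'd want those set before provisioning if you go the separate device
route.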

> Have you already done tests on how the performance changes with
> bluestore when putting all 3 block devices on the same SSD?
> 
Nope, and given my test clusters, it's likely going to be a while before I
do anything with Bluestore on SSDs, never mind NVMes (of which I have none
as nothing we do requires them at this point). 

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


