Re: Changing SSD Landscape

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dan van der Ster
> Sent: 18 May 2017 09:30
> To: Christian Balzer <chibi@xxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Changing SSD Landscape
> 
> On Thu, May 18, 2017 at 3:11 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> > On Wed, 17 May 2017 18:02:06 -0700 Ben Hines wrote:
> >
> >> Well, Ceph journals are of course going away with the imminent Bluestore.
> > Not really, in many senses.
> >
> 
> But we should expect far fewer writes to pass through RocksDB and its WAL, right? So perhaps lower-endurance flash will be
> usable.

It depends. I flagged up an issue in Bluestore where client write latency on spinners was tied to the underlying disk's latency. Sage has introduced a new deferred-write feature, which uses a double-write strategy similar to Filestore's: the write first goes into the WAL, where it gets coalesced, and is then written out to the disk. The deferred writes are tunable, in that you can say, for example, only defer writes up to 128kB. But if you want the same write latency you see in Filestore, you will incur increased SSD wear to match it.
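
For reference, the knob in question is the deferred-write size threshold. Something like the following in ceph.conf expresses the trade-off (a sketch only -- the values are examples and the defaults differ between releases, so check the docs for whatever version you deploy):

  [osd]
  # Writes smaller than this on HDD-backed Bluestore OSDs are deferred:
  # committed to the RocksDB WAL first, acknowledged, then replayed to
  # the data device later, much like Filestore's double write.
  bluestore_prefer_deferred_size_hdd = 131072   # e.g. defer up to 128kB
  # On pure-SSD OSDs deferral buys little, so it can be left small or off.
  bluestore_prefer_deferred_size_ssd = 0

The larger the threshold, the more client writes get acknowledged at WAL speed, and the more data the WAL device absorbs (and wears) in return.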

> 
> BTW, you asked about Samsung parts earlier. We are running these SM863's in a block storage cluster:
> 
> Model Family:     Samsung based SSDs
> Device Model:     SAMSUNG MZ7KM240HAGR-0E005
> Firmware Version: GXM1003Q
> 
>   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9971
> 177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       2195
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       701300549904
> 242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       20421265
> 251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       1148921417736
> 
> The problem is that I don't know how to see how many writes have gone through these drives.
> Total_LBAs_Written appears to be bogus -- it's based on time. It matches exactly the 3.6 DWPD spec'd for that model:
>   3.6 DWPD * 240GB * (9971 hours / 24) = 358.95TB
>   701300549904 LBAs * 512 Bytes/LBA = 359.06TB
> 
> If we trust Wear_Leveling_Count then we're only dropping 6% in a year
> -- these should be good.
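
For what it's worth, here is that arithmetic laid out in plain Python, using the raw values above. The 512-byte LBA size is the same assumption your own conversion makes, and treating attribute 251 (NAND_Writes) as the same 512-byte units is my assumption, not something Samsung documents clearly:

  poh       = 9971            # attr   9 Power_On_Hours
  host_lbas = 701300549904    # attr 241 Total_LBAs_Written
  nand_lbas = 1148921417736   # attr 251 NAND_Writes
  lba_bytes = 512             # assumed LBA size
  dwpd, cap_gb = 3.6, 240     # spec'd endurance and capacity

  host_tb = host_lbas * lba_bytes / 1e12        # ~359 TB implied by attr 241
  spec_tb = dwpd * cap_gb * (poh / 24) / 1e3    # ~359 TB of DWPD budget so far
  nand_tb = nand_lbas * lba_bytes / 1e12        # ~588 TB if attr 251 is 512B units
  print(f"host writes (241):  {host_tb:.1f} TB")
  print(f"DWPD budget so far: {spec_tb:.1f} TB")  # matching host_tb is why 241 looks bogus
  print(f"NAND writes (251):  {nand_tb:.1f} TB -> WAF ~ {nand_tb / host_tb:.2f}")

If 251 really is NAND-side sectors, the implied write amplification of ~1.6 looks plausible, and that counter (plus Wear_Leveling_Count) is probably the one to watch rather than 241.
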
> 
> But maybe they're EOL anyway?
> 
> Cheers, Dan
> 
> >> Are small SSDs still useful for something with Bluestore?
> >>
> > Of course: the WAL and the other RocksDB bits -- read up on it.
> >
> > On top of that there is the potential to improve things further with
> > the likes of bcache.
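
To make that point concrete for Ben: under Bluestore a small SSD still carries the RocksDB DB and WAL for each HDD-backed OSD, much as a Filestore journal partition does today. A rough sketch of the sizing knobs the provisioning tool reads when it carves out those partitions (option names quoted from the docs as I remember them, so verify against your release; the values are purely illustrative):

  [osd]
  # Per-OSD partition sizes for the RocksDB DB and WAL when they live
  # on a separate (small) SSD in front of an HDD data device.
  bluestore_block_db_size  = 32212254720   # e.g. ~30GB DB per OSD
  bluestore_block_wal_size = 2147483648    # e.g. ~2GB WAL per OSD

So a 200-400GB SSD fronting a handful of spinners remains a sensible node shape, even without Filestore journals.
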
> >
> >> For speccing out a cluster today that is many (6+) months away from
> >> being required, which I am going to be doing, I was thinking all-SSD
> >> would be the way to go. (Or is all-spinner performant with
> >> Bluestore?) Too early to make that call?
> >>
> > Your call (and funeral) with regard to all spinners, depending on your
> > needs.
> > Under the very best of circumstances Bluestore could double your IOPS,
> > but there are other factors at play, and most people who NEED SSD
> > journals now would want something with SSDs under Bluestore as well.
> >
> > If you're planning to actually deploy an (entirely) Bluestore cluster
> > in production with mission-critical data before next year, you're a
> > lot braver than me.
> > An early-adoption scheme with Bluestore nodes in their own failure
> > domain (rack) would be the most I could see myself doing in my
> > generic cluster.
> > The 2 mission-critical production clusters are (or will be) frozen,
> > most likely.
> >
> > Christian
> >
> >> -Ben
> >>
> >> On Wed, May 17, 2017 at 5:30 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>
> >> >
> >> > Hello,
> >> >
> >> > On Wed, 17 May 2017 11:28:17 +0200 Eneko Lacunza wrote:
> >> >
> >> > > Hi Nick,
> >> > >
> >> > > El 17/05/17 a las 11:12, Nick Fisk escribió:
> >> > > > There seems to be a shift in enterprise SSD products towards
> >> > > > larger, less write-intensive products, generally costing more than
> >> > > > the existing P/S 3600/3700 ranges did. For example, the new Intel
> >> > > > NVMe P4600 range seems to start at 2TB. Although I mention Intel
> >> > > > products, this seems to be the general outlook across all
> >> > > > manufacturers. This presents some problems for acquiring SSDs for
> >> > > > Ceph journal/WAL use if your cluster is largely write-only and
> >> > > > wouldn't benefit from using the extra capacity brought by these
> >> > > > SSDs as cache.
> >> > > >
> >> > > > Is anybody in the same situation and struggling to find good P3700
> >> > > > 400G replacements?
> >> > > >
> >> > > We usually build tiny Ceph clusters, with 1Gbit networking and
> >> > > S3610/S3710 200GB SSDs for journals. We have been experiencing
> >> > > supply problems for those disks lately, although it seems that
> >> > > 400GB disks are available, at least for now.
> >> > >
> >> > This. Very much THIS.
> >> >
> >> > We've been trying to get 200, 400 or even 800GB DC S3710s or S3610s
> >> > here recently, with zero success.
> >> > And for a change I believe our vendor that it's not their fault.
> >> >
> >> > What seems to be happening (no official confirmation, but it makes
> >> > all the sense in the world to me) is this:
> >> >
> >> > Intel is trying to switch to 3D NAND (like they did with the 3520s),
> >> > but while not having officially EOL'ed the 3(6/7)10s, they have also
> >> > allowed the supply to run dry.
> >> >
> >> > Which of course is not a smart move, because people are now forced en
> >> > masse to look for alternatives, and if those work out they are
> >> > unlikely to come back.
> >> >
> >> > I'm looking at oversized Samsungs (base model equivalent to 3610s)
> >> > and am following this thread for other alternatives.
> >> >
> >> > Christian
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >> >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



