Re: Changing SSD Landscape

Reed Dier <reed.dier@xxxxxxxxxxx> · Thu, 18 May 2017 09:39:19 -0500

BTW, you asked about Samsung parts earlier. We are running these
SM863's in a block storage cluster:

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7KM240HAGR-0E005
Firmware Version: GXM1003Q

177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       2195

The problem is that I don't know how to see how many writes have gone
through these drives.

But maybe they're EOL anyway?

Cheers, Dan

I have SM863a 1.9T’s in an all SSD pool.

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7KM1T9HMJP-00005

The easiest way to read the number of ‘drive writes’ is the Wear_Leveling_Count (WLC, ID 177) attribute. The ‘Value’ column is the normalized percentage of life remaining, counting down from 100, and the ‘raw value’ is the average Program/Erase cycle count, aka your drive writes.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1758
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       7

So for the drive in question, the NAND has on average been fully written 7 times.
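
As a rough sketch of reading those two columns programmatically, the snippet below parses text in the layout that smartctl printed above. The function name parse_smart_attributes and the sample string are only illustrative, and the "percent used" figure assumes the normalized value counts down from 100 as described.

```python
# Sketch: pull normalized and raw values out of `smartctl -A`-style text.
# SAMPLE is copied from the output quoted in this message.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1758
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       7
"""


def parse_smart_attributes(text):
    """Return {id: {"name", "value", "raw"}} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # this skips the header line and anything else.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),  # normalized value
            "raw": int(fields[9]),    # vendor-specific raw value
        }
    return attrs


attrs = parse_smart_attributes(SAMPLE)
wlc = attrs[177]
# Assumption: the normalized value counts down from 100.
print(f"{wlc['name']}: {100 - wlc['value']}% used, raw = {wlc['raw']}")
```

How the raw value is defined is up to the vendor, so treat it as a P/E cycle count only where the vendor documents it that way.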

The 1.9T SM863 is rated at 12.32 PBW, with a warranty period of 5 years, so ~3.6 DWPD, or ~6,500 drive writes for the total life of the drive.

Now your drive shows 2,195 PE cycles, which would be about 33% of the total PE cycles it's rated for. I’m guessing that some of the NAND may have higher PE cycles than others, and the raw value reported may be the max value, rather than the average.
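
For anyone wanting to redo that arithmetic, here is a small sketch. The function names are made up for illustration, and it assumes the "1.9T" model has 1.92 TB of usable capacity and that 1 PB = 1000 TB.

```python
# Sketch: convert a PBW endurance rating into drive writes and DWPD.
# Assumptions: 1 PB = 1000 TB, 365-day years, 1.92 TB for the "1.9T" model.

def rated_drive_writes(pbw, capacity_tb):
    """Full-drive writes covered by the endurance rating."""
    return pbw * 1000 / capacity_tb


def dwpd(pbw, capacity_tb, warranty_years):
    """Drive writes per day over the warranty period."""
    return rated_drive_writes(pbw, capacity_tb) / (warranty_years * 365)


writes = rated_drive_writes(12.32, 1.92)
per_day = dwpd(12.32, 1.92, 5)
# Share of rated cycles consumed if the raw value 2195 is a P/E cycle count.
used_pct = 2195 / writes * 100

print(f"rated drive writes: {writes:.0f}")
print(f"DWPD: {per_day:.2f}")
print(f"2195 cycles is {used_pct:.1f}% of the rating")
```

This lands close to the rounded figures above; the small differences come from rounding the drive-write total to ~6,500.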

Intel reports the min/avg/max on their drives using isdct.

$ sudo isdct show -smart ad -intelssd 0

- SMART Attributes PHMD_400AGN -
- AD -
AverageEraseCycles : 256
Description : Wear Leveling Count
ID : AD
MaximumEraseCycles : 327
MinimumEraseCycles : 188
Normalized : 98
Raw : 1099533058236

This is a P3700, one of the oldest in use. So this one has used ~2% of its life expectancy, and some NAND has seen ~75% more PE cycles than others.
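
The min/max spread can be pulled out of that output with a few lines. This is only a sketch against the text pasted above; parse_isdct is a hypothetical helper, not part of the Intel tool.

```python
# Sketch: read the "Key : Value" lines from the isdct output quoted above
# and compare the most-worn and least-worn NAND.
SAMPLE = """\
- SMART Attributes PHMD_400AGN -
- AD -
AverageEraseCycles : 256
Description : Wear Leveling Count
ID : AD
MaximumEraseCycles : 327
MinimumEraseCycles : 188
Normalized : 98
Raw : 1099533058236
"""


def parse_isdct(text):
    """Return a dict of the 'Key : Value' pairs, values kept as strings."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # banner lines such as "- AD -" have no separator
            pairs[key.strip()] = value.strip()
    return pairs


smart = parse_isdct(SAMPLE)
lo = int(smart["MinimumEraseCycles"])
hi = int(smart["MaximumEraseCycles"])
spread_pct = (hi - lo) / lo * 100
life_used_pct = 100 - int(smart["Normalized"])

print(f"max is {spread_pct:.0f}% above min, {life_used_pct}% of life used")
```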

Would be curious what the raw value for Samsung is reporting, but that's an easy way to gauge drive writes.

Reed

On May 18, 2017, at 3:30 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

On Thu, May 18, 2017 at 3:11 AM, Christian Balzer <chibi@xxxxxxx> wrote:
On Wed, 17 May 2017 18:02:06 -0700 Ben Hines wrote:

Well, ceph journals are of course going away with the imminent bluestore.
Not really, in many senses.

But we should expect far fewer writes to pass through the RocksDB and
its WAL, right? So perhaps lower endurance flash will be usable.

BTW, you asked about Samsung parts earlier. We are running these
SM863's in a block storage cluster:

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7KM240HAGR-0E005
Firmware Version: GXM1003Q

  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9971
177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       2195
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       701300549904
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       20421265
251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       1148921417736

The problem is that I don't know how to see how many writes have gone
through these drives.
Total_LBAs_Written appears to be bogus -- it's based on time. It
matches almost exactly the 3.6 DWPD spec'd for that model:
 3.6 DWPD * 240GB * (9971 hours / 24 hours/day) = 358.95TB
 701300549904 LBAs * 512 Bytes/LBA = 359.06TB
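
The same comparison as a quick sketch, so it can be rerun against other drives. The helper names are illustrative; it assumes 512-byte LBAs and decimal units (1 TB = 10^12 bytes), as in the figures above.

```python
# Sketch: compare what the drive reports in Total_LBAs_Written with what
# writing at exactly the rated DWPD for every power-on hour would give.
# Assumptions: 512-byte LBAs, decimal units (1 TB = 1000 GB = 1e12 bytes).

def spec_writes_tb(dwpd, capacity_gb, power_on_hours):
    """TB written if the drive took exactly `dwpd` drive writes per day."""
    days = power_on_hours / 24
    return dwpd * capacity_gb * days / 1000


def lbas_to_tb(lbas, lba_bytes=512):
    """Convert an LBA count to TB."""
    return lbas * lba_bytes / 1e12


expected = spec_writes_tb(3.6, 240, 9971)
reported = lbas_to_tb(701300549904)
diff_pct = abs(reported - expected) / expected * 100

print(f"expected at spec: {expected:.2f} TB")
print(f"reported:         {reported:.2f} TB")
print(f"difference:       {diff_pct:.3f}%")
```

A difference this small between a time-based estimate and the reported counter is what makes the attribute look synthetic rather than measured.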

If we trust Wear_Leveling_Count then we're only dropping 6% in a year
-- these should be good.

But maybe they're EOL anyway?

Cheers, Dan

Are small SSDs still useful for something with Bluestore?

Of course, the WAL and other bits for the rocksdb, read up on it.

On top of that is the potential to improve things further with things
like bcache.

For speccing out a cluster today that is 6+ months away from being
required, which I am going to be doing, I was thinking all-SSD would be the
way to go. (Or is all-spinner performant with Bluestore?) Too early to make
that call?

Your call and funeral with regards to all spinners (depending on your
needs).
Bluestore at the very best of circumstances could double your IOPS, but
there are other factors at play and most people who NEED SSD journals now
would want something with SSDs in Bluestore as well.

If you're planning to actually deploy a (entirely) Bluestore cluster in
production with mission critical data before next year, you're a lot
braver than me.
An early adoption scheme with Bluestore nodes being in their own failure
domain (rack) would be the best I could see myself doing in my generic
cluster.
For the 2 mission critical production clusters, they are (will be) frozen
most likely.

Christian

-Ben

On Wed, May 17, 2017 at 5:30 PM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Wed, 17 May 2017 11:28:17 +0200 Eneko Lacunza wrote:

Hi Nick,

El 17/05/17 a las 11:12, Nick Fisk escribió:
There seems to be a shift in enterprise SSD products to larger less
write intensive products and generally costing more than what
the existing P/S 3600/3700 ranges were. For example the new Intel NVME
P4600 range seems to start at 2TB. Although I mention Intel
products, this seems to be the general outlook across all
manufacturers. This presents some problems for acquiring SSD's for Ceph
journal/WAL use if your cluster is largely write only and wouldn't
benefit from using the extra capacity brought by these SSD's to
use as cache.

Is anybody in the same situation and is struggling to find good P3700
400G replacements?

We usually build tiny ceph clusters, with 1 gbit network and S3610/S3710
200GB SSDs for journals. We have been experiencing supply problems for
those disks lately, although it seems that 400GB disks are available, at
least for now.

This. Very much THIS.

We're trying to get 200 or 400 or even 800GB DC S3710 or S3610s here
recently with zero success.
And I'm believing our vendor for a change that it's not their fault.

What seems to be happening (no official confirmation, but it makes all the
sense in the world to me) is this:

Intel is trying to switch to 3DNAND (like they did with the 3520s), but
while not having officially EOL'ed the 3(6/7)10s, they have also allowed
the supply to run dry.

Which of course is not a smart move, because now people are forced en
masse to look for alternatives and, if those work, are unlikely to come back.

I'm looking at oversized Samsungs (base model equivalent to 3610s) and am
following this thread for other alternatives.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
