Experience with 5k RPM/archive HDDs

> Op 18 februari 2017 om 17:03 schreef rick stehno <rs350z at me.com>:
> 
> 
> I work for Seagate and have done over a hundred tests using SMR 8TB disks in a cluster. Whether an SMR HDD is the best choice depends entirely on your access pattern. Remember that SMR HDDs don't perform well on random writes, but are excellent for reads and sequential writes.
> I have run many tests where I added an SSD or PCIe flash card to hold the journals, and the SMR setup performed better than a typical CMR disk while being cheaper overall than using all CMR HDDs. You can also use some form of caching, like a Ceph cache tier, with very good results.
> By placing the journals on flash or adopting some form of caching you eliminate the double writes to the SMR HDD, and performance should be fine. I have test results if you would like to see them.

I am really keen on seeing those numbers. The blog post I wrote ( https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/ ) is based on two occasions where people bought 6TB and 8TB Seagate SMR disks and used them in Ceph.

One use case was an application writing natively to RADOS, the other was CephFS.

On both occasions the journals were on SSD, but the backing disk would still be saturated very easily. Ceph still does random writes on the disk for things like updating PGLogs, writing new OSDMaps, etc.
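
For reference, "journals on SSD" here means the usual FileStore layout: data on the big disk, journal on a partition of the flash device. Set up, that looks roughly like this (device names are only examples):

    # data on the SMR disk, journal on a partition of the SSD
    ceph-disk prepare /dev/sdb /dev/sdc1
    ceph-disk activate /dev/sdb1

    # ceph.conf
    [osd]
    osd journal size = 10240    # MB

So the journal itself is fine; it is the data disk underneath that cannot keep up.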

A large sequential write into Ceph might be split up by either CephFS or RBD into smaller writes to various RADOS objects.
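
To put some numbers on that: both RBD and CephFS stripe over 4 MiB RADOS objects by default, so a single large sequential client write fans out over many objects and therefore many disks:

    1 GiB sequential write / 4 MiB per object = 256 object writes
    256 object writes x 3 replicas            = 768 writes across the cluster

From the point of view of an individual SMR disk that is no longer a nice sequential stream.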

I haven't seen a use case where SMR disks perform 'OK' at all with Ceph. That's why my advice is still to stay away from those disks for Ceph.

In both cases my customers had to spend a lot of money on new disks to make it work. The first case was actually somebody who bought 1000 SMR disks and then found out they didn't work with Ceph.

Wido 

> 
> Rick 
> Sent from my iPhone, please excuse any typing errors.
> 
> > On Feb 17, 2017, at 8:49 PM, Mike Miller <millermike287 at gmail.com> wrote:
> > 
> > Hi,
> > 
> > don't go there, we tried this with SMR drives, which slow down to somewhere around 2-3 IOPS during backfilling/recovery, and that renders the cluster useless for client IO. Things might change in the future, but for now I would strongly recommend against SMR.
> > 
> > Go for normal SATA drives with only slightly higher price/capacity ratios.
> > 
> > - mike
> > 
> >> On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:
> >> On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
> >> <ceph-users-bounces at lists.ceph.com on behalf of wido at 42on.com> wrote:
> >>> 
> >>>> Op 3 februari 2017 om 11:03 schreef Maxime Guyot
> >>>> <Maxime.Guyot at elits.com>:
> >>>> 
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> Interesting feedback!
> >>>> 
> >>>>  > In my opinion the SMR can be used exclusively for the RGW.
> >>>>  > Unless it's something like a backup/archive cluster or pool with
> >>>>  > little to no concurrent R/W access, you're likely to run out of
> >>>>  > IOPS (again) long before filling these monsters up.
> >>>> 
> >>>> That's exactly the use case I am considering those archive HDDs for:
> >>>> something like AWS Glacier, a form of offsite backup, probably via
> >>>> radosgw. The classic Seagate enterprise-class HDDs provide 'too much'
> >>>> performance for this use case; I could live with 1/4 of the performance
> >>>> for that price point.
> >>>> 
> >>> 
> >>> If you go down that route I suggest that you make a mixed cluster for RGW.
> >>> 
> >>> A (small) set of OSDs running on top of proper SSDs, eg Samsung SM863 or
> >>> PM863 or a Intel DC series.
> >>> 
> >>> All pools by default should go to those OSDs.
> >>> 
> >>> Only the RGW buckets data pool should go to the big SMR drives. However,
> >>> again, expect very, very low performance of those disks.
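
(Side note on the mixed-cluster setup described above: this is usually done by putting the SMR hosts under their own CRUSH root and pointing only the bucket data pool at a rule that uses that root. The names and rule id below are just examples:

    # assuming the SMR OSDs have been moved under a CRUSH root named 'smr'
    ceph osd crush rule create-simple smr_rule smr host

    # look up the new rule's id with 'ceph osd crush rule dump', then
    # point only the RGW bucket data pool at it
    ceph osd pool set default.rgw.buckets.data crush_ruleset 1

Every other pool keeps the default ruleset and stays on the SSD-backed OSDs.)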
> >> One of the other concerns you should think about is recovery time when one
> >> of these drives fails.  The more OSDs you have, the less of an issue this
> >> becomes, but on a small cluster it might take over a day to fully recover
> >> from an OSD failure, which is a long time to have degraded PGs.
> >> Bryan
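
To put a rough number on that recovery-time concern: when an 8TB OSD fails, roughly its used capacity has to be re-replicated onto the surviving disks. If a small cluster can sustain, say, 100 MB/s of recovery traffic without hurting client IO too much, that is on the order of:

    8 TB / 100 MB/s ~= 80,000 seconds ~= 22 hours

and with SMR disks on the receiving end the sustained rate will likely be a lot lower than that.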
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

