Re: SMR disks go 100% busy after ~15 minutes

> Op 13 februari 2017 om 15:57 schreef Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx>:
> 
> 
> Then you're not aware of what SMR disks do. They are just slow for
> all writes: because of the overlapping tracks, they have to read the
> surrounding tracks and then write them all back again, instead of just
> the one thing you really wanted to write. To partially mitigate this,
> they have a small write buffer, something like 8GB of flash, which they
> use for "normal" speed; when it's full, you crawl (at least this is
> what the Seagate ones do). Journals aren't designed to solve that...
> they help absorb the sync load on the OSD, but don't somehow make the
> sustained throughput higher. Even if the journal were perfectly
> designed for performance, it would still do absolutely nothing if it's
> full and the disk is still busy flushing the old data.
> 

Well, that indeed explains it. I wasn't aware of the additional buffer inside an SMR disk.

I was asked to look at this system for somebody who bought SMR disks without knowing. As I never touch these disks, I found the behavior odd. The buffer explains it a lot better.

SMR disks shouldn't be used with Ceph without proper SMR support in BlueStore or an SMR-aware XFS.
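For what it's worth, Peter's description (fast while the small flash buffer absorbs writes, then a crawl once the drive has to rewrite whole shingled zones) can be sketched as a toy model. The numbers below (8GB cache, 120 MB/s cached speed, 10 MB/s sustained rewrite speed) are illustrative assumptions, not measured values for any particular drive:

```python
# Toy model of a drive-managed SMR disk: writes land in a small media
# cache at full speed; once the cache is exhausted, throughput collapses
# to the slow read-modify-write speed of the shingled zones.
# All three constants are assumptions for illustration only.

CACHE_GB = 8          # assumed size of the drive's write cache
FAST_MBS = 120        # assumed cached write speed (MB/s)
SLOW_MBS = 10         # assumed sustained rewrite speed (MB/s)

def throughput(written_gb: float) -> int:
    """Effective write speed after `written_gb` of sustained writes."""
    return FAST_MBS if written_gb < CACHE_GB else SLOW_MBS

def time_to_write(total_gb: float) -> float:
    """Seconds to write `total_gb` of data under this model."""
    fast_part = min(total_gb, CACHE_GB)
    slow_part = max(0.0, total_gb - CACHE_GB)
    return fast_part * 1024 / FAST_MBS + slow_part * 1024 / SLOW_MBS

print(throughput(1))    # 120 -> looks like a normal disk at first
print(throughput(20))   # 10  -> cache exhausted, disk pinned at 100% busy
print(round(time_to_write(16) / 60, 1))  # 14.8 -> minutes-scale stall
```

Under these assumed numbers the shape matches the reported behavior: the benchmark runs fine for a while, then the disks sit at 100% busy indefinitely because they can never drain the cache faster than the cluster refills it.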

Wido

> 
> On 02/13/17 15:49, Wido den Hollander wrote:
> > Hi,
> >
> > I have an odd case with SMR disks in a Ceph cluster. Before I continue: yes, I am fully aware that SMR and Ceph don't play along well, but something is happening here that I'm not able to fully explain.
> >
> > On a 2x replica cluster with 8TB Seagate SMR disks I can write with about 30MB/sec to each disk using a simple RADOS bench:
> >
> > $ rados bench -p <pool> 60 write -t 1
> > $ time rados -p <pool> put 1GB.bin 1GB.bin
> >
> > Either way I found that the disks can sustain roughly that write rate.
> >
> > Now, when I start a benchmark with 32 threads it writes fine. Not super fast, but it works.
> >
> > After 15 minutes or so various disks go to 100% busy and just stay there. These OSDs are then marked down, and some even commit suicide due to threads timing out.
> >
> > Stopping the RADOS bench and starting the OSDs again resolves the situation.
> >
> > I am trying to explain what's happening. I'm aware that SMR isn't very good at random writes. To partially overcome this there are Intel DC S3510s in there as journal SSDs.
> >
> > Can anybody explain why this 100% busy pops up after 15 minutes or so?
> >
> > Obviously it would be best if BlueStore had SMR support, but for now it's just FileStore with XFS on there.
> >
> > Wido
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 