Re: Growing an SSD cluster with different disk sizes

Hello,


On Mon, 19 Mar 2018 10:39:02 -0400 Mark Steffen wrote:

> At the moment I'm just testing things out and have no critical data on
> Ceph.  I'm using some Intel DC S3510 drives at the moment; these may not be
> optimal but I'm just trying to do some testing and get my feet wet with
> Ceph (I last tried it out with 9 OSDs on 2TB spinners about 4 years ago).
>
At 1 DWPD these are most likely _not_ a good fit for anything but the most
read-heavy, write-light types of cluster.
Get the SMART values for them, do some realistic and extensive testing, get
the SMART values again and then extrapolate.
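
For the extrapolation itself, a minimal Python sketch; the 275 TBW rating
and the host-writes numbers below are made-up placeholders, plug in
whatever your drives' SMART counters actually report:

def days_until_worn_out(written_before_tb, written_after_tb, test_hours, rated_tbw_tb):
    """Extrapolate remaining drive life from the write rate seen in a test."""
    tb_per_hour = (written_after_tb - written_before_tb) / test_hours
    remaining_tb = rated_tbw_tb - written_after_tb
    return remaining_tb / tb_per_hour / 24.0

# Example numbers only: 1.2 TB of host writes during a 24h burn-in on a
# drive with 40 TB written so far and a (made-up) 275 TBW rating.
print("%.0f days of headroom" % days_until_worn_out(40.0, 41.2, 24, 275))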

> I had experimented with some of the Crucial M500 240GB drives in a
> relatively high volume LAMP stack server in a RAID5 configuration that ran
> for about 4 years with a fairly heavy load (WordPress sites and all that)
> and no issues.  
Less than 0.2 DWPD; one guesses that either it was not very write-heavy at
all or their warranted endurance is very conservative.
But the latter is not something you can bank on, either with regard to
your data safety or to getting replacement SSDs, of course.

> Other than 3x the number of writes and heavy IO during a
> rebalance, is Ceph "harder" on an SSD than regular RAID would be?  I'm not
> using these in a cache tier, so a lot of the data that gets written to them
> in many cases will "stay" on the drives for some time.
> 
If you have small writes, they will get "journaled" in the WAL/DB,
akin to the journal with filestore, so depending on your use case you may
see up to a 2x amplification.
Of course any write will also cause (at least) one other write to
RocksDB, but that's more or less on par with plain filesystem journals and
their metadata.
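
To put rough numbers on how that compounds with replication, a
back-of-the-envelope Python sketch; the client load and the 2x deferral
factor are assumptions for illustration, while the OSD count and size
match the cluster in this thread:

REPLICATION = 3              # size=3 pool: every client write hits 3 OSDs
DEFERRAL = 2.0               # worst case for small, deferred (WAL/DB) writes
CLIENT_TB_PER_DAY = 0.5      # hypothetical aggregate client write load
OSD_COUNT = 16               # 4 hosts x 4 OSDs
OSD_SIZE_TB = 0.48           # 480 GB drives

per_osd_tb = CLIENT_TB_PER_DAY * REPLICATION * DEFERRAL / OSD_COUNT
print("%.3f TB/day per OSD => %.2f DWPD needed"
      % (per_osd_tb, per_osd_tb / OSD_SIZE_TB))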

Christian

> Ok, try to keep the amount of storage (TB) the same per failure
> domain/host, but I should aim to be using 1TB drives that are twice as fast
> (to help with IO balance) if I'm mixing drive sizes on the same server (if
> I have high IO load, which TBH I really don't and don't expect to).
> Understood, thank you!
> 
> *Mark Steffen*
> *"Don't believe everything you read on the Internet." -Abraham Lincoln*
> 
> 
> 
> On Mon, Mar 19, 2018 at 7:11 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Sun, 18 Mar 2018 10:59:15 -0400 Mark Steffen wrote:
> >  
> > > Hello,
> > >
> > > I have a Ceph newb question I would appreciate some advice on
> > >
> > > Presently I have 4 hosts in my Ceph cluster, each with 4 480GB eMLC drives
> > > in them.  These 4 hosts have 2 more empty slots each.
> > >  
> > A lot of the answers would become clearer and more relevant if you could
> > tell us foremost the exact SSD models (old and new) and the rest of the
> > cluster HW config (controllers, network).
> >
> > When I read 480GB, the only DC-level SSDs with 3 DWPD are Samsungs; those 3
> > DWPD may or may not be sufficient for your use case, of course.
> >
> > I frequently managed to wear out SSDs more during testing and burn-in (i.e.
> > several RAID rebuilds) than in a year of actual usage.
> > A full data rebalance with Ceph (or more than one, depending on how
> > you bring those new SSDs and hosts online) is a significant write storm.
> >  
> > > Also, I have some new servers that could also become hosts in the cluster
> > > (I deploy Ceph in a 'hyperconverged' configuration with KVM hypervisor; I
> > > find that I usually tend to run out of disk and RAM before I run out of CPU
> > > so why not make the most of it, at least for now).
> > >
> > > The new hosts have only 4 available drive slots each (there are 3 of them).
> > >
> > > Am I ok (since these are SSDs I doubt the major IO bottleneck that I
> > > undoubtedly would see with spinners) to just go ahead and add an
> > > additional two 1TB drives to each of the first 4 hosts, as well as put
> > > 4 x 1TB SSDs in the 3 new hosts?  This would give each host a similar
> > > amount of storage, though an unequal number of OSDs each.
> > >  
> > Some SSDs tend to react much worse to being written to at full speed than
> > others, so tuning Ceph to not use all bandwidth might still be a good idea.
> >  
> > > Since the failure domain is by host, and the OSDs are SSDs (with 1TB drives
> > > typically being faster than 480GB drives anyway), is this reasonable?  Or do
> > > I really need to keep the configuration identical across the board and just
> > > add additional 480GB drives to the new hosts and have it all match?
> > >  
> > Larger SSDs are not always faster (i.e. have more parallelism) than smaller
> > ones, hence the question about your exact models.
> >
> > Having differently sized OSDs is not a problem per se, but it needs a full
> > understanding of what is going on.
> > Your larger OSDs will see twice the action; are they
> > a) really twice as fast, or
> > b) is your load never going to be an issue anyway?
> >
> > Christian
> >  
> > > I'm also using Luminous/Bluestore if it matters.
> > >
> > > Thanks in advance!
> > >
> > > *Mark Steffen*
> > > *"Don't believe everything you read on the Internet." -Abraham Lincoln*  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
> >  


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


