Hello,

On Wed, 13 Aug 2014 14:55:29 +0100 James Eckersall wrote:

> Hi Christian,
>
> Most of our backups are rsync or robocopy (windows), so they are
> incremental file-based backups.
> There will be a high level of parallelism as the backups run mostly
> overnight with similar start times.

In that case more memory will also be helpful, hopefully preventing a lot
of thrashing by keeping most if not all hot objects in the pagecache of
the storage node.

> So far I've seen high iowait on the samba head we are using, but low osd
> resource usage, suggesting that the bottleneck is within samba (ad
> lookups most likely).
> I'll be able to test it better when I've built some iscsi heads.
>
> Looking at the hardware configuration guide, it suggests 1GHz per core,
> so maybe we would be okay if we had two hex core procs (with trusty
> hyperthreading) per storage server.
>
According to the guide I shouldn't be seeing something like this (that's
the aforementioned storage node with 4 journal SSDs and 8 actual OSDs,
backed by 8 3.1GHz cores):
---
ATOP - ceph-01        2014/08/14 14:57:46        f-----        5s elapsed
PRC | sys    8.31s | user  26.89s | #proc    214 | #tslpi  1091 | #tslpu    10 | #zombie    0 | #exit      0 |
CPU | sys     142% | user    533% | irq      23% | idle     77% | wait     26% | curf 3.00GHz | curscal  96% |
cpu | sys      20% | user     50% | irq      22% | idle      2% | cpu000 w  5% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     70% | irq       0% | idle     10% | cpu004 w  3% | curf 2.70GHz | curscal  87% |
cpu | sys      17% | user     69% | irq       0% | idle     11% | cpu006 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      19% | user     68% | irq       0% | idle      8% | cpu001 w  6% | curf 2.70GHz | curscal  87% |
cpu | sys      17% | user     69% | irq       0% | idle     11% | cpu005 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     69% | irq       0% | idle     12% | cpu002 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     69% | irq       0% | idle     10% | cpu003 w  4% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     67% | irq       0% | idle     12% | cpu007 w  3% | curf 3.10GHz | curscal 100% |
[snip]
  PID  RUID  EUID  THR  SYSCPU  USRCPU   VGROW   RGROW  RDDSK   WRDSK  ST  EXC  S  CPUNR   CPU  CMD
 3113  root  root  102   1.15s   3.88s   3156K   6996K     0K  63980K  --    -  S      0  102%  ceph-osd
 4706  root  root  114   1.12s   3.85s   1060K     32K     0K  61956K  --    -  S      7  101%  ceph-osd
 5056  root  root  106   1.01s   3.69s   1068K      8K     0K  60720K  --    -  S      7   95%  ceph-osd
 3846  root  root  102   0.95s   3.53s   1056K     16K     0K  56220K  --    -  S      2   91%  ceph-osd
 3353  root  root  106   1.10s   3.20s   1068K      0K     0K  53640K  --    -  S      2   87%  ceph-osd
 3597  root  root  106   0.78s   3.35s  -4900K  -1796K     0K  59912K  --    -  S      7   84%  ceph-osd
 4390  root  root  114   0.77s   2.71s   1076K     28K     0K  43796K  --    -  S      2   71%  ceph-osd
 4094  root  root  110   0.74s   2.63s   1060K     12K     0K  41996K  --    -  S      2   68%  ceph-osd
---
In short, only 28% of the 800% total is spent waiting for I/O, while about
7 cores are gone to Ceph and assorted system tasks, purely for
computational needs (as also witnessed by the CPUs being ramped up to full
speed).

This happens when I do a:
"fio --size=800M --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4k --iodepth=64"
from a VM to an rbd image.

FYI, the HDDs at that time are at 90-100% utilization, the SSDs at about
20%.

And funnily enough, the IOPS result of that fio run is about 6500, which
is very close to the "Ceph write IOPS barrier per OSD" of about 800 that I
previously saw with another cluster and mentioned below (8 OSDs x 800 is
roughly 6400).

You will want to analyze your IOPS needs and load behavior very carefully.
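
To make that arithmetic explicit, here is a minimal back-of-the-envelope
sketch (the ~800 write IOPS per OSD ceiling and its roughly linear scaling
across the 8 OSDs are assumptions drawn from the observations above, not
hard limits):
---
#!/usr/bin/env python
# Rough sanity check of the fio result above.
# Assumption: each plain (non-RAID) OSD tops out at roughly 800 4k write
# IOPS before the ceph-osd processes saturate the CPU, and that ceiling
# scales more or less linearly with the number of OSDs in the node.

OSDS = 8                  # OSDs in the storage node shown in the atop output
IOPS_PER_OSD = 800        # observed per-OSD write IOPS "barrier"
MEASURED = 6500           # IOPS reported by the 4k randwrite fio run

expected = OSDS * IOPS_PER_OSD
print("expected ceiling: ~%d IOPS, fio measured: ~%d IOPS"
      % (expected, MEASURED))
# -> expected ceiling: ~6400 IOPS, fio measured: ~6500 IOPS
---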
If it's something like the load above, several hundred IOPS per OSD,
you're likely going to run out of steam with 48 OSDs, even with high-end
CPUs.
If it's somewhere in the middle of the road, a second CPU (and more
memory) will do the trick. Poor journal SSDs, though. ^o^

If on the other hand you can get away with about 6400 write IOPS for your
current four node setup, I'd recommend keeping the single CPU, adding more
memory, doing 4x 11-disk RAID6 per node (4 hot-spare HDDs), thus 4 OSDs
per node, and setting your replication to 2 instead of 3 (a quick sizing
sketch for that layout is at the very bottom of this mail).

> I think we are not completely averse to using different hardware, but
> really not wanting to waste the 4 chassis we already have as none of this
> kit is throwaway money.
> Plus I don't like the idea of mixing different hardware across OSD nodes.
>
Totally agreed on that, vastly differing hardware, especially when it
comes to performance, makes things ugly.

Christian

> J
>
> On 13 August 2014 14:06, Christian Balzer <chibi at gol.com> wrote:
>
> > On Wed, 13 Aug 2014 12:47:22 +0100 James Eckersall wrote:
> >
> > > Hi Christian,
> > >
> > > We're actually using the following chassis:
> > > http://rnt.de/en/bf_xxlarge.html
> > >
> > Ah yes, one of the Backblaze heritage.
> > But rather more well designed and thought through than most of them.
> >
> > Using the motherboard SATA3 controller for the journal SSDs may be
> > advantageous, something to try out with a new/spare machine.
> >
> > > So yes there are SAS expanders. There are 4 expanders, one is used
> > > for the SSD's and the other three are for the SATA drives.
> > > The 4 SSD's for the OSD's are mounted at the back of the chassis,
> > > along with the OS SSD's.
> > > We're currently planning to buy a couple more servers with 3700's,
> > > but now we're debating whether these chassis are actually right for
> > > us. The density is pretty nice with 48x3.5" in 4U, but I think CPU
> > > spec falls short.
> > When going with classic Ceph, yes.
> >
> > > We can spec them up to dual cpu's, but I'm not sure even that would
> > > be enough for 48 OSD's.
> > When looking at:
> >
> > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> >
> > it is possible.
> > Especially if there are long, sequential writes.
> >
> > > So far I haven't been really taxing these storage servers - the
> > > space is being presented via samba and I think there is a big
> > > bottleneck there, so we're planning to move to iscsi instead.
> > > We have over 200 servers backing up mostly web content (millions of
> > > small files).
> > >
> > So you're doing more of an rsync/copy operation than using an
> > actual backup software like bacula?
> > Having 200 servers scribble individual files, potentially with high
> > levels of parallelism, is another story altogether compared to a few
> > bacula streams.
> >
> > Christian
> >
> > >
> > > J
> > >
> > >
> > > On 13 August 2014 10:28, Christian Balzer <chibi at gol.com> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 13 Aug 2014 09:15:34 +0100 James Eckersall wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm looking for some advice on my ceph cluster.
> > > > >
> > > > > The current setup is as follows:
> > > > >
> > > > > 3 mon servers
> > > > >
> > > > > 4 storage servers with the following spec:
> > > > >
> > > > > 1x Intel Xeon E5-2640 @2.50GHz 6 core (12 with hyperthreading).
> > > > > 64GB DDR3 RAM
> > > > > 2x SSDSC2BB080G4 for OS
> > > > >
> > > > > LSI MegaRAID 9260-16i with the following drives:
> > > > 24 drives on a 16 port controller?
> > > > I suppose your chassis backplanes are using port expanders then?
> > > > How is this all connected up?
> > > > It would be very beneficial if the journal SSDs had their own
> > > > controller or at least full bandwidth paths.
> > > >
> > > > > 4 x SSDSC2CW240A3 SSD for OSD journals (5 OSD journals per SSD)
> > > > People here will comment on the fact that Intel 520s are not power
> > > > failure safe.
> > > > I'll add to that that depending on the amount of data you're going
> > > > to write to that cluster during its lifetime they might not be
> > > > cheaper than DC S3700s either.
> > > > You will definitely want to keep an eye on the SMART output of
> > > > those, when the Media_Wearout_Indicator reaches 0 they will
> > > > supposedly totally brick themselves, whereas the DC models will
> > > > "just" go into R/O mode.
> > > >
> > > > > 20 x Seagate ST4000NM0023 (3.5" 4TB SATA)
> > > > >
> > > > >
> > > > > The storage servers are 4U with 48 x 3.5" drive bays, which
> > > > > currently only contain 20 drives.
> > > > >
> > > > Where are the journal SSDs then? The OS drives I can see being
> > > > internal (or next to the PSU as with some Supermicro cases).
> > > >
> > > > > I'm looking for the best way to populate these chassis more.
> > > > >
> > > > > From what I've read about ceph requirements, I might not have
> > > > > the CPU power to add another 24 OSD's to each chassis, so I've
> > > > > been considering whether to RAID6 the OSD drives instead.
> > > > >
> > > > You would want to add just 20 OSDs and 4 more journal SSDs. ^o^
> > > > And yes, depending on your workload you would be pushing the
> > > > envelope with your current configuration at times already.
> > > > For an example, with lots of small write (4KB) IOs (fio or rados
> > > > bench) I can push my latest storage node to nearly exhaust its CPU
> > > > resources (and yes, that's actual CPU cycles for the OSD processes,
> > > > not waiting for IO).
> > > >
> > > > That node consists of:
> > > > 1x Opteron 4386 (3.1GHz, 8 cores)
> > > > 32GB RAM
> > > > 4x Intel DC S3700 (100GB) on local SATA for journal and OS
> > > > 8x TOSHIBA DT01ACA300 (3TB) for OSD filestore
> > > >
> > > > Of course if writing large blobs like with the default 4MB of rados
> > > > bench or things like bonnie++ the load is considerably less.
> > > >
> > > > > Does anyone have any experience they can share with running
> > > > > OSD's on RAID6?
> > > > Look at recent threads like "Optimal OSD Configuration for 45
> > > > drives?" and "anti-cephalopod question" or scour older threads by
> > > > me.
> > > >
> > > > > Or can anyone comment on whether the CPU I have will cope with
> > > > > ~48 OSD's?
> > > > >
> > > > Even with "normal" load I'd be worried putting 40 OSDs on that poor
> > > > CPU. When OSDs can't keep up with heartbeats from the MONs and
> > > > other OSDs, things go to hell in a handbasket very quickly.
> > > >
> > > > > This ceph cluster is being used for backups (windows and linux
> > > > > servers), so I'm not looking for "out of this world" speed, but
> > > > > obviously I don't want a snail either.
> > > > >
> > > > Well, read the above threads, but your use case looks very well
> > > > suited for RAID6-backed OSDs.
> > > >
> > > > Something like 4 RAID6 with 10 HDDs and 4 global hot spares if I
> > > > understand your chassis correctly. One journal SSD per OSD.
> > > >
> > > > You won't be doing more than 800 write IOPS per OSD, but backups
> > > > means long sequential writes in my book and for those it will be
> > > > just fine.
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com        Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > >
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com        Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Fusion Communications
http://www.gol.com/
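
PS: For the 4x 11-disk RAID6 layout suggested further up, a minimal sizing
sketch (all inputs are assumptions taken from this thread: 4TB data disks,
2 parity disks per array, 4 nodes, replication size 2, and the rough ~800
write IOPS per OSD ceiling; none of these are measurements):
---
#!/usr/bin/env python
# Back-of-the-envelope capacity and write IOPS for:
# 4 nodes, 4x 11-disk RAID6 per node (plus 4 hot spares), one OSD per
# RAID6 array, pool size (replication) 2. All figures are assumptions
# from the discussion above.

NODES = 4
ARRAYS_PER_NODE = 4        # one OSD per RAID6 array
DISKS_PER_ARRAY = 11       # RAID6: 2 parity disks per array
DISK_TB = 4                # 4TB SATA drives
REPLICATION = 2
IOPS_PER_OSD = 800         # rough write IOPS ceiling per OSD

data_disks = DISKS_PER_ARRAY - 2
raw_tb = NODES * ARRAYS_PER_NODE * data_disks * DISK_TB
usable_tb = raw_tb // REPLICATION
cluster_write_iops = NODES * ARRAYS_PER_NODE * IOPS_PER_OSD // REPLICATION

print("filestore capacity: %d TB, usable at size=%d: ~%d TB"
      % (raw_tb, REPLICATION, usable_tb))
print("ballpark cluster write IOPS: ~%d" % cluster_write_iops)
# -> filestore capacity: 576 TB, usable at size=2: ~288 TB
# -> ballpark cluster write IOPS: ~6400
---
That IOPS figure lines up with the ~6400 write IOPS mentioned for the
current four node setup above.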