Hello,

On Wed, 13 Aug 2014 14:55:29 +0100 James Eckersall wrote:

> Hi Christian,
>
> Most of our backups are rsync or robocopy (windows), so they are
> incremental file-based backups.
> There will be a high level of parallelism as the backups run mostly
> overnight with similar start times.

In that case more memory will also be helpful, hopefully preventing a lot
of thrashing by keeping most if not all hot objects in the pagecache of
the storage node.

> So far I've seen high iowait on the samba head we are using, but low osd
> resource usage, suggesting that the bottleneck is within samba (ad
> lookups most likely).
> I'll be able to test it better when I've built some iscsi heads.
>
> Looking at the hardware configuration guide, it suggests 1GHz per core,
> so maybe we would be okay if we had two hex core procs (with trusty
> hyperthreading) per storage server.
>
According to the guide I shouldn't be seeing something like this (that's
the aforementioned storage node with 4 journal SSDs and 8 actual OSDs,
backed by 8 3.1GHz cores):
---
ATOP - ceph-01        2014/08/14 14:57:46        f-----        5s elapsed
PRC | sys    8.31s | user  26.89s | #proc    214 | #tslpi  1091 | #tslpu    10 | #zombie    0 | #exit      0 |
CPU | sys     142% | user    533% | irq      23% | idle     77% | wait     26% | curf 3.00GHz | curscal  96% |
cpu | sys      20% | user     50% | irq      22% | idle      2% | cpu000 w  5% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     70% | irq       0% | idle     10% | cpu004 w  3% | curf 2.70GHz | curscal  87% |
cpu | sys      17% | user     69% | irq       0% | idle     11% | cpu006 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      19% | user     68% | irq       0% | idle      8% | cpu001 w  6% | curf 2.70GHz | curscal  87% |
cpu | sys      17% | user     69% | irq       0% | idle     11% | cpu005 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     69% | irq       0% | idle     12% | cpu002 w  2% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     69% | irq       0% | idle     10% | cpu003 w  4% | curf 3.10GHz | curscal 100% |
cpu | sys      17% | user     67% | irq       0% | idle     12% | cpu007 w  3% | curf 3.10GHz | curscal 100% |
[snip]
  PID  RUID  EUID  THR  SYSCPU  USRCPU   VGROW   RGROW  RDDSK   WRDSK  ST  EXC  S  CPUNR   CPU  CMD
 3113  root  root  102   1.15s   3.88s   3156K   6996K     0K  63980K  --    -  S      0  102%  ceph-osd
 4706  root  root  114   1.12s   3.85s   1060K     32K     0K  61956K  --    -  S      7  101%  ceph-osd
 5056  root  root  106   1.01s   3.69s   1068K      8K     0K  60720K  --    -  S      7   95%  ceph-osd
 3846  root  root  102   0.95s   3.53s   1056K     16K     0K  56220K  --    -  S      2   91%  ceph-osd
 3353  root  root  106   1.10s   3.20s   1068K      0K     0K  53640K  --    -  S      2   87%  ceph-osd
 3597  root  root  106   0.78s   3.35s  -4900K  -1796K     0K  59912K  --    -  S      7   84%  ceph-osd
 4390  root  root  114   0.77s   2.71s   1076K     28K     0K  43796K  --    -  S      2   71%  ceph-osd
 4094  root  root  110   0.74s   2.63s   1060K     12K     0K  41996K  --    -  S      2   68%  ceph-osd
---
In short, only 28% of the 800% total is spent waiting for I/O, while about
7 cores are gone to Ceph and assorted system tasks, purely for
computational needs (as also witnessed by the CPUs being ramped up to full
speed).

This happens when I do a:
"fio --size=800M --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4k --iodepth=64"
from a VM to an rbd image.

FYI, the HDDs at that time are at 90-100% utilization, the SSDs at about
20%.

And funnily enough, the IOPS result of that fio run is about 6500, which
is very close to the "Ceph write IOPS barrier per OSD" of about 800 that I
previously saw with another cluster and mentioned below (8 OSDs x 800 is
roughly 6400).

You will want to analyze your IOPS needs and load behavior very carefully.
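
To make that arithmetic explicit, here is a minimal back-of-the-envelope
sketch (the ~800 write IOPS per OSD ceiling and its roughly linear scaling
across the 8 OSDs are assumptions drawn from the observations above, not
hard limits):
---
#!/usr/bin/env python
# Rough sanity check of the fio result above.
# Assumption: each plain (non-RAID) OSD tops out at roughly 800 4k write
# IOPS before the ceph-osd processes saturate the CPU, and that ceiling
# scales more or less linearly with the number of OSDs in the node.

OSDS = 8                  # OSDs in the storage node shown in the atop output
IOPS_PER_OSD = 800        # observed per-OSD write IOPS "barrier"
MEASURED = 6500           # IOPS reported by the 4k randwrite fio run

expected = OSDS * IOPS_PER_OSD
print("expected ceiling: ~%d IOPS, fio measured: ~%d IOPS"
      % (expected, MEASURED))
# -> expected ceiling: ~6400 IOPS, fio measured: ~6500 IOPS
---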
If it's something like the load above, several hundred IOPS per OSD,
you're likely going to run out of steam with 48 OSDs, even with high-end
CPUs.
If it's somewhere in the middle of the road, a second CPU (and more
memory) will do the trick. Poor journal SSDs, though. ^o^

If on the other hand you can get away with about 6400 write IOPS for your
current four node setup, I'd recommend keeping the single CPU, adding more
memory, doing 4x 11-disk RAID6 per node (4 hot-spare HDDs), thus 4 OSDs
per node, and setting your replication to 2 instead of 3 (a quick sizing
sketch for that layout is at the very bottom of this mail).

> I think we are not completely averse to using different hardware, but
> really not wanting to waste the 4 chassis we already have as none of this
> kit is throwaway money.
> Plus I don't like the idea of mixing different hardware across OSD nodes.
>
Totally agreed on that, vastly differing hardware, especially when it
comes to performance, makes things ugly.

Christian

> J
>
> On 13 August 2014 14:06, Christian Balzer <chibi at gol.com> wrote:
>
> > On Wed, 13 Aug 2014 12:47:22 +0100 James Eckersall wrote:
> >
> > > Hi Christian,
> > >
> > > We're actually using the following chassis:
> > > http://rnt.de/en/bf_xxlarge.html
> > >
> > Ah yes, one of the Backblaze heritage.
> > But rather more well designed and thought through than most of them.
> >
> > Using the motherboard SATA3 controller for the journal SSDs may be
> > advantageous, something to try out with a new/spare machine.
> >
> > > So yes there are SAS expanders. There are 4 expanders, one is used
> > > for the SSD's and the other three are for the SATA drives.
> > > The 4 SSD's for the OSD's are mounted at the back of the chassis,
> > > along with the OS SSD's.
> > > We're currently planning to buy a couple more servers with 3700's,
> > > but now we're debating whether these chassis are actually right for
> > > us. The density is pretty nice with 48x3.5" in 4U, but I think CPU
> > > spec falls short.
> > When going with classic Ceph, yes.
> >
> > > We can spec them up to dual cpu's, but I'm not sure even that would
> > > be enough for 48 OSD's.
> > When looking at:
> >
> > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> >
> > it is possible.
> > Especially if there are long, sequential writes.
> >
> > > So far I haven't been really taxing these storage servers - the
> > > space is being presented via samba and I think there is a big
> > > bottleneck there, so we're planning to move to iscsi instead.
> > > We have over 200 servers backing up mostly web content (millions of
> > > small files).
> > >
> > So you're doing more of an rsync/copy operation than using an
> > actual backup software like bacula?
> > Having 200 servers scribble individual files, potentially with high
> > levels of parallelism, is another story altogether compared to a few
> > bacula streams.
> >
> > Christian
> >
> > >
> > > J
> > >
> > >
> > > On 13 August 2014 10:28, Christian Balzer <chibi at gol.com> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 13 Aug 2014 09:15:34 +0100 James Eckersall wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm looking for some advice on my ceph cluster.
> > > > >
> > > > > The current setup is as follows:
> > > > >
> > > > > 3 mon servers
> > > > >
> > > > > 4 storage servers with the following spec:
> > > > >
> > > > > 1x Intel Xeon E5-2640 @2.50GHz 6 core (12 with hyperthreading).
> > > > > 64GB DDR3 RAM
> > > > > 2x SSDSC2BB080G4 for OS
> > > > >
> > > > > LSI MegaRAID 9260-16i with the following drives:
> > > > 24 drives on a 16 port controller?
> > > > I suppose your chassis backplanes are using port expanders then?
> > > > How is this all connected up?
> > > > It would be very beneficial if the journal SSDs had their own
> > > > controller or at least full bandwidth paths.
> > > >
> > > > > 4 x SSDSC2CW240A3 SSD for OSD journals (5 OSD journals per SSD)
> > > > People here will comment on the fact that Intel 520s are not power
> > > > failure safe.
> > > > I'll add to that that depending on the amount of data you're going
> > > > to write to that cluster during its lifetime they might not be
> > > > cheaper than DC S3700s either.
> > > > You will definitely want to keep an eye on the SMART output of
> > > > those, when the Media_Wearout_Indicator reaches 0 they will
> > > > supposedly totally brick themselves, whereas the DC models will
> > > > "just" go into R/O mode.
> > > >
> > > > > 20 x Seagate ST4000NM0023 (3.5" 4TB SATA)
> > > > >
> > > > >
> > > > > The storage servers are 4U with 48 x 3.5" drive bays, which
> > > > > currently only contain 20 drives.
> > > > >
> > > > Where are the journal SSDs then? The OS drives I can see being
> > > > internal (or next to the PSU as with some Supermicro cases).
> > > >
> > > > > I'm looking for the best way to populate these chassis more.
> > > > >
> > > > > From what I've read about ceph requirements, I might not have
> > > > > the CPU power to add another 24 OSD's to each chassis, so I've
> > > > > been considering whether to RAID6 the OSD drives instead.
> > > > >
> > > > You would want to add just 20 OSDs and 4 more journal SSDs. ^o^
> > > > And yes, depending on your workload you would be pushing the
> > > > envelope with your current configuration at times already.
> > > > For an example, with lots of small write (4KB) IOs (fio or rados
> > > > bench) I can push my latest storage node to nearly exhaust its CPU
> > > > resources (and yes, that's actual CPU cycles for the OSD processes,
> > > > not waiting for IO).
> > > >
> > > > That node consists of:
> > > > 1x Opteron 4386 (3.1GHz, 8 cores)
> > > > 32GB RAM
> > > > 4x Intel DC S3700 (100GB) on local SATA for journal and OS
> > > > 8x TOSHIBA DT01ACA300 (3TB) for OSD filestore
> > > >
> > > > Of course if writing large blobs like with the default 4MB of rados
> > > > bench or things like bonnie++ the load is considerably less.
> > > >
> > > > > Does anyone have any experience they can share with running
> > > > > OSD's on RAID6?
> > > > Look at recent threads like "Optimal OSD Configuration for 45
> > > > drives?" and "anti-cephalopod question" or scour older threads by
> > > > me.
> > > >
> > > > > Or can anyone comment on whether the CPU I have will cope with
> > > > > ~48 OSD's?
> > > > >
> > > > Even with "normal" load I'd be worried putting 40 OSDs on that poor
> > > > CPU. When OSDs can't keep up with heartbeats from the MONs and
> > > > other OSDs, things go to hell in a handbasket very quickly.
> > > >
> > > > > This ceph cluster is being used for backups (windows and linux
> > > > > servers), so I'm not looking for "out of this world" speed, but
> > > > > obviously I don't want a snail either.
> > > > >
> > > > Well, read the above threads, but your use case looks very well
> > > > suited for RAID6-backed OSDs.
> > > >
> > > > Something like 4 RAID6 with 10 HDDs and 4 global hot spares if I
> > > > understand your chassis correctly. One journal SSD per OSD.
> > > >
> > > > You won't be doing more than 800 write IOPS per OSD, but backups
> > > > means long sequential writes in my book and for those it will be
> > > > just fine.
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com        Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > >
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com        Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Fusion Communications
http://www.gol.com/
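
PS: For the 4x 11-disk RAID6 layout suggested further up, a minimal sizing
sketch (all inputs are assumptions taken from this thread: 4TB data disks,
2 parity disks per array, 4 nodes, replication size 2, and the rough ~800
write IOPS per OSD ceiling; none of these are measurements):
---
#!/usr/bin/env python
# Back-of-the-envelope capacity and write IOPS for:
# 4 nodes, 4x 11-disk RAID6 per node (plus 4 hot spares), one OSD per
# RAID6 array, pool size (replication) 2. All figures are assumptions
# from the discussion above.

NODES = 4
ARRAYS_PER_NODE = 4        # one OSD per RAID6 array
DISKS_PER_ARRAY = 11       # RAID6: 2 parity disks per array
DISK_TB = 4                # 4TB SATA drives
REPLICATION = 2
IOPS_PER_OSD = 800         # rough write IOPS ceiling per OSD

data_disks = DISKS_PER_ARRAY - 2
raw_tb = NODES * ARRAYS_PER_NODE * data_disks * DISK_TB
usable_tb = raw_tb // REPLICATION
cluster_write_iops = NODES * ARRAYS_PER_NODE * IOPS_PER_OSD // REPLICATION

print("filestore capacity: %d TB, usable at size=%d: ~%d TB"
      % (raw_tb, REPLICATION, usable_tb))
print("ballpark cluster write IOPS: ~%d" % cluster_write_iops)
# -> filestore capacity: 576 TB, usable at size=2: ~288 TB
# -> ballpark cluster write IOPS: ~6400
---
That IOPS figure lines up with the ~6400 write IOPS mentioned for the
current four node setup above.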