> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> Sent: 31 May 2016 12:51
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
>
> Hi Nick,
>
> well, as it seems, you have a point.
>
> I have now started processes to make 3 of 4 cores 100% busy.
>
> The %wa dropped to 2.5-12% without scrubbing, but also ~0% idle time.

Thanks for testing that out. I was getting worried I had ordered the wrong kit!!!

> While it is ~20-70% wa without those 3 of 4 cores being 100% busy, with ~40-50% idle time.
>
> ---
>
> That means that %wa does not harm the CPU (considerably).
>
> I don't dare to start a new scrubbing again now at daytime.

Are you possibly getting poor performance during scrubs due to disk contention rather than CPU?

> So those numbers are missing right now.
>
> In the very end it means that you should be fine with your 12x HDD.
>
> So in the very end, the more cores you have, the more GHz in sum you get ( 4x 4 GHz E3 vs. 12-24x 2.4 GHz E5 ), and this way the more OSDs you can run without the CPU being the bottleneck. While a higher frequency means higher per-task performance ( above all for writes ).

Yes, that was my plan: get good write latency, but still be able to scale easily/cheaply by using the E3's. Also keep in mind that with too many cores, they will start scaling down their frequency to save power if they are not kept busy. If a process gets assigned to a sleeping, clocked-down core, it takes a while for it to boost back up. I found this could cause a 10-30% hit in performance.

FYI, here are the tests I ran last year: http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/

I actually found I could get better than 1600 iops if I ran a couple of CPU stress tests to keep the cores running at turbo speed. I also tested with the journal on a RAM disk to eliminate the SSD speed from the equation and managed to get near 2500 iops, which is pretty good for QD=1.

> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:info@xxxxxxxxxxxxxxxxx
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> On 31.05.2016 13:05, Nick Fisk wrote:
> > Hi Oliver,
> >
> > Thanks for this, very interesting and relevant to me at the moment as your two hardware platforms mirror exactly my existing and new cluster.
> >
> > Just a couple of comments inline.
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> >> Sent: 31 May 2016 11:26
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>
> >> Hi Nick,
> >>
> >> we have running here on one node:
> >>
> >> E3-1225v5 ( 4x 3.3 GHz ), 32 GB RAM ( newceph1 )
> >>
> >> which looks like this: http://pastebin.com/btVpeJrE
> >>
> >> and we have running here on another node:
> >>
> >> 2x E5-2620v3 ( 12x 2.4 GHz + HT units ), 64 GB RAM ( new-ceph2 )
> >>
> >> which looks like this: http://pastebin.com/S6XYbwzw
> >>
> >> The corresponding ceph tree looks like this: http://pastebin.com/evRqwNT2
> >>
> >> That is all running a replication of 2.
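On the frequency-scaling/turbo point above: a quick way to check whether cores are being clocked down, and to hold them at full speed while benchmarking, is something like the sketch below. It assumes the cpupower utility (from your distro's linux-tools/kernel-tools package) is installed, and the governor names depend on which cpufreq driver is in use:

    # show the driver, the current governor and the frequency range each core may use
    cpupower frequency-info

    # watch the actual per-core clocks while the box is idle vs. under load
    watch -n1 "grep 'cpu MHz' /proc/cpuinfo"

    # hold the cores at full speed for the duration of a benchmark ...
    cpupower frequency-set -g performance
    # ... and revert afterwards (ondemand or powersave, depending on the driver)
    cpupower frequency-set -g ondemand

That achieves roughly the same thing as running a couple of CPU stress jobs to keep the cores at turbo, without actually burning the cycles.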
> >> -----------
> >>
> >> So as we can see, we run the same number ( 10 ) of HDDs.
> >> The model is 3 TB, 7200 RPM, 128 MB cache on these two nodes.
> >>
> >> A ceph osd perf looks like: http://pastebin.com/4d0Xik0m
> >>
> >> What you see right now is normal, everyday load on a healthy cluster.
> >>
> >> So now, because we are mean, let's turn on deep scrubbing in the middle of the day ( to give the people a reason to take a coffee break now at 12:00 CET ).
> >>
> >> ------------
> >>
> >> So now for the E3: http://pastebin.com/ZagKnhBQ
> >>
> >> And for the E5: http://pastebin.com/2J4zqqNW
> >>
> >> And again our osd perf: http://pastebin.com/V6pKGp9u
> >>
> >> ---------------
> >>
> >> So my conclusion out of that is that the E3 CPU becomes overloaded faster ( 4 cores with load 12/13 ) vs. ( 24 vCores with load 18/19 ).
> >>
> >> To me, even if I can not really measure it, and even though osd perf shows a lower latency for the E3 OSDs compared to the E5 OSDs, I can see from the E3 CPU stats that it is frequently running into 0% idle because the CPU has to wait for the HDDs ( %wa ). And because the core has to wait for the hardware, its CPU power can not be used for something else while it is in waiting state.
> >
> > Not sure if this was visible in the paste dump, but what is the run queue for both systems? When I looked into this a while back, I thought load included IOWait in its calculation, but IOWait itself didn't stop another thread getting run on the CPU if needed. Effectively IOWait = IDLE (if needed). From what I understand, the run queue on the CPU dictates whether or not there are too many threads queuing to run and thus slow performance. So I think your example for the E3 shows that there is still around 66% of the CPU available for processing. As a test, could you try running something like "stress" to consume CPU cycles and see if the IOWait drops?
> >
> >> So even if the E3 CPU gets the job done faster, the HDDs are usually too slow and the bottleneck. So the E3 can not take real advantage of its higher power per core. And because it has a low number of cores, the number of "waiting state" cores quickly becomes as big as the total number of CPU cores.
> >
> > I did some testing last year by scaling the CPU frequency and measuring write latency. If you are using SSD journals, then I found the frequency makes a massive difference to small write IOs. If you are doing mainly reads with HDDs, then faster cores probably won't do much.
> >
> >> The result is that there is an overload of the system, and we are running into an evil nightmare of IOPS.
> >
> > Are you actually seeing problems with the cluster? I would be interested to hear what you are encountering.
> >
> >> But again: I can not really measure it. I can not see which HDD delivers which data and how fast.
> >>
> >> So maybe the E5 is slowing down the whole thing. Maybe not.
> >>
> >> But for me, the probability that a 4-core system with 0% idle left at a system load of 12/13 is "guilty" is higher than for a 24-vCore system with still ~50% idle time and a system load of 18/19.
> >>
> >> But, of course, I have to admit that because of the 32 GB RAM vs. 64 GB RAM, the comparison might be more like apples and oranges. Maybe with similar RAM, the systems will perform similarly.
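Regarding the run queue and the "stress" suggestion a few paragraphs up, this is roughly the test I had in mind. Just a sketch, assuming the stress and sysstat packages are installed and a 4-core E3 (adjust the worker count to taste):

    # keep 3 of the 4 cores spinning for 10 minutes
    stress -c 3 -t 600 &

    # r = run queue (threads waiting for a CPU), b = blocked on IO, wa = iowait
    vmstat 5

    # per-core view of %usr / %iowait / %idle while the workers run
    mpstat -P ALL 5

If %wa collapses while the run queue (r) stays at or below the number of cores, the "waiting" cores really were available for other work, which is what your 3-of-4-cores test above seems to show.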
> >
> > I'm sticking 64GB in my E3 servers to be on the safe side.
> >
> >> But you can judge the stats yourself, and maybe gain some knowledge from them :-)
> >>
> >> For us, what we will do here next, now that jewel is out, is build up a new cluster with:
> >>
> >> 2x E5-2620v3, 128 GB RAM, HBA -> JBOD configuration, while we will add an SSD cache tier. So right now, I still believe that with the E3, because of the limited number of cores, you are more limited in the maximum number of OSDs you can run with it.
> >
> > If you do need more cores, I think a better solution might be an 8 or 10 core single CPU. There seems to be a lot of evidence that sticking with a single socket is best for Ceph if you can.
> >
> >> Maybe with your E3, your 12 HDDs ( depending on/especially if you have an (SSD) cache tier in between ) will run fine. But I think you are going here into an area where, under special conditions ( hardware failure/deep scrub/... ), your storage performance with an E3 will quickly lose so much speed that your applications will not operate smoothly anymore.
> >>
> >> But again, many factors are involved, so form your own picture :-)
> >>
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:info@xxxxxxxxxxxxxxxxx
> >>
> >> Anschrift:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt )
> >> Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402 beim Amtsgericht Hanau
> >> Geschäftsführung: Oliver Dzombic
> >>
> >> Steuer Nr.: 35 236 3622 1
> >> UST ID: DE274086107
> >>
> >>
> >> On 31.05.2016 09:41, Nick Fisk wrote:
> >>> Hi Oliver,
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> >>>> Sent: 30 May 2016 16:32
> >>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>
> >>>> Hi,
> >>>>
> >>>> E3 CPUs have 4 cores, with an HT unit, so 8 logical cores. And they are not multi-CPU.
> >>>>
> >>>> That means you will naturally ( quickly ) be limited in the number of OSDs you can run with that.
> >>>
> >>> I'm hoping to be able to run 12, do you think that will be a struggle?
> >>>
> >>>> Because no matter how much GHz it has, the OSD process occupies a CPU core forever.
> >>>
> >>> I'm not sure I agree with this point. An OSD process is comprised of tens of threads which, unless you have pinned the process to a single core, will be running randomly across all the cores on the CPU. As far as I'm aware, all these threads are given a 10ms time slice and then scheduled to run on the next available core. A 4x4GHz CPU will run all these threads faster than an 8x2GHz CPU; this is where the latency advantages are seen.
> >>>
> >>> If you get to the point where you have 100's of threads all demanding CPU time, a 4x4GHz CPU will be roughly the same speed as an 8x2GHz CPU. Yes, there are half the cores available, but each core completes its work in half the time. There may be some advantages with ever increasing thread counts, but there are also disadvantages with memory/IO access over the inter-CPU link in the case of dual sockets.
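For anyone who wants to check the thread behaviour on their own nodes, something like this shows how many threads each ceph-osd actually spawns and which core each one last ran on (just a sketch; thread counts vary a lot between releases and with load):

    # one line per OSD daemon: PID and its current thread count
    for pid in $(pidof ceph-osd); do
        printf 'ceph-osd %s: ' "$pid"
        grep Threads: /proc/$pid/status
    done

    # every thread of the first OSD, with the core (psr) it last ran on
    ps -L -o tid,psr,pcpu,comm -p $(pidof ceph-osd | awk '{print $1}')

Unless you pin them, the scheduler spreads those threads across whatever cores are free, which is why per-core clock speed matters more than the core count until the run queue actually fills up.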
> >>>
> >>>> Not 100%, but still enough to ruin your day if you have 8 logical cores and 12 disks ( in scrubbing/backfilling/high load ).
> >>>
> >>> I did some testing with a 12-core 2GHz Xeon E5 (2x6) by disabling 8 cores and performance was sufficient. I know E3 and E5 are different CPU families, but hopefully this was a good enough test.
> >>>
> >>>> So all single CPUs are just good for a very limited number of OSDs.
> >>>>
> >>>> --
> >>>> Mit freundlichen Gruessen / Best regards
> >>>>
> >>>> Oliver Dzombic
> >>>> IP-Interactive
> >>>>
> >>>> mailto:info@xxxxxxxxxxxxxxxxx
> >>>>
> >>>> Anschrift:
> >>>>
> >>>> IP Interactive UG ( haftungsbeschraenkt )
> >>>> Zum Sonnenberg 1-3
> >>>> 63571 Gelnhausen
> >>>>
> >>>> HRB 93402 beim Amtsgericht Hanau
> >>>> Geschäftsführung: Oliver Dzombic
> >>>>
> >>>> Steuer Nr.: 35 236 3622 1
> >>>> UST ID: DE274086107
> >>>>
> >>>>
> >>>> On 30.05.2016 17:13, Christian Balzer wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, 30 May 2016 09:40:11 +0100 Nick Fisk wrote:
> >>>>>
> >>>>>> The other option is to scale out rather than scale up. I'm currently building nodes based on a fast Xeon E3 with 12 drives in 1U. The MB/CPU is very attractively priced and the higher clock gives you much lower write latency if that is important. The density is slightly lower, but I guess you gain an advantage in more granularity of the cluster.
> >>>>>
> >>>>> Most definitely, granularity and number of OSDs (up to a point, mind ya) is a good thing [TM].
> >>>>>
> >>>>> I was citing the designs I did basically to counter the "not dense enough" argument.
> >>>>>
> >>>>> Ultimately with Ceph (unless you throw lots of money and brain cells at it), the less dense, the better it will perform.
> >>>>>
> >>>>> Christian
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jack Makenz
> >>>>>>> Sent: 30 May 2016 08:40
> >>>>>>> To: Christian Balzer <chibi@xxxxxxx>
> >>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>>>>> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>>>>
> >>>>>>> Thanks Christian, and all of the ceph users
> >>>>>>>
> >>>>>>> Your guidance was very helpful, appreciated!
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Jack Makenz
> >>>>>>>
> >>>>>>> On Mon, May 30, 2016 at 11:08 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> you may want to read up on the various high-density node threads and conversations here.
> >>>>>>>
> >>>>>>> You most certainly do NOT need high-end storage systems to create multi-petabyte storage systems with Ceph.
> >>>>>>>
> >>>>>>> If you were to use these chassis as a basis:
> >>>>>>> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
> >>>>>>> [We (and surely others) urged Supermicro to provide a design like this]
> >>>>>>>
> >>>>>>> Fill them with 6TB HDDs, configure them as 5x 12-HDD RAID6s, and set your replication to 2 in Ceph, and you will wind up with a VERY reliable, resilient 1.2PB per rack (32U, leaving space for other bits and not melting the PDUs).
> >>>>>>> Add fast SSDs or NVMes to this chassis for journals and you have decently performing mass storage.
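Christian's 1.2PB-per-rack figure checks out, by the way. A back-of-the-envelope check (shell arithmetic, assuming 8 of those 4U chassis in the 32U he mentions):

    # 8 chassis x 5 RAID6 sets x 10 data disks (12 minus 2 parity) x 6 TB, then Ceph size=2
    echo $(( 8 * 5 * 10 * 6 / 2 ))    # -> 1200 TB usable, i.e. ~1.2 PB per rack

That is 2880 TB of raw disk for 1200 TB usable, an overhead factor of 2.4.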
> >>>>>>>
> >>>>>>> Need more IOPS for really hot data?
> >>>>>>> Add a cache tier or dedicated SSD pools for special needs/customers.
> >>>>>>>
> >>>>>>> Alternatively, do "classic" Ceph with 3x replication or EC coding, but in either case (even more so with EC) you will need the most fire-breathing CPUs available, so compared to the above design it may be a zero-sum game cost-wise, if not performance-wise as well.
> >>>>>>> This leaves you with 960TB in the same space when doing 3x replication.
> >>>>>>>
> >>>>>>> A middle-of-the-road approach would be to use RAID1- or RAID10-based OSDs to bring down the computational needs in exchange for higher storage costs (effectively 4x replication).
> >>>>>>> This only gives you 720TB; on the other hand, it will be easier (and cheaper, CPU-cost-wise) to achieve peak performance with this approach compared to the one above with 60 OSDs per node.
> >>>>>>>
> >>>>>>> Lastly, I give you this (and I am not a fan of Fujitsu, mind):
> >>>>>>> http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
> >>>>>>>
> >>>>>>> Christian
> >>>>>>>
> >>>>>>> On Mon, 30 May 2016 10:25:35 +0430 Jack Makenz wrote:
> >>>>>>>
> >>>>>>>> Forwarded conversation
> >>>>>>>> Subject: Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>>>>> ------------------------
> >>>>>>>>
> >>>>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>>>> Date: Sun, May 29, 2016 at 6:52 PM
> >>>>>>>> To: ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>
> >>>>>>>> Hello All,
> >>>>>>>> There is a serious problem with Ceph that may waste storage capacity when using high-end storage systems (Hitachi, IBM, EMC, HP, ...) as the back-end for OSD hosts.
> >>>>>>>>
> >>>>>>>> Imagine that in a real cloud we need *n petabytes* of storage capacity, and commodity hardware's hard disks or the OSD servers' hard disks can't provide this amount of storage; thus we have to use storage systems as the back-end for OSD hosts (to implement the OSD daemons).
> >>>>>>>>
> >>>>>>>> But because almost all of these storage systems (regardless of their brand) use RAID technology, and Ceph also replicates at least two copies of each object, a lot of storage capacity is wasted.
> >>>>>>>>
> >>>>>>>> So is there any solution to this problem/misunderstanding?
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> Jack Makenz
> >>>>>>>>
> >>>>>>>> ----------
> >>>>>>>> From: *Nate Curry* <curry@xxxxxxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 5:50 AM
> >>>>>>>> To: Jack Makenz <jack.makenz@xxxxxxxxx>
> >>>>>>>> Cc: Unknown <ceph-community@xxxxxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> I think the purpose of Ceph is to get away from having to rely on high-end storage systems and to provide the capacity to utilize multiple less expensive servers as the storage system.
> >>>>>>>>
> >>>>>>>> That being said, you should still be able to use the high-end storage systems with or without RAID enabled. You could do away with RAID altogether and let Ceph handle the redundancy, or you can have LUNs assigned to hosts and put into use as OSDs. You could make it work either way, but to get the most out of your storage with Ceph I think a non-RAID configuration would be best.
> >>>>>>>>
> >>>>>>>> Nate Curry
> >>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Ceph-community mailing list
> >>>>>>>>> Ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
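To put some numbers on the "wasted capacity" question that started this thread, here is the same rack worked through for the three layouts Christian describes above (shell arithmetic, same assumptions as before: 8 chassis x 60 disks x 6 TB = 2880 TB raw per rack):

    echo $(( 2880 * 10 / 12 / 2 ))   # RAID6 (10 of 12 data) + Ceph size=2 -> 1200 TB
    echo $(( 2880 / 3 ))             # plain JBOD OSDs + Ceph size=3       ->  960 TB
    echo $(( 2880 / 2 / 2 ))         # RAID1/10 OSDs + Ceph size=2         ->  720 TB

So layering RAID under Ceph is not automatically a capacity disaster; it depends on which RAID level you pair with which replication count, which is the trade-off Christian lays out above.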
> >>>>>>>> ----------
> >>>>>>>> From: *Doug Dressler* <darbymorrison@xxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 6:02 AM
> >>>>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>
> >>>>>>>> Cc: Jack Makenz <jack.makenz@xxxxxxxxx>, Unknown <ceph-community@xxxxxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> For non-technical reasons I had to run ceph initially using SAN disks.
> >>>>>>>>
> >>>>>>>> Lesson learned:
> >>>>>>>>
> >>>>>>>> Make sure deduplication is disabled on the SAN :-)
> >>>>>>>>
> >>>>>>>> ----------
> >>>>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 9:05 AM
> >>>>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>, ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>
> >>>>>>>> Thanks Nate,
> >>>>>>>> but as I mentioned before, providing petabytes of storage capacity on commodity hardware or enterprise servers is almost impossible. Of course it is possible by installing hundreds of servers with 3 terabyte hard disks, but this solution wastes data center raised-floor space, power consumption and also *money* :)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Christian Balzer           Network/Systems Engineer
> >>>>>>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>>>>>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com