> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> Sent: 31 May 2016 12:51
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
>
> Hi Nick,
>
> well, as it seems, you have a point.
>
> I have now started processes to make 3 of 4 cores 100% busy.
>
> The %wa dropped to 2.5-12% without scrubbing, but also ~0% idle time.

Thanks for testing that out. I was getting worried I had ordered the wrong kit!!!

> While it is ~20-70% wa without those 3 of 4 cores being 100% busy, with ~40-50% idle time.
>
> ---
>
> That means that %wa does not harm the CPU (considerably).
>
> I don't dare to start a new scrubbing again now at daytime.

Are you possibly getting poor performance during scrubs due to disk contention rather than CPU?

> So those numbers are missing right now.
>
> In the very end it means that you should be fine with your 12x HDD.
>
> So in the very end, the more cores you have, the more GHz in sum you get ( 4x 4 GHz E3 vs. 12-24x 2.4 GHz E5 ), and this way the more OSDs you can run without the CPU being the bottleneck. While a higher frequency means higher per-task performance ( above all for writes ).

Yes, that was my plan: get good write latency, but still be able to scale easily/cheaply by using the E3's. Also keep in mind that with too many cores, they will start scaling down their frequency to save power if they are not kept busy. If a process gets assigned to a sleeping, clocked-down core, it takes a while for it to boost back up. I found this could cause a 10-30% hit in performance.

FYI, here are the tests I ran last year: http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/

I actually found I could get better than 1600 iops if I ran a couple of CPU stress tests to keep the cores running at turbo speed. I also tested with the journal on a RAM disk to eliminate the SSD speed from the equation and managed to get near 2500 iops, which is pretty good for QD=1.

> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:info@xxxxxxxxxxxxxxxxx
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> On 31.05.2016 13:05, Nick Fisk wrote:
> > Hi Oliver,
> >
> > Thanks for this, very interesting and relevant to me at the moment as your two hardware platforms mirror exactly my existing and new cluster.
> >
> > Just a couple of comments inline.
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> >> Sent: 31 May 2016 11:26
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>
> >> Hi Nick,
> >>
> >> we have running here on one node:
> >>
> >> E3-1225v5 ( 4x 3.3 GHz ), 32 GB RAM ( newceph1 )
> >>
> >> which looks like this: http://pastebin.com/btVpeJrE
> >>
> >> and we have running here on another node:
> >>
> >> 2x E5-2620v3 ( 12x 2.4 GHz + HT units ), 64 GB RAM ( new-ceph2 )
> >>
> >> which looks like this: http://pastebin.com/S6XYbwzw
> >>
> >> The corresponding ceph tree looks like this: http://pastebin.com/evRqwNT2
> >>
> >> That is all running a replication of 2.
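On the frequency-scaling/turbo point above: a quick way to check whether cores are being clocked down, and to hold them at full speed while benchmarking, is something like the sketch below. It assumes the cpupower utility (from your distro's linux-tools/kernel-tools package) is installed, and the governor names depend on which cpufreq driver is in use:

    # show the driver, the current governor and the frequency range each core may use
    cpupower frequency-info

    # watch the actual per-core clocks while the box is idle vs. under load
    watch -n1 "grep 'cpu MHz' /proc/cpuinfo"

    # hold the cores at full speed for the duration of a benchmark ...
    cpupower frequency-set -g performance
    # ... and revert afterwards (ondemand or powersave, depending on the driver)
    cpupower frequency-set -g ondemand

That achieves roughly the same thing as running a couple of CPU stress jobs to keep the cores at turbo, without actually burning the cycles.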
> >> -----------
> >>
> >> So as we can see, we run the same number ( 10 ) of HDDs.
> >> The model is 3 TB, 7200 RPM, 128 MB cache on these two nodes.
> >>
> >> A ceph osd perf looks like: http://pastebin.com/4d0Xik0m
> >>
> >> What you see right now is normal, everyday load on a healthy cluster.
> >>
> >> So now, because we are mean, let's turn on deep scrubbing in the middle of the day ( to give the people a reason to take a coffee break now at 12:00 CET ).
> >>
> >> ------------
> >>
> >> So now for the E3: http://pastebin.com/ZagKnhBQ
> >>
> >> And for the E5: http://pastebin.com/2J4zqqNW
> >>
> >> And again our osd perf: http://pastebin.com/V6pKGp9u
> >>
> >> ---------------
> >>
> >> So my conclusion out of that is that the E3 CPU becomes overloaded faster ( 4 cores with load 12/13 ) vs. ( 24 vCores with load 18/19 ).
> >>
> >> To me, even if I can not really measure it, and even though osd perf shows a lower latency for the E3 OSDs compared to the E5 OSDs, I can see from the E3 CPU stats that it is frequently running into 0% idle because the CPU has to wait for the HDDs ( %wa ). And because the core has to wait for the hardware, its CPU power can not be used for something else while it is in waiting state.
> >
> > Not sure if this was visible in the paste dump, but what is the run queue for both systems? When I looked into this a while back, I thought load included IOWait in its calculation, but IOWait itself didn't stop another thread getting run on the CPU if needed. Effectively IOWait = IDLE (if needed). From what I understand, the run queue on the CPU dictates whether or not there are too many threads queuing to run and thus slow performance. So I think your example for the E3 shows that there is still around 66% of the CPU available for processing. As a test, could you try running something like "stress" to consume CPU cycles and see if the IOWait drops?
> >
> >> So even if the E3 CPU gets the job done faster, the HDDs are usually too slow and the bottleneck. So the E3 can not take real advantage of its higher power per core. And because it has a low number of cores, the number of "waiting state" cores quickly becomes as big as the total number of CPU cores.
> >
> > I did some testing last year by scaling the CPU frequency and measuring write latency. If you are using SSD journals, then I found the frequency makes a massive difference to small write IOs. If you are doing mainly reads with HDDs, then faster cores probably won't do much.
> >
> >> The result is that there is an overload of the system, and we are running into an evil nightmare of IOPS.
> >
> > Are you actually seeing problems with the cluster? I would be interested to hear what you are encountering.
> >
> >> But again: I can not really measure it. I can not see which HDD delivers which data and how fast.
> >>
> >> So maybe the E5 is slowing down the whole thing. Maybe not.
> >>
> >> But for me, the probability that a 4-core system with 0% idle left at a system load of 12/13 is "guilty" is higher than for a 24-vCore system with still ~50% idle time and a system load of 18/19.
> >>
> >> But, of course, I have to admit that because of the 32 GB RAM vs. 64 GB RAM, the comparison might be more like apples and oranges. Maybe with similar RAM, the systems will perform similarly.
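Regarding the run queue and the "stress" suggestion a few paragraphs up, this is roughly the test I had in mind. Just a sketch, assuming the stress and sysstat packages are installed and a 4-core E3 (adjust the worker count to taste):

    # keep 3 of the 4 cores spinning for 10 minutes
    stress -c 3 -t 600 &

    # r = run queue (threads waiting for a CPU), b = blocked on IO, wa = iowait
    vmstat 5

    # per-core view of %usr / %iowait / %idle while the workers run
    mpstat -P ALL 5

If %wa collapses while the run queue (r) stays at or below the number of cores, the "waiting" cores really were available for other work, which is what your 3-of-4-cores test above seems to show.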
> >
> > I'm sticking 64GB in my E3 servers to be on the safe side.
> >
> >> But you can judge the stats yourself, and maybe gain some knowledge from them :-)
> >>
> >> For us, what we will do here next, now that jewel is out, is build up a new cluster with:
> >>
> >> 2x E5-2620v3, 128 GB RAM, HBA -> JBOD configuration, while we will add an SSD cache tier. So right now, I still believe that with the E3, because of the limited number of cores, you are more limited in the maximum number of OSDs you can run with it.
> >
> > If you do need more cores, I think a better solution might be an 8 or 10 core single CPU. There seems to be a lot of evidence that sticking with a single socket is best for Ceph if you can.
> >
> >> Maybe with your E3, your 12 HDDs ( depending on/especially if you have an (SSD) cache tier in between ) will run fine. But I think you are going here into an area where, under special conditions ( hardware failure/deep scrub/... ), your storage performance with an E3 will quickly lose so much speed that your applications will not operate smoothly anymore.
> >>
> >> But again, many factors are involved, so form your own picture :-)
> >>
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:info@xxxxxxxxxxxxxxxxx
> >>
> >> Anschrift:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt )
> >> Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402 beim Amtsgericht Hanau
> >> Geschäftsführung: Oliver Dzombic
> >>
> >> Steuer Nr.: 35 236 3622 1
> >> UST ID: DE274086107
> >>
> >>
> >> On 31.05.2016 09:41, Nick Fisk wrote:
> >>> Hi Oliver,
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> >>>> Sent: 30 May 2016 16:32
> >>>> To: ceph-users@xxxxxxxxxxxxxx
> >>>> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>
> >>>> Hi,
> >>>>
> >>>> E3 CPUs have 4 cores, with an HT unit, so 8 logical cores. And they are not multi-CPU.
> >>>>
> >>>> That means you will naturally ( quickly ) be limited in the number of OSDs you can run with that.
> >>>
> >>> I'm hoping to be able to run 12, do you think that will be a struggle?
> >>>
> >>>> Because no matter how much GHz it has, the OSD process occupies a CPU core forever.
> >>>
> >>> I'm not sure I agree with this point. An OSD process is comprised of tens of threads which, unless you have pinned the process to a single core, will be running randomly across all the cores on the CPU. As far as I'm aware, all these threads are given a 10ms time slice and then scheduled to run on the next available core. A 4x4GHz CPU will run all these threads faster than an 8x2GHz CPU; this is where the latency advantages are seen.
> >>>
> >>> If you get to the point where you have 100's of threads all demanding CPU time, a 4x4GHz CPU will be roughly the same speed as an 8x2GHz CPU. Yes, there are half the cores available, but each core completes its work in half the time. There may be some advantages with ever increasing thread counts, but there are also disadvantages with memory/IO access over the inter-CPU link in the case of dual sockets.
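For anyone who wants to check the thread behaviour on their own nodes, something like this shows how many threads each ceph-osd actually spawns and which core each one last ran on (just a sketch; thread counts vary a lot between releases and with load):

    # one line per OSD daemon: PID and its current thread count
    for pid in $(pidof ceph-osd); do
        printf 'ceph-osd %s: ' "$pid"
        grep Threads: /proc/$pid/status
    done

    # every thread of the first OSD, with the core (psr) it last ran on
    ps -L -o tid,psr,pcpu,comm -p $(pidof ceph-osd | awk '{print $1}')

Unless you pin them, the scheduler spreads those threads across whatever cores are free, which is why per-core clock speed matters more than the core count until the run queue actually fills up.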
> >>>
> >>>> Not 100%, but still enough to ruin your day if you have 8 logical cores and 12 disks ( in scrubbing/backfilling/high load ).
> >>>
> >>> I did some testing with a 12-core 2GHz Xeon E5 (2x6) by disabling 8 cores and performance was sufficient. I know E3 and E5 are different CPU families, but hopefully this was a good enough test.
> >>>
> >>>> So all single CPUs are just good for a very limited number of OSDs.
> >>>>
> >>>> --
> >>>> Mit freundlichen Gruessen / Best regards
> >>>>
> >>>> Oliver Dzombic
> >>>> IP-Interactive
> >>>>
> >>>> mailto:info@xxxxxxxxxxxxxxxxx
> >>>>
> >>>> Anschrift:
> >>>>
> >>>> IP Interactive UG ( haftungsbeschraenkt )
> >>>> Zum Sonnenberg 1-3
> >>>> 63571 Gelnhausen
> >>>>
> >>>> HRB 93402 beim Amtsgericht Hanau
> >>>> Geschäftsführung: Oliver Dzombic
> >>>>
> >>>> Steuer Nr.: 35 236 3622 1
> >>>> UST ID: DE274086107
> >>>>
> >>>>
> >>>> On 30.05.2016 17:13, Christian Balzer wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, 30 May 2016 09:40:11 +0100 Nick Fisk wrote:
> >>>>>
> >>>>>> The other option is to scale out rather than scale up. I'm currently building nodes based on a fast Xeon E3 with 12 drives in 1U. The MB/CPU is very attractively priced and the higher clock gives you much lower write latency if that is important. The density is slightly lower, but I guess you gain an advantage in more granularity of the cluster.
> >>>>>
> >>>>> Most definitely, granularity and number of OSDs (up to a point, mind ya) is a good thing [TM].
> >>>>>
> >>>>> I was citing the designs I did basically to counter the "not dense enough" argument.
> >>>>>
> >>>>> Ultimately with Ceph (unless you throw lots of money and brain cells at it), the less dense, the better it will perform.
> >>>>>
> >>>>> Christian
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jack Makenz
> >>>>>>> Sent: 30 May 2016 08:40
> >>>>>>> To: Christian Balzer <chibi@xxxxxxx>
> >>>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>>>>> Subject: Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>>>>
> >>>>>>> Thanks Christian, and all of the ceph users
> >>>>>>>
> >>>>>>> Your guidance was very helpful, appreciated!
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Jack Makenz
> >>>>>>>
> >>>>>>> On Mon, May 30, 2016 at 11:08 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> you may want to read up on the various high-density node threads and conversations here.
> >>>>>>>
> >>>>>>> You most certainly do NOT need high-end storage systems to create multi-petabyte storage systems with Ceph.
> >>>>>>>
> >>>>>>> If you were to use these chassis as a basis:
> >>>>>>> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
> >>>>>>> [We (and surely others) urged Supermicro to provide a design like this]
> >>>>>>>
> >>>>>>> Fill them with 6TB HDDs, configure them as 5x 12-HDD RAID6s, and set your replication to 2 in Ceph, and you will wind up with a VERY reliable, resilient 1.2PB per rack (32U, leaving space for other bits and not melting the PDUs).
> >>>>>>> Add fast SSDs or NVMes to this chassis for journals and you have decently performing mass storage.
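Christian's 1.2PB-per-rack figure checks out, by the way. A back-of-the-envelope check (shell arithmetic, assuming 8 of those 4U chassis in the 32U he mentions):

    # 8 chassis x 5 RAID6 sets x 10 data disks (12 minus 2 parity) x 6 TB, then Ceph size=2
    echo $(( 8 * 5 * 10 * 6 / 2 ))    # -> 1200 TB usable, i.e. ~1.2 PB per rack

That is 2880 TB of raw disk for 1200 TB usable, an overhead factor of 2.4.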
> >>>>>>>
> >>>>>>> Need more IOPS for really hot data?
> >>>>>>> Add a cache tier or dedicated SSD pools for special needs/customers.
> >>>>>>>
> >>>>>>> Alternatively, do "classic" Ceph with 3x replication or EC coding, but in either case (even more so with EC) you will need the most fire-breathing CPUs available, so compared to the above design it may be a zero-sum game cost-wise, if not performance-wise as well.
> >>>>>>> This leaves you with 960TB in the same space when doing 3x replication.
> >>>>>>>
> >>>>>>> A middle-of-the-road approach would be to use RAID1- or RAID10-based OSDs to bring down the computational needs in exchange for higher storage costs (effectively 4x replication).
> >>>>>>> This only gives you 720TB; on the other hand, it will be easier (and cheaper, CPU-cost-wise) to achieve peak performance with this approach compared to the one above with 60 OSDs per node.
> >>>>>>>
> >>>>>>> Lastly, I give you this (and I am not a fan of Fujitsu, mind):
> >>>>>>> http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
> >>>>>>>
> >>>>>>> Christian
> >>>>>>>
> >>>>>>> On Mon, 30 May 2016 10:25:35 +0430 Jack Makenz wrote:
> >>>>>>>
> >>>>>>>> Forwarded conversation
> >>>>>>>> Subject: Wasting the Storage capacity when using Ceph based On high-end storage systems
> >>>>>>>> ------------------------
> >>>>>>>>
> >>>>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>>>> Date: Sun, May 29, 2016 at 6:52 PM
> >>>>>>>> To: ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>
> >>>>>>>> Hello All,
> >>>>>>>> There is a serious problem with Ceph that may waste storage capacity when using high-end storage systems (Hitachi, IBM, EMC, HP, ...) as the back-end for OSD hosts.
> >>>>>>>>
> >>>>>>>> Imagine that in a real cloud we need *n petabytes* of storage capacity, and commodity hardware's hard disks or the OSD servers' hard disks can't provide this amount of storage; thus we have to use storage systems as the back-end for OSD hosts (to implement the OSD daemons).
> >>>>>>>>
> >>>>>>>> But because almost all of these storage systems (regardless of their brand) use RAID technology, and Ceph also replicates at least two copies of each object, a lot of storage capacity is wasted.
> >>>>>>>>
> >>>>>>>> So is there any solution to this problem/misunderstanding?
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> Jack Makenz
> >>>>>>>>
> >>>>>>>> ----------
> >>>>>>>> From: *Nate Curry* <curry@xxxxxxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 5:50 AM
> >>>>>>>> To: Jack Makenz <jack.makenz@xxxxxxxxx>
> >>>>>>>> Cc: Unknown <ceph-community@xxxxxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> I think the purpose of Ceph is to get away from having to rely on high-end storage systems and to provide the capacity to utilize multiple less expensive servers as the storage system.
> >>>>>>>>
> >>>>>>>> That being said, you should still be able to use the high-end storage systems with or without RAID enabled. You could do away with RAID altogether and let Ceph handle the redundancy, or you can have LUNs assigned to hosts and put into use as OSDs. You could make it work either way, but to get the most out of your storage with Ceph I think a non-RAID configuration would be best.
> >>>>>>>>
> >>>>>>>> Nate Curry
> >>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Ceph-community mailing list
> >>>>>>>>> Ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
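To put some numbers on the "wasted capacity" question that started this thread, here is the same rack worked through for the three layouts Christian describes above (shell arithmetic, same assumptions as before: 8 chassis x 60 disks x 6 TB = 2880 TB raw per rack):

    echo $(( 2880 * 10 / 12 / 2 ))   # RAID6 (10 of 12 data) + Ceph size=2 -> 1200 TB
    echo $(( 2880 / 3 ))             # plain JBOD OSDs + Ceph size=3       ->  960 TB
    echo $(( 2880 / 2 / 2 ))         # RAID1/10 OSDs + Ceph size=2         ->  720 TB

So layering RAID under Ceph is not automatically a capacity disaster; it depends on which RAID level you pair with which replication count, which is the trade-off Christian lays out above.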
> >>>>>>>> ----------
> >>>>>>>> From: *Doug Dressler* <darbymorrison@xxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 6:02 AM
> >>>>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>
> >>>>>>>> Cc: Jack Makenz <jack.makenz@xxxxxxxxx>, Unknown <ceph-community@xxxxxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> For non-technical reasons I had to run ceph initially using SAN disks.
> >>>>>>>>
> >>>>>>>> Lesson learned:
> >>>>>>>>
> >>>>>>>> Make sure deduplication is disabled on the SAN :-)
> >>>>>>>>
> >>>>>>>> ----------
> >>>>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>>>> Date: Mon, May 30, 2016 at 9:05 AM
> >>>>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>, ceph-community@xxxxxxxxxxxxxx
> >>>>>>>>
> >>>>>>>> Thanks Nate,
> >>>>>>>> but as I mentioned before, providing petabytes of storage capacity on commodity hardware or enterprise servers is almost impossible. Of course it is possible by installing hundreds of servers with 3 terabyte hard disks, but this solution wastes data center raised-floor space, power consumption and also *money* :)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Christian Balzer           Network/Systems Engineer
> >>>>>>> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> >>>>>>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com