Re: Fwd: [Ceph-community] Wasting the Storage capacity when using Ceph based On high-end storage systems


Hi Oliver,

Thanks for this, very interesting and relevant to me at the moment, as your
two hardware platforms exactly mirror my existing and new clusters.

Just a couple of comments inline.


> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Oliver Dzombic
> Sent: 31 May 2016 11:26
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Fwd: [Ceph-community] Wasting the Storage
> capacity when using Ceph based On high-end storage systems
> 
> Hi Nick,
> 
> we have here running on one node:
> 
> E3-1225v5 ( 4x 3.3 GHz ), 32 GB RAM ( newceph1 )
> 
> looks like this: http://pastebin.com/btVpeJrE
> 
> 
> and we have here running on another node:
> 
> 
> 2x E5-2620v3 ( 12x 2.4 GHz + HT units ), 64 GB RAM ( new-ceph2 )
> 
> looks like this: http://pastebin.com/S6XYbwzw
> 
> 
> 
> The corresponding ceph tree
> 
> looks like this: http://pastebin.com/evRqwNT2
> 
> That is all running with a replication of 2.
> 
> -----------
> 
> So as we can see, we run the same number ( 10 ) of HDDs on both nodes.
> The model is a 3 TB, 7200 RPM drive with 128 MB cache.
> 
> A "ceph osd perf" looks like: http://pastebin.com/4d0Xik0m
> 
> 
> What you see right now is normal, everyday, load on a healthy cluster.
> 
> So now, because we are mean, let's turn on deep scrubbing in the middle
> of the day ( to give people a reason to take a coffee break now at
> 12:00 CET ).
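
Deep scrubs can also be kicked off by hand if you want to reproduce this
kind of load at a time of your own choosing - the osd/pg IDs below are
only placeholders:

    # deep-scrub a single OSD, or a single PG
    ceph osd deep-scrub osd.0
    ceph pg deep-scrub 0.1f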
> 
> ------------
> 
> So now for the E3: http://pastebin.com/ZagKnhBQ
> 
> And for the E5: http://pastebin.com/2J4zqqNW
> 
> And again our osd perf: http://pastebin.com/V6pKGp9u
> 
> 
> ---------------
> 
> 
> So my conclusion from that is that the E3 CPU becomes overloaded faster
> ( 4 cores with load 12/13 ) vs. ( 24 vCores with load 18/19 ).
> 
> Even though I can't really measure it, and even though osd perf shows a
> lower latency on the E3 OSDs compared to the E5 OSDs, I can see from the
> E3 CPU stats that it frequently hits 0% idle because the CPU has to wait
> for the HDDs ( %WA ). And because a core that has to wait for the
> hardware is in a waiting state, its CPU power cannot be used for anything
> else.

Not sure if this was visible in the paste dump, but what is the run queue
for both systems? When I looked into this a while back, I understood that
load includes IOWait in its calculation, but IOWait itself doesn't stop
another thread from running on the CPU if needed - effectively IOWait =
idle (if needed). From what I understand, it's the run queue that dictates
whether too many threads are queuing to run and thus whether performance
suffers. So I think your example for the E3 shows that there is still
around 66% of the CPU available for processing. As a test, could you try
running something like "stress" to consume CPU cycles and see if the
IOWait drops?

> 
> So even though the E3 CPU gets the job done faster, the HDDs are usually
> too slow and are the bottleneck. So the E3 cannot take real advantage of
> its higher power per core. And because it has a low number of cores, the
> number of cores in a waiting state quickly becomes as big as the total
> number of CPU cores.

I did some testing last year by scaling the CPU frequency and measuring
write latency. If you are using SSD journals, I found the frequency makes
a massive difference to small write IOs. If you are doing mainly reads
with HDDs, then faster cores probably won't do much.
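
If you want to repeat that kind of test, something along these lines
would do it (I'm not claiming these exact commands; the pool name and
frequency limits are just placeholders, pick values your CPU supports):

    # clamp the cores to a low clock, measure, then raise the limit and repeat
    cpupower -c all frequency-set -u 1600MHz
    rados bench -p testpool 30 write -b 4096 -t 1
    cpupower -c all frequency-set -u 3300MHz
    rados bench -p testpool 30 write -b 4096 -t 1

A single outstanding 4k write (-t 1) makes the per-op latency difference
easiest to see.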

> 
> The result is that there is an overload of the system, and we are running
> into an evil nightmare of IOPS.

Are you actually seeing problems with the cluster? I would be interested
to hear what you are encountering.

> 
> But again: I cannot really measure it. I cannot see which HDD delivers
> which data and how fast.
> 
> So maybe the E5 is slowing the whole thing down. Maybe not.
> 
> But for me, the probability that a 4-core system with 0% idle left at a
> system load of 12/13 is "guilty" is higher than for a 24-vCore system
> with still ~50% idle time at a system load of 18/19.
> 
> But, of course, I have to admit that because of the 32 GB RAM vs. 64 GB
> RAM, the comparison might be more like apples and oranges. Maybe with
> similar RAM the systems will perform similarly.

I'm sticking 64GB in my E3 servers to be on the safe side.

> 
> But you can judge the stats yourself, and maybe gain some knowledge from
> it :-)
> 
> As for us, what we will do next, now that Jewel is out, is build up a
> new cluster with:
> 
> 2x E5-2620v3, 128 GB RAM, HBA -> JBOD configuration, and we will add an
> SSD cache tier. So right now I still believe that with the E3, because
> of the limited number of cores, you are more limited in the maximum
> number of OSDs you can run with it.

If you do need more cores, I think a better solution might be an 8 or 10
core single-socket CPU. There seems to be a lot of evidence that sticking
with a single socket is best for Ceph if you can.
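
On a dual-socket box it's at least worth checking how the OSDs land
across the NUMA nodes - something like this, purely illustrative:

    # NUMA layout of the box
    numactl --hardware

    # current CPU affinity of each running OSD
    for p in $(pidof ceph-osd); do taskset -cp $p; done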

> 
> Maybe with your E3 your 12 HDDs ( depending on / especially if you have
> an (SSD) cache tier in between ) will run fine. But I think you are
> getting into an area where, under special conditions ( hardware failure /
> deep scrub / ... ) your storage performance with an E3 will quickly lose
> so much speed that your applications will not operate smoothly anymore.
> 
> But again, many factors are involved, so form your own picture :-)
> 
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:info@xxxxxxxxxxxxxxxxx
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 31.05.2016 um 09:41 schrieb Nick Fisk:
> > Hi Oliver,
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Oliver Dzombic
> >> Sent: 30 May 2016 16:32
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re:  Fwd: [Ceph-community] Wasting the Storage
> >> capacity when using Ceph based On high-end storage systems
> >>
> >> Hi,
> >>
> >> E3 CPUs have 4 cores, with an HT unit, so 8 logical cores. And they
> >> are not multi-CPU.
> >>
> >> That means you will naturally ( quickly ) be limited in the number of
> >> OSDs you can run with that.
> >
> > I'm hoping to be able to run 12, do you think that will be a struggle?
> >
> >>
> >> Because no matter how many GHz it has, the OSD process occupies a CPU
> >> core forever.
> >
> > I'm not sure I agree with this point. An OSD process is comprised of
> > tens of threads which, unless you have pinned the process to a single
> > core, will be running randomly across all the cores on the CPU. As far
> > as I'm aware, all these threads are given a 10ms time slice and then
> > scheduled to run on the next available core. A 4x4GHz CPU will run all
> > these threads faster than an 8x2GHz CPU; this is where the latency
> > advantages are seen.
> >
> > If you get to the point where you have 100's of threads all demanding
> > CPU time, a 4x4GHz CPU will be roughly the same speed as an 8x2GHz CPU.
> > Yes, there are half the cores available, but each core completes its
> > work in half the time. There may be some advantages with ever
> > increasing thread counts, but there are also disadvantages with
> > memory/IO access over the inter-CPU link in the case of dual sockets.
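
A quick way to see the thread count in question, if anyone is curious
(numbers will vary by release and config, this is just illustrative):

    # threads per ceph-osd process
    for p in $(pidof ceph-osd); do ps -o pid=,nlwp=,cmd= -p $p; done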
> >
> >> Not 100% of the time, but still enough to ruin your day if you have
> >> 8 logical cores and 12 disks ( during scrubbing/backfilling/high load ).
> >
> > I did some testing with a 12-core 2GHz Xeon E5 (2x6) by disabling 8
> > cores, and performance was sufficient. I know E3 and E5 are different
> > CPU families, but hopefully this was a good enough test.
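
For reference, cores can be taken offline through sysfs for this kind of
test - the CPU numbers below are just an example, the right ones depend
on your topology (and HT doubles the logical CPU count):

    # take logical CPUs 4-11 offline, run the benchmark, then bring them back
    for i in $(seq 4 11); do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
    for i in $(seq 4 11); do echo 1 > /sys/devices/system/cpu/cpu$i/online; done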
> >
> >>
> >> So any single CPU with few cores is only good for a very limited
> >> number of OSDs.
> >
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:info@xxxxxxxxxxxxxxxxx
> >>
> >> Anschrift:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402 beim Amtsgericht Hanau
> >> Geschäftsführung: Oliver Dzombic
> >>
> >> Steuer Nr.: 35 236 3622 1
> >> UST ID: DE274086107
> >>
> >>
> >> Am 30.05.2016 um 17:13 schrieb Christian Balzer:
> >>>
> >>> Hello,
> >>>
> >>> On Mon, 30 May 2016 09:40:11 +0100 Nick Fisk wrote:
> >>>
> >>>> The other option is to scale out rather than scale up. I'm
> >>>> currently building nodes based on a fast Xeon E3 with 12 drives in
> >>>> 1U. The MB/CPU is very attractively priced and the higher clock
> >>>> gives you much lower write latency, if that is important. The
> >>>> density is slightly lower, but I guess you gain an advantage in
> >>>> more granularity of the cluster.
> >>>>
> >>> Most definitely, granularity and number of OSDs (up to a point, mind
> >>> ya) is a good thing [TM].
> >>>
> >>> I was citing the designs I did to basically counter the "not dense
> >>> enough" argument.
> >>>
> >>> Ultimately with Ceph (unless you throw lots of money and brain cells
> >>> at it), the less dense, the better it will perform.
> >>>
> >>> Christian
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>> Behalf Of Jack Makenz
> >>>>> Sent: 30 May 2016 08:40
> >>>>> To: Christian Balzer <chibi@xxxxxxx>
> >>>>> Cc: ceph-users@xxxxxxxxxxxxxx
> >>>>> Subject: Re:  Fwd: [Ceph-community] Wasting the Storage capacity
> >>>>> when using Ceph based On high-end storage systems
> >>>>>
> >>>>> Thanks Christian, and all of ceph users
> >>>>>
> >>>>> Your guidance was very helpful, appreciate !
> >>>>>
> >>>>> Regards
> >>>>> Jack Makenz
> >>>>>
> >>>>> On Mon, May 30, 2016 at 11:08 AM, Christian Balzer <chibi@xxxxxxx>
> >>>>> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> you may want to read up on the various high-density node threads
> >>>>> and conversations here.
> >>>>>
> >>>>> You most certainly do NOT need high end-storage systems to create
> >>>>> multi-petabyte storage systems with Ceph.
> >>>>>
> >>>>> If you were to use these chassis as a basis:
> >>>>>
> >>>>> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
> >>>>> [We (and surely others) urged Supermicro to provide a design like
> >>>>> this]
> >>>>>
> >>>>> And fill them with 6TB HDDs, configure them as 5x 12-HDD RAID6s,
> >>>>> set your replication to 2 in Ceph, and you will wind up with VERY
> >>>>> reliable, resilient 1.2PB per rack (32U, leaving space for other
> >>>>> bits and not melting the PDUs).
> >>>>> Add fast SSDs or NVMes to this case for journals and you have
> >>>>> decently performing mass storage.
> >>>>>
> >>>>> Need more IOPS for really hot data?
> >>>>> Add a cache tier or dedicated SSD pools for special needs/customers.
> >>>>>
> >>>>> Alternatively, do "classic" Ceph with 3x replication or EC coding,
> >>>>> but in either case (even more so with EC) you will need the most
> >>>>> firebreathing CPUs available, so compared to the above design it
> >>>>> may be a zero sum game cost wise, if not performance wise as well.
> >>>>> This leaves you with 960TB in the same space when doing 3x
> >>>>> replication.
> >>>>>
> >>>>> A middle of the road approach would be to use RAID1 or 10 based
> >>>>> OSDs to bring down the computational needs in exchange for higher
> >>>>> storage costs (effective 4x replication).
> >>>>> This only gives you 720TB, alas it will be easier (and cheaper CPU
> >>>>> cost
> >>>>> wise) to achieve peak performance with this approach compared to
> >>>>> the one above with 60 OSDs per node.
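
For anyone wanting to sanity-check those capacity figures, they work out
if you assume 8 of the 4U 60-bay chassis in the 32U:

    8 nodes x 60 HDDs x 6TB                     = 2880TB raw
    RAID6 (10 of 12 usable) + 2x replication    : 2880 x 10/12 / 2 = 1200TB (~1.2PB)
    plain 3x replication                        : 2880 / 3         =  960TB
    RAID1/10 + 2x replication (4x effective)    : 2880 / 4         =  720TB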
> >>>>>
> >>>>> Lastly, I give you this (and not being a fan of Fujitsu, mind):
> >>>>>
> >>>>> http://www.fujitsu.com/global/products/computing/storage/eternus-cd/
> >>>>>
> >>>>> Christian
> >>>>>
> >>>>> On Mon, 30 May 2016 10:25:35 +0430 Jack Makenz wrote:
> >>>>>
> >>>>>> Forwarded conversation
> >>>>>> Subject: Wasting the Storage capacity when using Ceph based On
> >>>>>> high-end storage systems
> >>>>>> ------------------------
> >>>>>>
> >>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>> Date: Sun, May 29, 2016 at 6:52 PM
> >>>>>> To: ceph-community@xxxxxxxxxxxxxx
> >>>>>>
> >>>>>>
> >>>>>> Hello All,
> >>>>>> There are some serious problems with Ceph that may waste storage
> >>>>>> capacity when using high-end storage systems (Hitachi, IBM, EMC,
> >>>>>> HP, ...) as the back-end for OSD hosts.
> >>>>>>
> >>>>>> Imagine that in a real cloud we need *n petabytes* of storage
> >>>>>> capacity, an amount that commodity hardware's hard disks or OSD
> >>>>>> servers' hard disks can't provide. Thus we have to use storage
> >>>>>> systems as the back-end for OSD hosts (to implement the OSD
> >>>>>> daemons).
> >>>>>>
> >>>>>> But because almost all of these storage systems ( regardless of
> >>>>>> their brand ) use RAID technology, and Ceph also replicates at
> >>>>>> least two copies of each object, a lot of storage capacity is
> >>>>>> wasted.
> >>>>>>
> >>>>>> So is there any solution to this problem/misunderstanding?
> >>>>>>
> >>>>>> Regards
> >>>>>> Jack Makenz
> >>>>>>
> >>>>>> ----------
> >>>>>> From: *Nate Curry* <curry@xxxxxxxxxxxxx>
> >>>>>> Date: Mon, May 30, 2016 at 5:50 AM
> >>>>>> To: Jack Makenz <jack.makenz@xxxxxxxxx>
> >>>>>> Cc: Unknown <ceph-community@xxxxxxxxxxxxxx>
> >>>>>>
> >>>>>>
> >>>>>> I think the purpose of Ceph is to get away from having to rely on
> >>>>>> high-end storage systems and to provide the capacity to utilize
> >>>>>> multiple less expensive servers as the storage system.
> >>>>>>
> >>>>>> That being said, you should still be able to use the high-end
> >>>>>> storage systems with or without RAID enabled. You could do away
> >>>>>> with RAID altogether and let Ceph handle the redundancy, or you
> >>>>>> could have LUNs assigned to hosts and put into use as OSDs. You
> >>>>>> could make it work either way, but to get the most out of your
> >>>>>> storage with Ceph I think a non-RAID configuration would be best.
> >>>>>>
> >>>>>> Nate Curry
> >>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Ceph-community mailing list
> >>>>>>> Ceph-community@xxxxxxxxxxxxxx
> >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
> >>>>>>>
> >>>>>>>
> >>>>>> ----------
> >>>>>> From: *Doug Dressler* <darbymorrison@xxxxxxxxx>
> >>>>>> Date: Mon, May 30, 2016 at 6:02 AM
> >>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>
> >>>>>> Cc: Jack Makenz <jack.makenz@xxxxxxxxx>, Unknown <
> >>>>>> ceph-community@xxxxxxxxxxxxxx>
> >>>>>>
> >>>>>>
> >>>>>> For non-technical reasons I had to run ceph initially using SAN
> >>>>>> disks.
> >>>>>>
> >>>>>> Lesson learned:
> >>>>>>
> >>>>>> Make sure deduplication is disabled on the SAN :-)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----------
> >>>>>> From: *Jack Makenz* <jack.makenz@xxxxxxxxx>
> >>>>>> Date: Mon, May 30, 2016 at 9:05 AM
> >>>>>> To: Nate Curry <curry@xxxxxxxxxxxxx>, ceph-community@xxxxxxxxxxxxxx
> >>>>>>
> >>>>>>
> >>>>>> Thanks Nate,
> >>>>>> But as I mentioned before, providing petabytes of storage capacity
> >>>>>> on commodity hardware or enterprise servers is almost impossible.
> >>>>>> Of course, it is possible by installing hundreds of servers with
> >>>>>> 3-terabyte hard disks, but this solution wastes data center raised
> >>>>>> floor space, power consumption and also *money* :)
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Christian Balzer        Network/Systems Engineer
> >>>>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >>>>> http://www.gol.com/
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



