Re: Recommendations for building 1PB RadosGW with Erasure Code

Hello,

On Wed, 17 Feb 2016 09:19:39 -0000 Nick Fisk wrote:

> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Christian Balzer
> > Sent: 17 February 2016 02:41
> > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Recommendations for building 1PB RadosGW with
> > Erasure Code
> > 
> > 
> > Hello,
> > 
> > On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:
> > 
> > > Nick, Tyler, many thanks for very helpful feedback!
> > > I spent many hours meditating on the following two links:
> > > http://www.supermicro.com/solutions/storage_ceph.cfm
> > > http://s3s.eu/cephshop
> > >
> > > 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> > > CPUs (even the fastest ones) be enough to handle Erasure Coding?
> > >
> > Depends.
> > Since you're doing sequential writes (and reads I assume as you're
> > dealing with videos), CPU usage is going to be a lot lower than with
> > random, small 4KB block I/Os.
> > So most likely, yes.
> 
> That was my initial thought, but in the paper I linked, the 4MB
> tests are the ones that bring the CPUs to their knees. I think the
> erasure calculation is a large part of the overall CPU usage, and the
> larger IOs carry more data, which causes a significant increase in CPU
> requirements.
> 
This is clearly where my total lack of EC exposure and experience is
showing, but it certainly makes sense as well.

> Correct me if I'm wrong, but I recall, Christian, that your cluster is a
> full SSD cluster? 
No, but we talked back when I was building our 2nd production cluster, and
while waiting for parts I did make a temporary all-SSD one using all the
prospective journal SSDs.

That setup definitely maxed out on CPU long before the SSDs got busy when
doing 4KB rados benches or similar.

OTOH that same machine only uses about 4 cores out of 16 when doing the
same thing in its current configuration with 8 HDDs and 4 journal SSDs.

> I think we touched on this before: the GHz per
> OSD is probably more like 100MHz per IOP. In a spinning-disk cluster
> you effectively have a cap on the number of IOs you can serve before the
> disks max out, so the difference between large and small IOs is not
> that great. But on an SSD cluster there is no such cap, so you just end up
> with more IOs, hence the higher CPU usage.
> 
Yes, and that number is a good baseline (still).

My own rule of thumb is 1GHz or slightly less per OSD for pure HDD-based
clusters and about 1.5GHz per OSD for ones with SSD journals.
Round up to leave headroom for the OS and (in my case frequently) MON usage.

Of course for purely SSD-based OSDs, throw the kitchen sink at it, if
your wallet allows for it.
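
To make that concrete, here is a minimal back-of-the-envelope sizing sketch
in Python. The per-OSD figures are the rules of thumb from this thread; the
helper function and the 2GHz OS/MON allowance are just illustrative
assumptions, not an official formula:

GHZ_PER_HDD_OSD = 1.0        # pure HDD OSDs
GHZ_PER_JOURNALED_OSD = 1.5  # HDD OSDs with SSD journals

def node_cpu_ghz(hdd_osds=0, journaled_osds=0, os_mon_overhead=2.0):
    """Estimate the total GHz (clock x cores) a node should have available.

    os_mon_overhead is an assumed allowance for the OS and a co-located
    MON; adjust to taste. EC adds on top of this, per the discussion
    above, so treat the result as a floor rather than a target.
    """
    return (hdd_osds * GHZ_PER_HDD_OSD
            + journaled_osds * GHZ_PER_JOURNALED_OSD
            + os_mon_overhead)

# The proposed 29-OSD nodes with SSD journals:
print(node_cpu_ghz(journaled_osds=29))   # ~45.5 GHz, e.g. 2x 10 cores at 2.3GHz+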
 
Christian
> > 
> > > Also, as Nick stated, with 4-5 nodes I cannot use high-M "K+M"
> > > combinations. I did some calculations and found that the most
> > > efficient and safe configuration is to use 10 nodes with 29*6TB SATA
> > > and 7*200GB S3700 for journals. Assuming a 6+3 EC profile, that will
> > > give me 1.16 PB of effective space. Also, I prefer not to use precious
> > > NVMe drives; I don't see any reason to use them.
> > >
> > This is probably your best way forward; dense is nice and cost-saving,
> > but comes with a lot of potential gotchas.
> > Dense and large clusters can work, dense and small not so much.
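
As an aside, the quoted 1.16 PB figure checks out. A quick sketch of the
arithmetic (usable is roughly raw * k / (k + m); journal space, filesystem
overhead and near-full ratios are ignored here):

# Proposed layout: 10 nodes, 29x 6TB OSDs each, EC profile 6+3.
nodes, osds_per_node, tb_per_osd = 10, 29, 6
k, m = 6, 3

raw_tb = nodes * osds_per_node * tb_per_osd   # 1740 TB raw
usable_tb = raw_tb * k / (k + m)              # only the k data chunks are payload
print(raw_tb, round(usable_tb))               # 1740 1160 -> ~1.16 PB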
> > 
> > > But what about RAM? Can I go with 64GB per node with the above config?
> > > I've seen OSDs consuming no more than 1GB of RAM each for replicated
> > > pools (even with 6TB drives). But what is the typical memory usage of
> > > EC pools? Does anybody know?
> > >
> > With the above config (29 OSDs) that would be just about right.
> > I always go with at least 2GB of RAM per OSD, since during a full node
> > restart and the subsequent peering, OSDs will grow a LOT larger
> > than their usual steady-state size.
> > RAM isn't that expensive these days and additional RAM comes in very
> > handy when used for pagecache and SLAB (dentry) stuff.
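
A quick sanity check of that guideline against the proposed 64GB / 29-OSD
nodes (the 2GB-per-OSD allowance is the figure above; the rest is simple
arithmetic):

osds, gb_per_osd, node_ram_gb = 29, 2, 64
osd_ram_gb = osds * gb_per_osd                 # 58 GB set aside for OSDs
print(osd_ram_gb, node_ram_gb - osd_ram_gb)    # 58 6 -> only ~6 GB left over
# That leaves little room for pagecache/SLAB, which is why 64GB is
# "just about right" rather than comfortable.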
> > 
> > Something else to think about in your specific use case is to have
> > RAID'ed OSDs.
> > It's probably a bit of a zero-sum game, but compare the above config
> > with this one. 11 nodes, each with:
> > 34x 6TB SATA HDDs (2x 17-HDD RAID6)
> > 2x 200GB S3700 SSDs (journal/OS)
> > Just 2 OSDs per node.
> > Ceph with a replication factor of 2.
> > Just shy of 1PB of effective space.
> > 
> > Minus: more physical space, less efficient HDD usage (replication vs.
> > EC).
> > 
> > Plus: far fewer of the expensive SSDs, lower CPU and RAM requirements,
> > smaller impact in case of node failure/maintenance.
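
To put numbers next to that comparison, a small sketch of the usable
capacity of both layouts (raw arithmetic only, same caveats as before; the
17-disk RAID6 is assumed to give 15 disks' worth of space):

# EC layout: 10 nodes, 29x 6TB OSDs, EC 6+3.
ec_usable_tb = 10 * 29 * 6 * 6 / (6 + 3)     # ~1160 TB

# RAID'ed-OSD layout: 11 nodes, 2 OSDs each, every OSD a 17-disk RAID6
# (two parity disks), with Ceph replication factor 2 on top.
raid6_osd_tb = (17 - 2) * 6                  # 90 TB per OSD
raid_usable_tb = 11 * 2 * raid6_osd_tb / 2   # ~990 TB, "just shy of 1PB"

print(round(ec_usable_tb), round(raid_usable_tb))   # 1160 990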
> > 
> > No ideas about the stuff below.
> > 
> > Christian
> > > Also, am I right that for a 6+3 EC profile I need at least 10 nodes to
> > > feel comfortable (one extra node for redundancy)?
> > >
> > > And finally, can someone recommend which EC plugin to use in my case?
> > > I know it's a difficult question, but still.
> > >
> > >
> > > 2016-02-16 16:12 GMT+08:00 Nick Fisk <nick@xxxxxxxxxx>:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > >> Behalf Of Tyler Bishop
> > > >> Sent: 16 February 2016 04:20
> > > >> To: Василий Ангапов <angapov@xxxxxxxxx>
> > > >> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > >> Subject: Re: Recommendations for building 1PB RadosGW
> > > >> with Erasure Code
> > > >>
> > > >> You should look at a 60-bay 4U chassis like the Cisco UCS C3260.
> > > >>
> > > >> We run 4 systems at 56x 6TB with dual E5-2660 v2 and 256GB of RAM.
> > > >> Performance is excellent.
> > > >
> > > > The only thing I will say to the OP is that if you only need 1PB,
> > > > then 4-5 of these will likely give you enough capacity. Personally I
> > > > would prefer to spread the capacity across more nodes. If you are
> > > > doing anything serious with Ceph, it's normally a good idea to try
> > > > to make each node no more than 10% of total capacity. Also, with EC
> > > > pools you will be limited in the K+M combos you can achieve with a
> > > > smaller number of nodes.
> > > >
> > > >>
> > > >> I would definitely recommend a cache tier if your data sees busy
> > > >> read traffic.
> > > >>
> > > >> Tyler Bishop
> > > >> Chief Technical Officer
> > > >> 513-299-7108 x10
> > > >>
> > > >>
> > > >>
> > > >> Tyler.Bishop@xxxxxxxxxxxxxxxxx
> > > >>
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Василий Ангапов" <angapov@xxxxxxxxx>
> > > >> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > > >> Sent: Friday, February 12, 2016 7:44:07 AM
> > > >> Subject: Recommendations for building 1PB RadosGW with
> > > >> Erasure Code
> > > >>
> > > >> Hello,
> > > >>
> > > >> We are planning to build a 1PB Ceph cluster for RadosGW with Erasure
> > > >> Code. It will be used for storing online videos.
> > > >> We do not expect outstanding write performance; something like
> > > >> 200-300MB/s of sequential writes will be quite enough, but data
> > > >> safety is very important.
> > > >> What are the most popular hardware and software recommendations?
> > > >> 1) What EC profile is best to use? What values of K/M do you
> > > >> recommend?
> > > >
> > > > The higher the total k+m you go, the more CPU you will require, and
> > > > sequential performance will degrade slightly as the IOs going to the
> > > > disks become smaller. However, larger numbers allow you to be
> > > > more creative with failure scenarios and "replication" efficiency.
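
To illustrate the "smaller IOs going to the disks" point, a simplified
sketch of the per-OSD chunk size for a hypothetical 4MB RadosGW object
(real striping depends on the pool's stripe settings, so treat this as a
rough model only):

obj_kb = 4 * 1024
for k, m in [(4, 2), (6, 3), (10, 4)]:
    chunk_kb = obj_kb / k    # each of the k data chunks lands on its own OSD
    print(f"k={k} m={m}: ~{chunk_kb:.0f}KB per chunk, {k + m} OSDs touched per write")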
> > > >
> > > >> 2) Do I need to use a Cache Tier for RadosGW, or is it only needed
> > > >> for RBD? Is it
> > > >
> > > > It's only required for RBD, but depending on workload RadosGW may
> > > > still benefit. If you are mostly doing large IOs, the gains will be a
> > > > lot smaller.
> > > >
> > > >> still an overall good practice to use a Cache Tier for RadosGW?
> > > >> 3) What hardware is recommended for EC? I assume higher-clocked
> > > >> CPUs are needed? What about RAM?
> > > >
> > > > Total GHz is more important (i.e. GHz x cores). Go with the
> > > > cheapest / most power-efficient CPUs you can get. Aim for somewhere
> > > > around 1GHz per disk.
> > > >
> > > >> 4) What SSDs for Ceph journals are the best?
> > > >
> > > > Intel S3700 or P3700 (if you can stretch)
> > > >
> > > > By all means explore other options, but you can't go wrong by
> > > > buying these. Think "You can't get fired for buying Cisco" quote!!!
> > > >
> > > >>
> > > >> Thanks a lot!
> > > >>
> > > >> Regards, Vasily.
> > > >
> > 
> > 
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



