Re: CephFS in the wild

On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:

> On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> >
> > > Question:
> > > I'm curious if there is anybody else out there running CephFS at the
> > > scale I'm planning for. I'd like to know some of the issues you
> > > didn't expect that I should be looking out for. I'd also like to
> > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > your war stories.
> > >
> > Not me, but diligently search the archives; there are people with large
> > CephFS deployments (despite its non-production status when they did
> > them). Also look at the current horror-story thread about what happens
> > when you have huge directories.
> >
> > >
> > > Problem Details:
> > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > confidence from the community that this is going to work before I
> > > pull the trigger.
> > >
> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> > > with CephFS by this time next year (hopefully by December). My
> > > workload is a mix of small and very large files (100GB+ in size). We
> > > do fMRI analysis on DICOM image sets as well as other physio data
> > > collected from subjects. We also have plenty of spreadsheets,
> > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > generally sequential.
> > >
> > There are other people here doing similar things (medical institutes,
> > universities), again search the archives and maybe contact them
> > directly.
> >
> > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > can currently provide. I'm also looking to get away from traditional
> > > file-systems that require forklift upgrades. That's where Ceph really
> > > shines for us.
> > >
> > > I don't have a total file count, but I do know that we have about
> > > 500k directories.
> > >
> > >
> > > Planned Architecture:
> > >
> > Well, we talked about this 2 months ago and you seem to have changed
> > only a few things.
> > So let's dissect this again...
> >
> > > Storage Interconnect:
> > > Brocade VDX 6940 (40 gig)
> > >
> > Is this a flat (single) network for all the storage nodes?
> > And then from these 40Gb/s switches links to the access switches?
> >
> 
> This will start as a single 40Gb/s switch with a single link to each node
> (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
> will also be connected to several 10Gb/s and 1Gb/s access switches with
> dual 40Gb/s uplinks.
> 
So initially 80Gb/s, and with the 2nd switch probably 160Gb/s, for your
clients.
Network-wise, your 8 storage servers outstrip that; in terms of actual
storage bandwidth and IOPS you're looking at 8x 2GB/s, aka 160Gb/s, of
best-case writes, so a match.
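
For what it's worth, a rough back-of-the-envelope of that sizing (all
per-device figures and the replication factor below are assumptions on
my part, not measurements, so plug in your own numbers):

# Ballpark throughput/capacity sketch for the proposed 8-node cluster.
nodes          = 8
nvme_per_node  = 2
nvme_write_gbs = 1.0       # GB/s sequential write assumed per 400GB P3700
hdds_per_node  = 24
hdd_size_tb    = 6
replication    = 3         # assumed size=3 pools

# With filestore every write goes through the journal, so the two NVMes
# cap each node at roughly 2 GB/s of incoming writes.
node_write_gbs    = nvme_per_node * nvme_write_gbs
cluster_write_gbs = nodes * node_write_gbs   # ~16 GB/s, same ballpark as the uplinks

raw_tb    = nodes * hdds_per_node * hdd_size_tb   # 1152 TB raw
usable_tb = raw_tb / replication                  # ~384 TB before overhead and near-full margins

print(cluster_write_gbs, raw_tb, usable_tb)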

> We do intend to segment the public and private networks using VLANs
> untagged at the node. There are obviously many subnets on our network.
> The 40Gb/s switch will handle routing for those networks.
> 
> You can see list discussion in "Public and Private network over 1
> interface" May 23, 2016 regarding some of this.
> 
And I did comment in that thread, the final one actually. ^o^

Unless you can come up with a _very_ good reason not covered in that
thread, I'd keep it to one network.

Once the 2nd switch is in place and running vLAG (LACP on your servers),
your per-host network bandwidth VASTLY exceeds that of your storage.
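
If you do keep it to one network, the ceph.conf side of it is simply not
defining a cluster network at all; a minimal sketch below (the fsid,
monitor names, and addresses are placeholders, not yours):

# Minimal single-network ceph.conf sample; with no "cluster network"
# entry, OSD replication traffic shares the public (40Gb/s) network
# with the clients.
sample = """\
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon initial members = mon1, mon2, mon3
mon host = 10.10.10.11, 10.10.10.12, 10.10.10.13
public network = 10.10.10.0/24
# no "cluster network" line: replication uses the public network
"""
print(sample)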

> 
> >
> > > Access Switches for clients (servers):
> > > Brocade VDX 6740 (10 gig)
> > >
> > > Access Switches for clients (workstations):
> > > Brocade ICX 7450
> > >
> > > 3x MON:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > Total overkill in the CPU core arena; fewer but faster cores would be
> > better suited for this task.
> > A 6-8 core CPU with a 2.8-3GHz base speed would be nice, but alas Intel
> > has nothing like that; the closest one would be the E5-2643v4.
> >
> > Same for RAM, MON processes are pretty frugal.
> >
> > No need for NVMes for the leveldb; use 2x 400GB DC S3710s for the OS
> > (and thus the leveldb) and that's still being overly generous in the
> > speed/IOPS department.
> >
> > Note also that 40Gb/s isn't really needed here, though latency and KISS
> > do speak in favor of it, especially if you can afford it.
> >
> 
> Noted
> 
> 
> >
> > > 2x MDS:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB (is this necessary?)
> > No, there isn't any persistent data with the MDS, contrary to what I
> > assumed as well before reading up on it and trying it out for the first
> > time.
> >
> 
> That's what I thought. For some reason, my VAR keeps throwing these on
> the config.
> 
That's their job after all, selling you hardware that you don't need so
that they can create added value (for themselves). ^o^
 
> 
> >
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > Dedicated MONs/MDS are often a waste; they are suggested to keep
> > people who don't know what they're doing from overloading things.
> >
> > So in your case, I'd (again) suggest getting 3 mixed MON/MDS nodes,
> > making the first one a dedicated MON and giving it the lowest IP.
> > HW specs as discussed above; make sure to use DIMMs that allow you to
> > upgrade to 256GB RAM, as the MDS can grow larger than the other Ceph
> > daemons (from my limited experience with CephFS).
> > So:
> >
> > 128GB RAM (expandable to 256GB or more)
> > 2x E5-2643v4
> > 2x 400GB DC S3710
> > 1x Dual Port 40Gb Ethernet
> >
> > > 8x OSD:
> > > 128GB RAM
> > Use your savings above to make that 256GB, for great performance
> > improvements as hot objects stay in memory and so will all dir-entries
> > (in SLAB).
> >
> 
> I like this idea.
> 
> 
> >
> > > 2x 200GB SSD for OS
> > Overkill really. Other than the normally rather terse OSD logs, nothing
> > much will ever be written to them. So 3510s or at most 3610s.
> >
> > > 2x 400GB P3700 for Journals
> > As discussed 2 months ago, this limits you to writes at half (or a
> > quarter, depending on your design and whether you do LACP/vLAG) of what
> > your network is capable of.
> > OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s
> > either (at least with filestore; bluestore is a year away at best).
> > So good enough, especially if you're read-heavy.
> >
> 
> Yeah, the thought is that we're going to be close to equilibrium. It's
> not too big a deal to add an extra card, so my plan was to expand to 3 if
> necessary after our pilot project.
> 
Probably not needed, because the moment you're not doing synthetic,
large, sequential-write-only tests you will find that random writes and
reads (these can be offset up to a point by the large RAM) will slow
your HDDs down to well below the speed of the NVMes.
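
To put a rough (assumed, not benchmarked) number on that:

# Random-I/O ceiling of one OSD node's spindles vs. its journal NVMes;
# the IOPS figure is a ballpark assumption for 7200rpm SATA drives.
hdds_per_node   = 24
hdd_random_iops = 100
io_size_kb      = 64

random_mb_s = hdds_per_node * hdd_random_iops * io_size_kb / 1024.0
print("~%d MB/s of 64KB random I/O per node" % random_mb_s)
# ~150 MB/s, an order of magnitude below the ~2000 MB/s journal ceiling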

Christian
> 
> >
> > > 24x 6TB Enterprise SATA
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> >
> > Regards,
> >
> > Christian
> >
> 
> As always, I appreciate your comments and time. I'm looking forward to
> joining you and the rest of the community in operating a great Ceph
> environment.
> 
> 
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


