Re: CephFS in the wild

On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:

> Question:
> I'm curious if there is anybody else out there running CephFS at the
> scale I'm planning for. I'd like to know some of the issues you didn't
> expect that I should be looking out for. I'd also like to simply see
> when CephFS hasn't worked out and why. Basically, give me your war
> stories.
>
Not me, but diligently search the archives; there are people with large
CephFS deployments (despite its non-production status at the time they
deployed).
Also look at the current horror story thread about what happens when you
have huge directories.

>
> Problem Details:
> Now that I'm out of my design phase and finished testing on VMs, I'm
> ready to drop $100k on a pilot. I'd like to get some sense of confidence
> from the community that this is going to work before I pull the trigger.
>
> I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> CephFS by this time next year (hopefully by December). My workload is a
> mix of small and very large files (100GB+ in size). We do fMRI analysis
> on DICOM image sets as well as other physio data collected from
> subjects. We also have plenty of spreadsheets, scripts, etc. Currently
> 90% of our analysis is I/O bound and generally sequential.
>
There are other people here doing similar things (medical institutes,
universities); again, search the archives and maybe contact them directly.

> In deploying Ceph, I am hoping to see more throughput than the 7320 can
> currently provide. I'm also looking to get away from traditional
> file-systems that require forklift upgrades. That's where Ceph really
> shines for us.
>
> I don't have a total file count, but I do know that we have about 500k
> directories.
>
>
> Planned Architecture:
>
Well, we talked about this 2 months ago and you seem to have changed only a
few things.
So let's dissect this again...

> Storage Interconnect:
> Brocade VDX 6940 (40 gig)
>
Is this a flat (single) network for all the storage nodes?
And then links from these 40Gb/s switches down to the access switches?

This will start as a single 40Gb/s switch with a single link to each node (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch will also be connected to several 10Gb/s and 1Gb/s access switches with dual 40Gb/s uplinks.

We do intend to segment the public and private networks using VLANs untagged at the node. There are obviously many subnets on our network. The 40Gb/s switch will handle routing for those networks.

You can see the list discussion in "Public and Private network over 1
interface" from May 23, 2016 regarding some of this.
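
For illustration, splitting the two networks in ceph.conf then only takes
something like the following (the subnets here are placeholders, not the
actual ones in use):

    [global]
    # client-facing traffic (MONs, MDS, clients)
    public network  = 10.10.0.0/24
    # OSD replication and recovery traffic
    cluster network = 10.10.1.0/24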
 

> Access Switches for clients (servers):
> Brocade VDX 6740 (10 gig)
>
> Access Switches for clients (workstations):
> Brocade ICX 7450
>
> 3x MON:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
Total overkill in the CPU core arena; fewer but faster cores would be better
suited for this task.
A 6-8 core CPU with a 2.8-3GHz base clock would be nice; alas, Intel has
nothing quite like that, the closest being the E5-2643v4.

Same for RAM, MON processes are pretty frugal.

No need for NVMes for the leveldb; use 2x 400GB DC S3710 for the OS (and thus
the leveldb), and even that is being overly generous in the speed/IOPS
department.

Note also that 40Gb/s isn't really needed here, though latency and KISS do
speak in favor of it, especially if you can afford it.

Noted
 

> 2x MDS:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB (is this necessary?)
No, the MDS doesn't store any persistent data locally (its metadata lives in
RADOS), contrary to what I assumed myself before reading up on it and trying
it out for the first time.

That's what I thought. For some reason, my VAR keeps throwing these on the config.
 

> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
Dedicated MONs/MDSes are often a waste; they are suggested mainly to keep
people who don't know what they're doing from overloading things.

So in your case, I'd (again) suggest getting 3 mixed MON/MDS nodes, making
the first one a dedicated MON and giving it the lowest IP.
HW specs as discussed above; make sure to use DIMMs that allow you to
upgrade to 256GB RAM, as the MDS can grow larger than the other Ceph daemons
(from my limited experience with CephFS). A minimal mon config sketch
follows the spec list below.
So:

128GB RAM (expandable to 256GB or more)
2x E5-2643v4
2x 400GB DC S3710
1x Dual Port 40Gb Ethernet
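
To make the "lowest IP" point concrete, the mon side of ceph.conf for those
three nodes could look something like this (hostnames and addresses are
placeholders, not from the thread):

    [global]
    # the MON-only box is listed first and has the lowest IP,
    # so it gets rank 0 and tends to win the leader election
    mon initial members = mon1, mds1, mds2
    mon host = 10.10.0.11, 10.10.0.12, 10.10.0.13

The MDS daemons would then only be created on the second and third node,
e.g. with "ceph-deploy mds create mds1 mds2".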

> 8x OSD:
> 128GB RAM
Use your savings above to make that 256GB for great performance
improvements, as hot objects stay in memory and so will all dir-entries (in
SLAB).

I like this idea.
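
For what it's worth, a common way to bias the kernel toward keeping those
dentry/inode entries cached rather than reclaiming them under memory
pressure is the vfs_cache_pressure sysctl; the exact value is a tuning
choice, not something from this thread:

    # values below the default of 100 make the kernel prefer keeping
    # dentry/inode caches over reclaiming them
    vm.vfs_cache_pressure = 10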
 

> 2x 200GB SSD for OS
Overkill really. Other than the normally rather terse OSD logs, nothing
much will ever be written to them. So 3510s or at most 3610s.

> 2x 400GB P3700 for Journals
As discussed 2 months ago, this limits you to writes at half (or a quarter,
depending on your design and whether you do LACP/vLAG) of what your network
is capable of.
OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s either
(at least with filestore; bluestore is a year away at best).
So good enough, especially if you're read heavy.
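
Rough back-of-envelope numbers behind that, using Intel's spec-sheet figure
of roughly 1GB/s sustained sequential write per 400GB P3700 (approximate,
check the datasheet for your exact model):

    2x P3700 journals : ~2x 1.0 GB/s = ~2.0-2.2 GB/s of journal bandwidth
    1x 40Gb/s link    : ~5 GB/s line rate (~10 GB/s with a bonded pair)

    With filestore every write hits the journal first, so sustained client
    writes top out around what the journals can absorb, i.e. roughly half
    of a single 40Gb/s link and a quarter of a dual-link setup.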

Yeah, the thought is that we're going to be close to equilibrium. It's not too big a deal to add an extra card, so my plan was to expand to 3 if necessary after our pilot project.
 

> 24x 6TB Enterprise SATA
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet

Regards,

Christian

As always, I appreciate your comments and time. I'm looking forward to joining you and the rest of the community in operating a great Ceph environment.
 
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
