Re: CephFS in the wild

Hello,

On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:

> I'm hoping to implement cephfs in production at some point this year so
> I'd be interested to hear your progress on this.
> 
> Have you considered SSD for your metadata pool? You wouldn't need loads
> of capacity although even with reliable SSD I'd probably still do x3
> replication for metadata. I've been looking at the intel s3610's for
> this.
> 
That's an interesting and potentially quite beneficial thought, but it
depends on a number of things (more below).

I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
happy with that, but then again I have a very predictable usage pattern,
I'm monitoring those SSDs religiously, and I'm sure they will outlive
everything else by a huge margin.
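
If anyone wants to do the same kind of wear watching, here is a minimal
sketch of the idea (assuming smartmontools is installed and that the
drives expose Intel's Media_Wearout_Indicator SMART attribute; the device
paths are placeholders, not my actual setup):

  # Rough SSD wear check -- a sketch, not what I actually run in production.
  import subprocess

  DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device paths

  for dev in DEVICES:
      out = subprocess.check_output(["smartctl", "-A", dev]).decode()
      for line in out.splitlines():
          if "Media_Wearout_Indicator" in line:
              # normalized VALUE is the 4th column of smartctl's attribute table
              value = int(line.split()[3])
              print("%s: wearout indicator at %d (100 = new)" % (dev, value))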

We didn't go for 3x replication due to (in order):
a) cost
b) rack space
c) increased performance with 2x


Now for how useful/helpful a fast meta-data pool would be, I reckon it
depends on a number of things:

a) Is the cluster write or read heavy?
b) Do reads, flocks, anything that is not directly considered a read
   cause writes to the meta-data pool?
c) Anything else that might cause write storms to the meta-data pool, like
   the sync issue discussed in the current NFS over CephFS thread?

A quick glance at my test cluster seems to indicate that CephFS metadata
comes to about 2KB per filesystem object; somebody with actual clues,
please confirm this.
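
For anyone wanting to verify that number on their own cluster, a trivial
back-of-the-envelope (the pool usage and object count below are made-up
placeholders; plug in the used bytes of your CephFS metadata pool, e.g.
from "ceph df detail", and your file+directory count):

  # Rough estimate of metadata bytes per filesystem object.
  metadata_pool_bytes_used = 1.0e9   # hypothetical: 1GB used in the metadata pool
  filesystem_objects = 500000        # hypothetical: files + directories

  per_object_kb = metadata_pool_bytes_used / filesystem_objects / 1024.0
  print("~%.1f KB of metadata per object" % per_object_kb)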

Brady has large amounts of NVMe space left over in his current design:
assuming 10GB journals, about 2.8TB of raw space.
So if running the (verified) numbers indicates that the meta data can fit
in this space, I'd put it there.
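
By "running the numbers" I mean something like the sketch below (every
input is an assumption: the ~2KB figure is my unverified estimate from
above, and the object count is a placeholder for Brady's actual
file/directory total):

  # Will the metadata pool fit into the NVMe space left after journals?
  bytes_per_object = 2 * 1024            # ~2KB per object, unverified estimate
  filesystem_objects = 50 * 1000 * 1000  # hypothetical total files + directories
  replication = 3                        # 3x for metadata, as David suggested

  metadata_raw_bytes = bytes_per_object * filesystem_objects * replication
  leftover_raw_bytes = 2.8e12            # ~2.8TB of raw NVMe left after journals

  print("metadata needs ~%.2f TB raw, ~%.2f TB available"
        % (metadata_raw_bytes / 1e12, leftover_raw_bytes / 1e12))
  print("fits" if metadata_raw_bytes <= leftover_raw_bytes else "does not fit")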

Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage may
be the way forward.

Regards,

Christian
> 
> 
> On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> 
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
> >
> >
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilot. I'd like to get some sense of
> > confidence from the community that this is going to work before I pull
> > the trigger.
> >
> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > CephFS by this time next year (hopefully by December). My workload is
> > a mix of small and very large files (100GB+ in size). We do fMRI
> > analysis on DICOM image sets as well as other physio data collected
> > from subjects. We also have plenty of spreadsheets, scripts, etc.
> > Currently 90% of our analysis is I/O bound and generally sequential.
> >
> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about 500k
> > directories.
> >
> >
> > Planned Architecture:
> >
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
> >
> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> > 8x OSD:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for Journals
> > 24x 6TB Enterprise SATA
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


