Re: CephFS in the wild

Hello,

On Mon, 6 Jun 2016 14:14:17 -0500 Brady Deetz wrote:

> This is an interesting idea that I hadn't yet considered testing.
> 
> My test cluster is also looking like 2K per object.
> 
> It looks like our hardware purchase for a one-half sized pilot is getting
> approved and I don't really want to modify it when we're this close to
> moving forward. So, using spare NVMe capacity is certainly an option, but
> increasing my OS disk size or replacing OSDs is pretty much a no go for
> this iteration of the cluster.
> 
> My single concern with the idea of using the NVMe capacity is the
> potential to affect journal performance, which is already cutting it
> close with each NVMe supporting 12 journals.

I thought you might say that. ^o^

Consider, however, that when your journals are busy with heavy, large
writes, there is correspondingly little meta-data activity.
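
As for whether the meta-data would even fit into the leftover NVMe space, a
back-of-the-envelope check is quick enough. Plain shell arithmetic below,
using the figures from elsewhere in this thread (2KB per object, 3x
replication, roughly 2.8TB of spare raw NVMe); the object count is a pure
placeholder, since only the ~500k directories are known:

  objects=50000000                              # ASSUMED total object count
  meta_gb=$(( objects * 2 * 3 / 1024 / 1024 ))  # 2KB/object x 3 replicas, in GB
  echo "meta-data pool, raw: ~${meta_gb} GB (vs. ~2.8TB spare NVMe)"

Even with a generous object count the meta-data pool only eats a small slice
of that space, so the real question is contention, not capacity.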

> It seems to me that it would
> probably be better to replace 2 HDD OSDs with 2 SSD OSDs and
> put the metadata pool on those dedicated SSDs. Even if testing goes well
> on the NVMe based pool, dedicated SSDs seem like a safer play and may be
> what I implement when we buy our second round of hardware to finish out
> the cluster and go live (January-March 2017).
> 
Again, if you can afford this, bully for you. ^_^
With dedicated SSDs, small to medium sized S3710s are probably the way
forward.
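
Wiring that up is mostly a CRUSH exercise; a rough sketch (pool, file and
ruleset names/ids below are placeholders, and the rule itself still has to
be edited in by hand):

  # dump and decompile the current CRUSH map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt: add a root containing only the SSD OSDs, plus a
  # replicated ruleset taking that root (step chooseleaf firstn 0 type host)
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  # point the CephFS meta-data pool at the new ruleset (id 1 assumed here)
  ceph osd pool set cephfs_metadata crush_ruleset 1
  ceph osd pool set cephfs_metadata size 3

Just make sure those SSD OSDs are also moved out of the default root, or the
regular data rule will happily put PGs on them as well; and with a host-level
failure domain you need at least as many SSD hosts as the replica count.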

Christian 
> 
> 
> On Mon, Jun 6, 2016 at 12:02 PM, David <dclistslinux@xxxxxxxxx> wrote:
> 
> >
> >
> > On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >>
> >> Hello,
> >>
> >> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
> >>
> >> > I'm hoping to implement CephFS in production at some point this
> >> > year, so I'd be interested to hear about your progress on this.
> >> >
> >> > Have you considered SSDs for your metadata pool? You wouldn't need
> >> > loads of capacity, although even with reliable SSDs I'd probably
> >> > still do 3x replication for metadata. I've been looking at the
> >> > Intel S3610s for this.
> >> >
> >> That's an interesting and potentially quite beneficial thought, but it
> >> depends on a number of things (more below).
> >>
> >> I'm using S3610s (800GB) for a cache pool with 2x replication and am
> >> quite happy with that, but then again I have a very predictable usage
> >> pattern, am monitoring those SSDs religiously, and am sure they will
> >> outlive things by a huge margin.
> >>
> >> We didn't go for 3x replication due to (in order):
> >> a) cost
> >> b) rack space
> >> c) increased performance with 2x
> >
> >
> > I'd also be happy with 2x replication for data pools, and that's
> > probably what I'll do for the reasons you've given. I plan on using
> > File Layouts to map some dirs to the SSD pool. I'm testing this at the
> > moment and it's an awesome feature. I'm just very paranoid about the
> > metadata, and considering the relatively low capacity requirement I'd
> > stick with 3x replication, although as you say that means a
> > performance hit.
> >
> >
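
For reference, the file-layout mapping mentioned above boils down to an
extended attribute on the directory, once the SSD-backed pool has been added
as an extra data pool; a rough sketch (filesystem, pool and path names are
made up):

  # make the SSD-backed pool available to the filesystem
  ceph fs add_data_pool cephfs ssd_data
  # files created under this directory from now on land in that pool
  setfattr -n ceph.dir.layout.pool -v ssd_data /mnt/cephfs/hot_project

Note the layout only applies to files created after the attribute is set;
existing files keep the layout they were written with.
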
> >>
> >> Now for how useful/helpful a fast meta-data pool would be, I reckon it
> >> depends on a number of things:
> >>
> >> a) Is the cluster write or read heavy?
> >> b) Do reads, flocks, or other operations that aren't obviously writes
> >>    cause writes to the meta-data pool?
> >> c) Anything else that might cause write storms to the meta-data pool,
> >>    like the sync issue in the current NFS over CephFS thread?
> >>
> >> A quick glance at my test cluster seems to indicate that CephFS
> >> meta-data per filesystem object is about 2KB; somebody with actual
> >> clues, please confirm this.
> >>
> >
> > 2K per object appears to be the case on my test cluster too.
> >
> >
> >> Brady has large amounts of NVMe space left over in his current design,
> >> assuming 10GB journals, about 2.8TB of raw space.
> >> So if running the (verified) numbers indicates that the meta data can
> >> fit in this space, I'd put it there.
> >>
> >> Otherwise, larger SSDs (indeed S3610s) for OS and meta-data pool
> >> storage may be the way forward.
> >>
> >> Regards,
> >>
> >> Christian
> >> >
> >> >
> >> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz <bdeetz@xxxxxxxxx>
> >> > wrote:
> >> >
> >> > > Question:
> >> > > I'm curious if there is anybody else out there running CephFS at
> >> > > the scale I'm planning for. I'd like to know some of the issues
> >> > > you didn't expect that I should be looking out for. I'd also like
> >> > > to simply see when CephFS hasn't worked out and why. Basically,
> >> > > give me your war stories.
> >> > >
> >> > >
> >> > > Problem Details:
> >> > > Now that I'm out of my design phase and finished testing on VMs,
> >> > > I'm ready to drop $100k on a pilot. I'd like to get some sense of
> >> > > confidence from the community that this is going to work before I
> >> > > pull the trigger.
> >> > >
> >> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> >> > > CephFS by this time next year (hopefully by December). My
> >> > > workload is a mix of small and very large files (100GB+ in size).
> >> > > We do fMRI analysis on DICOM image sets as well as other physio
> >> > > data collected from subjects. We also have plenty of
> >> > > spreadsheets, scripts, etc. Currently 90% of our analysis is I/O
> >> > > bound and generally sequential.
> >> > >
> >> > > In deploying Ceph, I am hoping to see more throughput than the
> >> > > 7320 can
> >> > > currently provide. I'm also looking to get away from traditional
> >> > > file-systems that require forklift upgrades. That's where Ceph
> >> > > really shines for us.
> >> > >
> >> > > I don't have a total file count, but I do know that we have about
> >> > > 500k directories.
> >> > >
> >> > >
> >> > > Planned Architecture:
> >> > >
> >> > > Storage Interconnect:
> >> > > Brocade VDX 6940 (40 gig)
> >> > >
> >> > > Access Switches for clients (servers):
> >> > > Brocade VDX 6740 (10 gig)
> >> > >
> >> > > Access Switches for clients (workstations):
> >> > > Brocade ICX 7450
> >> > >
> >> > > 3x MON:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for LevelDB
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet
> >> > >
> >> > > 2x MDS:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for LevelDB (is this necessary?)
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet
> >> > >
> >> > > 8x OSD:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for Journals
> >> > > 24x 6TB Enterprise SATA
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet
> >> > >
> >>
> >>
> >
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


