Hello,

On Mon, 6 Jun 2016 14:14:17 -0500 Brady Deetz wrote:

> This is an interesting idea that I hadn't yet considered testing.
>
> My test cluster is also looking like 2K per object.
>
> It looks like our hardware purchase for a one-half sized pilot is
> getting approved and I don't really want to modify it when we're this
> close to moving forward. So, using spare NVMe capacity is certainly an
> option, but increasing my OS disk size or replacing OSDs is pretty much
> a no-go for this iteration of the cluster.
>
> My single concern with the idea of using the NVMe capacity is the
> potential to affect journal performance, which is already cutting it
> close with each NVMe supporting 12 journals.

I thought you might say that. ^o^
Consider, however, that when your journals are busy due to massive large
writes, that also means little meta-data activity.

> It seems to me that it would probably be better to replace 2 HDD OSDs
> with 2 SSD OSDs and put the metadata pool on those dedicated SSDs. Even
> if testing goes well on the NVMe-based pool, dedicated SSDs seem like a
> safer play and may be what I implement when we buy our second round of
> hardware to finish out the cluster and go live (January-March 2017).
>
Again, if you can afford this, bully for you. ^_^
With dedicated SSDs, small to medium-sized S3710s are probably the way
forward.

Christian

> On Mon, Jun 6, 2016 at 12:02 PM, David <dclistslinux@xxxxxxxxx> wrote:
>
> > On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >> Hello,
> >>
> >> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
> >>
> >> > I'm hoping to implement CephFS in production at some point this
> >> > year, so I'd be interested to hear about your progress on this.
> >> >
> >> > Have you considered SSD for your metadata pool? You wouldn't need
> >> > loads of capacity, although even with reliable SSDs I'd probably
> >> > still do x3 replication for metadata. I've been looking at the
> >> > Intel S3610s for this.
> >> >
> >> That's an interesting and potentially quite beneficial thought, but it
> >> depends on a number of things (more below).
> >>
> >> I'm using S3610s (800GB) for a cache pool with 2x replication and am
> >> quite happy with that, but then again I have a very predictable usage
> >> pattern, am monitoring those SSDs religiously, and am sure they will
> >> outlive things by a huge margin.
> >>
> >> We didn't go for 3x replication due to (in order):
> >> a) cost
> >> b) rack space
> >> c) increased performance with 2x
> >
> > I'd also be happy with 2x replication for data pools and that's
> > probably what I'll do for the reasons you've given. I plan on using
> > File Layouts to map some dirs to the SSD pool. I'm testing this at the
> > moment and it's an awesome feature. I'm just very paranoid about the
> > metadata, and considering the relatively low capacity requirement I'd
> > stick with 3x replication, although as you say that means a
> > performance hit.
> >

(On the file layout bit, there's a quick sketch of the xattr dance a bit
further down.)

> >> Now for how useful/helpful a fast meta-data pool would be, I reckon it
> >> depends on a number of things:
> >>
> >> a) Is the cluster write- or read-heavy?
> >> b) Do reads, flocks, or anything else that is not directly considered
> >>    a read cause writes to the meta-data pool?
> >> c) Anything else that might cause write storms to the meta-data pool,
> >>    like the bit in the current NFS over CephFS thread with sync?
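Since file layouts came up: pinning a directory (and everything created
under it from then on) to an SSD-backed data pool is just a matter of
setting the layout xattr on it, once that pool has been created and added
to the filesystem as a data pool. A rough sketch, with made-up names (the
mount point "/mnt/cephfs" and the pool name "cephfs_ssd" are assumptions,
adjust to taste):

#!/usr/bin/env python3
# Rough sketch: direct new files under one CephFS directory to an
# SSD-backed data pool by setting the layout virtual xattr.
# Assumptions: CephFS is mounted at /mnt/cephfs and a data pool named
# "cephfs_ssd" has already been created and added to the filesystem.
import os

MOUNTPOINT = "/mnt/cephfs"   # assumed CephFS mount point
SSD_POOL = "cephfs_ssd"      # assumed SSD-backed data pool name

hot_dir = os.path.join(MOUNTPOINT, "hot")
os.makedirs(hot_dir, exist_ok=True)

# Same effect as: setfattr -n ceph.dir.layout.pool -v cephfs_ssd /mnt/cephfs/hot
os.setxattr(hot_dir, "ceph.dir.layout.pool", SSD_POOL.encode())

# Read the layout back to confirm it took effect.
print(os.getxattr(hot_dir, "ceph.dir.layout.pool").decode())

Note that this only affects files created after the xattr is set; files
that already exist stay in whatever pool they were written to.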
> >>
> >> A quick glance at my test cluster seems to indicate that CephFS meta
> >> data per filesystem object is about 2KB; somebody with actual clues,
> >> please confirm this.
> >>
> >
> > 2K per object appears to be the case on my test cluster too.
> >
> >> Brady has large amounts of NVMe space left over in his current
> >> design; assuming 10GB journals, about 2.8TB of raw space.
> >> So if running the (verified) numbers indicates that the meta data can
> >> fit in this space, I'd put it there.

(Some very rough numbers on that are in the P.S. at the end of this mail.)

> >>
> >> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool
> >> storage may be the way forward.
> >>
> >> Regards,
> >>
> >> Christian
> >>
> >> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz <bdeetz@xxxxxxxxx>
> >> > wrote:
> >> >
> >> > > Question:
> >> > > I'm curious if there is anybody else out there running CephFS at
> >> > > the scale I'm planning for. I'd like to know some of the issues
> >> > > you didn't expect that I should be looking out for. I'd also like
> >> > > to simply see when CephFS hasn't worked out and why. Basically,
> >> > > give me your war stories.
> >> > >
> >> > > Problem Details:
> >> > > Now that I'm out of my design phase and finished testing on VMs,
> >> > > I'm ready to drop $100k on a pilot. I'd like to get some sense of
> >> > > confidence from the community that this is going to work before I
> >> > > pull the trigger.
> >> > >
> >> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS
> >> > > 7320 with CephFS by this time next year (hopefully by December).
> >> > > My workload is a mix of small and very large files (100GB+ in
> >> > > size). We do fMRI analysis on DICOM image sets as well as other
> >> > > physio data collected from subjects. We also have plenty of
> >> > > spreadsheets, scripts, etc. Currently 90% of our analysis is I/O
> >> > > bound and generally sequential.
> >> > >
> >> > > In deploying Ceph, I am hoping to see more throughput than the
> >> > > 7320 can currently provide. I'm also looking to get away from
> >> > > traditional file-systems that require forklift upgrades. That's
> >> > > where Ceph really shines for us.
> >> > >
> >> > > I don't have a total file count, but I do know that we have about
> >> > > 500k directories.
> >> > >
> >> > > Planned Architecture:
> >> > >
> >> > > Storage Interconnect:
> >> > > Brocade VDX 6940 (40 gig)
> >> > >
> >> > > Access Switches for clients (servers):
> >> > > Brocade VDX 6740 (10 gig)
> >> > >
> >> > > Access Switches for clients (workstations):
> >> > > Brocade ICX 7450
> >> > >
> >> > > 3x MON:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for LevelDB
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet
> >> > >
> >> > > 2x MDS:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for LevelDB (is this necessary?)
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet
> >> > >
> >> > > 8x OSD:
> >> > > 128GB RAM
> >> > > 2x 200GB SSD for OS
> >> > > 2x 400GB P3700 for Journals
> >> > > 24x 6TB Enterprise SATA
> >> > > 2x E5-2660v4
> >> > > 1x Dual Port 40Gb Ethernet


-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
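P.S.: To put some very rough numbers on the "does the meta-data fit into
the spare NVMe space" question, a back-of-the-envelope sketch using only
figures from this thread (the ~2KB/object value is still unverified, and
the ~2.8TB of leftover raw NVMe space is my earlier estimate):

# Back-of-the-envelope capacity check for a meta-data pool on the
# leftover NVMe space. All inputs are rough estimates from this thread.
raw_spare_bytes  = 2.8e12     # ~2.8TB of raw NVMe left after journals
replication      = 3          # 3x replication for the meta-data pool
bytes_per_object = 2 * 1024   # ~2KB of meta-data per filesystem object

objects_that_fit = raw_spare_bytes / replication / bytes_per_object
print("roughly %.0f million objects" % (objects_that_fit / 1e6))
# -> roughly 456 million objects

Unless the total file count ends up in the hundreds of millions, raw
capacity should not be the limiting factor there; the open question
remains how much write traffic the meta-data pool would add to the
journal NVMes.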