On Wed, 23 Aug 2017 13:38:25 +0800 Nick Tan wrote:

> Thanks for the advice Christian.  I think I'm leaning more towards the
> 'traditional' storage server with 12 disks - as you say they give a lot
> more flexibility with the performance tuning/network options etc.
>
> The cache pool is an interesting idea but as you say it can get quite
> expensive for the capacities we're looking at.  I'm interested in how
> bluestore performs without a flash/SSD WAL/DB.  In my small scale
> testing it seems much better than filestore, so I was planning on
> building something without any flash/SSD.  There's always the option of
> adding it later if required.
>
Given the lack (for large writes) of double writes with Bluestore, that's
to be expected.

Since you're looking mostly at largish, sequential writes and reads, a
pure HDD cluster may be feasible.

Christian

> Thanks,
> Nick
>
> On Tue, Aug 22, 2017 at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > Hello,
> >
> > On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:
> >
> > > Hi Christian,
> > >
> > > > > Hi David,
> > > > >
> > > > > The planned usage for this CephFS cluster is scratch space for
> > > > > an image processing cluster with 100+ processing nodes.
> > > >
> > > > Lots of clients, how much data movement would you expect, how many
> > > > images come in per timeframe, let's say an hour?
> > > > Typical size of an image?
> > > >
> > > > Does an image come in and then get processed by one processing
> > > > node?
> > > > Unlikely to be touched again, at least in the short term?
> > > > Probably being deleted after being processed?
> > > >
> > > We'd typically get up to 6TB of raw imagery per day at an average
> > > image size of 20MB.  There's a complex multi-stage processing chain
> > > that happens - typically images are read by multiple nodes, with
> > > intermediate data generated and processed again by multiple nodes.
> > > This would generate about 30TB of intermediate data.  The end result
> > > would be around 9TB of final processed data.  Once the processing is
> > > complete and the final data has been copied off and passed QA, the
> > > entire data set is deleted.  The data sets could remain on the file
> > > system for up to 2 weeks before deletion.
> > >
> > If this is a more or less sequential process w/o too many spikes, a
> > hot (daily) SSD pool or cache-tier may work wonders.
> > 45TB of flash storage would be a bit spendy, though.
> >
> > 630TB total, let's call it 800; that's already 20 nodes with 12x 10TB
> > HDDs.
> >
> > > > > My thinking is we'd be better off with a large number (100+) of
> > > > > storage hosts with 1-2 OSDs each, rather than 10 or so storage
> > > > > nodes with 10+ OSDs, to get better parallelism, but I don't have
> > > > > any practical experience with CephFS to really judge.
> > > >
> > > > CephFS is one thing (of which I have very limited experience), but
> > > > at this point you're talking about parallelism in Ceph (RBD).
> > > > And that happens much more at the OSD than the host level.
> > > >
> > > > Which you _can_ achieve with larger nodes, if they're well
> > > > designed.
> > > > Meaning CPU/RAM/internal storage bandwidth/network bandwidth being
> > > > in "harmony".
> > > >
> > > I'm not sure what you mean about the RBD reference.  Does CephFS use
> > > RBD internally?
> > >
> > RADOS, the underlying layer.
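
For context: CephFS sits directly on RADOS pools (one for file data, one
for metadata); RBD is simply another client of RADOS.  A minimal sketch of
what creating the file system looks like - the pool names and PG counts
below are placeholders only, size them for your own cluster:

  # two plain RADOS pools, one for file data and one for metadata
  ceph osd pool create cephfs_data 2048
  ceph osd pool create cephfs_metadata 256

  # tie them together into a CephFS file system served by the MDS
  ceph fs new cephfs cephfs_metadata cephfs_data

Clients (kernel or ceph-fuse) then talk to the MDS for metadata and to the
OSDs directly for file data, which is where the per-OSD parallelism above
comes from.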
> > > > Also you keep talking about really huge HDDs; you could do worse
> > > > than halving their size and doubling their numbers to achieve much
> > > > more bandwidth and the ever crucial IOPS (even in your use case).
> > > >
> > > > So something like 20x 12-HDD servers, with SSDs/NVMes for
> > > > journal/Bluestore WAL/DB if you can afford or actually need it.
> > > >
> > > > CephFS metadata on an SSD pool isn't the most dramatic improvement
> > > > one can do (or so people tell me), but given your budget it may be
> > > > worthwhile.
> > > >
> > > Yes, I totally get the benefits of using greater numbers of smaller
> > > HDDs.  One of the requirements is to keep $/TB low and large capacity
> > > drives help with that.  I guess we need to look at the tradeoff of
> > > $/TB vs number of spindles for performance.
> > >
> > Again, if it's mostly sequential, the IOPS needs will of course be very
> > different from a scenario where you get 100 images coming in at once
> > while the processing nodes are munching on the previous 100.
> >
> > > If CephFS's parallelism happens more at the OSD level than the host
> > > level then perhaps the 12-disk storage host would be fine, as long as
> > > "mon_osd_down_out_subtree_limit = host" and there's enough
> > > CPU/RAM/bus and network bandwidth on the host.  I'm doing some cost
> > > comparisons of these "big" servers vs multiple "small" servers such
> > > as the Supermicro MicroCloud chassis or the Ambedded Mars 200 ARM
> > > cluster (which looks very interesting).
> >
> > The latter at least avoids the usual issue of underpowered, high
> > latency networking that these kinds of designs (one from Supermicro
> > comes to mind) tend to have, but 2GB of RAM and the CPU feel... weak.
> >
> > Also you will have to buy an SSD for each in case you want/need
> > journals (or fast WAL/DB with Bluestore).
> > Spendy and massively annoying if anything fails with these things (no
> > hot-swap).
> >
> > > However, cost is not the sole consideration, so I'm hoping to get an
> > > idea of the performance differences between the two architectures to
> > > help with the decision making process, given the lack of test
> > > equipment available.
> > >
> > If you compare the above bits, they should perform within the same
> > ballpark when it comes to sequential operations.
> > But it is a lot easier to beef up a medium-sized node (to a point) than
> > something like those high-density solutions.
> >
> > With a larger node you have the option to go for 25Gb/s (lower latency)
> > NICs easily, and with just 12 HDDs you can keep it to one NUMA node
> > (also look at the upcoming AMD Epyc stuff) with fast cores (lower
> > latency again) and enough RAM to have significant page cache effects
> > AND, even more importantly, keep SLAB data like inodes in RAM.
> >
> > Which reminds me, I don't have the faintest idea how this (lots of RAM)
> > will apply to or help with Bluestore....
> >
> > Christian
> >
> > > > > And I don't have enough hardware to set up a test cluster of any
> > > > > significant size to run some actual testing.
> > > > >
> > > > You may want to set up something to get a feeling for CephFS, to
> > > > see if it's right for you or if something else on top of RBD may be
> > > > more suitable.
> > > >
> > > I've set up a 3-node cluster, 2 OSD servers and 1 mon/MDS, to get a
> > > feel for Ceph and CephFS.  It looks pretty straightforward and
> > > performs well enough given the lack of nodes.
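
If you do end up on the 12-disk-per-host design, the two knobs touched on
above would look roughly like this.  This is only a sketch, assuming a
Luminous cluster with device classes; the rule and pool names are
placeholders:

  # ceph.conf on the monitors: don't auto-mark OSDs out when an entire
  # host goes down, so a node reboot doesn't trigger a full rebalance
  [mon]
      mon_osd_down_out_subtree_limit = host

and, for putting the CephFS metadata pool on flash:

  # CRUSH rule limited to SSD-class OSDs, then move the metadata pool
  # onto it
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-only

Whether metadata on SSD pays off for this workload is hard to say up
front, as noted above, but it is cheap to try since the metadata pool is
tiny compared to the data pool.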
> > > Thanks,
> > > Nick
> > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Rakuten Communications
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com