On Wed, 23 Aug 2017 13:38:25 +0800 Nick Tan wrote:

> Thanks for the advice Christian.  I think I'm leaning more towards the
> 'traditional' storage server with 12 disks - as you say they give a lot
> more flexibility with the performance tuning/network options etc.
>
> The cache pool is an interesting idea but as you say it can get quite
> expensive for the capacities we're looking at.  I'm interested in how
> bluestore performs without a flash/SSD WAL/DB.  In my small scale
> testing it seems much better than filestore, so I was planning on
> building something without any flash/SSD.  There's always the option of
> adding it later if required.
>
Given the lack (for large writes) of double writes with Bluestore, that's
to be expected.

Since you're looking mostly at largish, sequential writes and reads, a
pure HDD cluster may be feasible.

Christian

> Thanks,
> Nick
>
> On Tue, Aug 22, 2017 at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> > Hello,
> >
> > On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:
> >
> > > Hi Christian,
> > >
> > > > > Hi David,
> > > > >
> > > > > The planned usage for this CephFS cluster is scratch space for
> > > > > an image processing cluster with 100+ processing nodes.
> > > >
> > > > Lots of clients, how much data movement would you expect, how many
> > > > images come in per timeframe, let's say an hour?
> > > > Typical size of an image?
> > > >
> > > > Does an image come in and then get processed by one processing
> > > > node?
> > > > Unlikely to be touched again, at least in the short term?
> > > > Probably being deleted after being processed?
> > > >
> > > We'd typically get up to 6TB of raw imagery per day at an average
> > > image size of 20MB.  There's a complex multi-stage processing chain
> > > that happens - typically images are read by multiple nodes, with
> > > intermediate data generated and processed again by multiple nodes.
> > > This would generate about 30TB of intermediate data.  The end result
> > > would be around 9TB of final processed data.  Once the processing is
> > > complete and the final data has been copied off and passed QA, the
> > > entire data set is deleted.  The data sets could remain on the file
> > > system for up to 2 weeks before deletion.
> > >
> > If this is a more or less sequential process w/o too many spikes, a
> > hot (daily) SSD pool or cache-tier may work wonders.
> > 45TB of flash storage would be a bit spendy, though.
> >
> > 630TB total, let's call it 800; that's already 20 nodes with 12x 10TB
> > HDDs.
> >
> > > > > My thinking is we'd be better off with a large number (100+) of
> > > > > storage hosts with 1-2 OSDs each, rather than 10 or so storage
> > > > > nodes with 10+ OSDs, to get better parallelism, but I don't have
> > > > > any practical experience with CephFS to really judge.
> > > >
> > > > CephFS is one thing (of which I have very limited experience), but
> > > > at this point you're talking about parallelism in Ceph (RBD).
> > > > And that happens much more at the OSD than the host level.
> > > >
> > > > Which you _can_ achieve with larger nodes, if they're well
> > > > designed.
> > > > Meaning CPU/RAM/internal storage bandwidth/network bandwidth being
> > > > in "harmony".
> > > >
> > > I'm not sure what you mean about the RBD reference.  Does CephFS use
> > > RBD internally?
> > >
> > RADOS, the underlying layer.
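
For context: CephFS sits directly on RADOS pools (one for file data, one
for metadata); RBD is simply another client of RADOS.  A minimal sketch of
what creating the file system looks like - the pool names and PG counts
below are placeholders only, size them for your own cluster:

  # two plain RADOS pools, one for file data and one for metadata
  ceph osd pool create cephfs_data 2048
  ceph osd pool create cephfs_metadata 256

  # tie them together into a CephFS file system served by the MDS
  ceph fs new cephfs cephfs_metadata cephfs_data

Clients (kernel or ceph-fuse) then talk to the MDS for metadata and to the
OSDs directly for file data, which is where the per-OSD parallelism above
comes from.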
> > > > Also you keep talking about really huge HDDs; you could do worse
> > > > than halving their size and doubling their numbers to achieve much
> > > > more bandwidth and the ever crucial IOPS (even in your use case).
> > > >
> > > > So something like 20x 12-HDD servers, with SSDs/NVMes for
> > > > journal/Bluestore WAL/DB if you can afford or actually need it.
> > > >
> > > > CephFS metadata on an SSD pool isn't the most dramatic improvement
> > > > one can do (or so people tell me), but given your budget it may be
> > > > worthwhile.
> > > >
> > > Yes, I totally get the benefits of using greater numbers of smaller
> > > HDDs.  One of the requirements is to keep $/TB low and large capacity
> > > drives help with that.  I guess we need to look at the tradeoff of
> > > $/TB vs number of spindles for performance.
> > >
> > Again, if it's mostly sequential, the IOPS needs will of course be very
> > different from a scenario where you get 100 images coming in at once
> > while the processing nodes are munching on the previous 100.
> >
> > > If CephFS's parallelism happens more at the OSD level than the host
> > > level then perhaps the 12-disk storage host would be fine, as long as
> > > "mon_osd_down_out_subtree_limit = host" and there's enough
> > > CPU/RAM/bus and network bandwidth on the host.  I'm doing some cost
> > > comparisons of these "big" servers vs multiple "small" servers such
> > > as the Supermicro MicroCloud chassis or the Ambedded Mars 200 ARM
> > > cluster (which looks very interesting).
> >
> > The latter at least avoids the usual issue of underpowered, high
> > latency networking that these kinds of designs (one from Supermicro
> > comes to mind) tend to have, but 2GB of RAM and the CPU feel... weak.
> >
> > Also you will have to buy an SSD for each in case you want/need
> > journals (or fast WAL/DB with Bluestore).
> > Spendy and massively annoying if anything fails with these things (no
> > hot-swap).
> >
> > > However, cost is not the sole consideration, so I'm hoping to get an
> > > idea of the performance differences between the two architectures to
> > > help with the decision making process, given the lack of test
> > > equipment available.
> > >
> > If you compare the above bits, they should perform within the same
> > ballpark when it comes to sequential operations.
> > But it is a lot easier to beef up a medium-sized node (to a point) than
> > something like those high-density solutions.
> >
> > With a larger node you have the option to go for 25Gb/s (lower latency)
> > NICs easily, and with just 12 HDDs you can keep it to one NUMA node
> > (also look at the upcoming AMD Epyc stuff) with fast cores (lower
> > latency again) and enough RAM to have significant page cache effects
> > AND, even more importantly, keep SLAB data like inodes in RAM.
> >
> > Which reminds me, I don't have the faintest idea how this (lots of RAM)
> > will apply to or help with Bluestore....
> >
> > Christian
> >
> > > > > And I don't have enough hardware to set up a test cluster of any
> > > > > significant size to run some actual testing.
> > > > >
> > > > You may want to set up something to get a feeling for CephFS, to
> > > > see if it's right for you or if something else on top of RBD may be
> > > > more suitable.
> > > >
> > > I've set up a 3-node cluster, 2 OSD servers and 1 mon/MDS, to get a
> > > feel for Ceph and CephFS.  It looks pretty straightforward and
> > > performs well enough given the lack of nodes.
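
If you do end up on the 12-disk-per-host design, the two knobs touched on
above would look roughly like this.  This is only a sketch, assuming a
Luminous cluster with device classes; the rule and pool names are
placeholders:

  # ceph.conf on the monitors: don't auto-mark OSDs out when an entire
  # host goes down, so a node reboot doesn't trigger a full rebalance
  [mon]
      mon_osd_down_out_subtree_limit = host

and, for putting the CephFS metadata pool on flash:

  # CRUSH rule limited to SSD-class OSDs, then move the metadata pool
  # onto it
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-only

Whether metadata on SSD pays off for this workload is hard to say up
front, as noted above, but it is cheap to try since the metadata pool is
tiny compared to the data pool.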
> > > Thanks,
> > > Nick
> > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Rakuten Communications
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com