Thanks for the advice, Christian. I think I'm leaning more towards the 'traditional' storage server with 12 disks - as you say, they give a lot more flexibility with the performance tuning/network options etc.
The cache pool is an interesting idea, but as you say it can get quite expensive for the capacities we're looking at. I'm interested in how bluestore performs without a flash/SSD WAL/DB. In my small scale testing it seems much better than filestore, so I was planning on building something without any flash/SSD. There's always the option of adding it later if required.
Thanks,
Nick
On Tue, Aug 22, 2017 at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 22 Aug 2017 16:51:47 +0800 Nick Tan wrote:
> Hi Christian,
>
>
>
> > > Hi David,
> > >
> > > The planned usage for this CephFS cluster is scratch space for an image
> > > processing cluster with 100+ processing nodes.
> >
> > Lots of clients, how much data movement would you expect, how many images
> > come in per timeframe, let's say an hour?
> > Typical size of an image?
> >
> > Does an image come in and then get processed by one processing node?
> > Unlikely to be touched again, at least in the short term?
> > Probably being deleted after being processed?
> >
>
> We'd typically get up to 6TB of raw imagery per day at an average image
> size of 20MB. There's a complex multi stage processing chain that happens
> - typically images are read by multiple nodes with intermediate data
> generated and processed again by multiple nodes. This would generate about
> 30TB of intermediate data. The end result would be around 9TB of final
> processed data. Once the processing is complete and the final data is
> copied off and completed QA, the entire data set is deleted. The data sets
> could remain on the file system for up to 2 weeks before deletion.
>
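To put the figures above into the per-hour terms asked about, here is a rough sketch of the ingest arithmetic. It assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly over 24 hours, neither of which is stated in the thread; real traffic will be burstier, and reads by multiple nodes come on top of this.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the thread, in decimal units (an assumption).
RAW_BYTES_PER_DAY = 6 * 10**12       # up to 6 TB of raw imagery per day
IMAGE_BYTES = 20 * 10**6             # 20 MB average image size
TOTAL_BYTES_PER_DAY = 45 * 10**12    # 6 raw + 30 intermediate + 9 final TB


def images_per_day():
    """Number of raw images arriving per day."""
    return RAW_BYTES_PER_DAY // IMAGE_BYTES


def images_per_hour():
    """Average arrival rate, assuming an even spread over the day."""
    return images_per_day() / 24


def average_write_mb_per_s():
    """Average client write rate in MB/s, before any replication."""
    return TOTAL_BYTES_PER_DAY / SECONDS_PER_DAY / 10**6


if __name__ == "__main__":
    print(images_per_day())
    print(images_per_hour())
    print(round(average_write_mb_per_s(), 1))
```

That works out to about 300,000 images per day (12,500 per hour) and a bit over 500 MB/s of average client writes.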
If this is a more or less sequential process w/o too many spikes, a hot
(daily) SSD pool or cache-tier may work wonders.
45TB of flash storage would be a bit spendy, though.
630TB total, let's call it 800, that's already 20 nodes with 12x 10TB HDDs.
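The sizing above can be reproduced with a short calculation. This is only a sketch: the 3x replication factor is an assumption that is not stated in the thread, although it is what makes 800TB usable come out at 20 nodes of 12x 10TB.

```python
import math

# Daily volumes from the thread (TB per day).
RAW_TB = 6
INTERMEDIATE_TB = 30
FINAL_TB = 9
RETENTION_DAYS = 14      # data sets stay for up to 2 weeks

# Assumption, not stated in the thread: 3x replicated pools.
REPLICATION = 3
HDDS_PER_NODE = 12
HDD_TB = 10


def daily_tb():
    """Total data written per day: raw + intermediate + final."""
    return RAW_TB + INTERMEDIATE_TB + FINAL_TB


def retained_tb():
    """Data held at once if every data set stays the full retention period."""
    return daily_tb() * RETENTION_DAYS


def nodes_needed(usable_tb):
    """Nodes required to provide usable_tb after replication."""
    raw_needed_tb = usable_tb * REPLICATION
    raw_per_node_tb = HDDS_PER_NODE * HDD_TB
    return math.ceil(raw_needed_tb / raw_per_node_tb)


if __name__ == "__main__":
    print(daily_tb())          # 45
    print(retained_tb())       # 630
    print(nodes_needed(800))   # 20
```

Rounding 630TB up to 800TB leaves roughly 20% headroom, which matters because a Ceph cluster should not be run close to full.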
>
>
> > > My thinking is we'd be better off with a large number (100+) of
> > > storage hosts with 1-2 OSD's each, rather than 10 or so storage nodes
> > > with 10+ OSD's to get better parallelism, but I don't have any
> > > practical experience with CephFS to really judge.
> > CephFS is one thing (of which I have very limited experience), but at this
> > point you're talking about parallelism in Ceph (RBD).
> > And that happens much more on an OSD than host level.
> >
> > Which you _can_ achieve with larger nodes, if they're well designed.
> > Meaning CPU/RAM/internal storage bandwidth/network bandwidth being in
> > "harmony".
> >
>
> I'm not sure what you mean about the RBD reference. Does CephFS use RBD
> internally?
>
RADOS, the underlying layer.
>
> >
> > Also you keep talking about really huge HDDs, you could do worse than
> > halving their size and doubling their numbers to achieve much more
> > bandwidth and the ever crucial IOPS (even in your use case).
> >
> > So something like 20x 12 HDD servers, with SSDs/NVMes for journal/bluestore
> > WAL/DB if you can afford or actually need it.
> >
> > CephFS metadata on a SSD pool isn't the most dramatic improvement one can
> > do (or so people tell me), but given your budget it may be worthwhile.
> >
> >
> Yes, I totally get the benefits of using greater numbers of smaller HDD's.
> One of the requirements is to keep $/TB low and large capacity drives help
> with that. I guess we need to look at the tradeoff of $/TB vs number of
> spindles for performance.
>
Again, if it's mostly sequential, the IOPS needs will of course be very
different from a scenario where you get 100 images coming in at once while
the processing nodes are munching on the previous 100.
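As a rough illustration of the $/TB versus spindle count tradeoff, the sketch below compares the same raw capacity built from 10TB and 5TB drives. The per-drive throughput and IOPS figures are assumed ballpark values for 7200rpm HDDs, not measurements, and the totals are upper bounds that ignore replication, network and CPU limits.

```python
# Assumed ballpark per-drive figures for a 7200rpm HDD (not measured).
HDD_SEQ_MB_PER_S = 150
HDD_RANDOM_IOPS = 100


def aggregate(num_drives, drive_tb):
    """Raw capacity and ideal aggregate performance for a set of drives."""
    return {
        "raw_tb": num_drives * drive_tb,
        "seq_mb_per_s": num_drives * HDD_SEQ_MB_PER_S,
        "iops": num_drives * HDD_RANDOM_IOPS,
    }


# 20 nodes x 12 drives of 10TB, versus twice as many 5TB drives.
big_drives = aggregate(20 * 12, 10)
small_drives = aggregate(2 * 20 * 12, 5)

if __name__ == "__main__":
    print(big_drives)
    print(small_drives)
```

Both layouts give 2400TB raw, but the 5TB layout has twice the spindles and therefore twice the ideal aggregate bandwidth and IOPS, at the cost of more drive bays and a higher $/TB.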
> If CephFS's parallelism happens more at the OSD level than the host level
> then perhaps the 12 disk storage host would be fine as long as
> "mon_osd_down_out_subtree_limit = host" and there's enough CPU/RAM/BUS and
> Network bandwidth on the host. I'm doing some cost comparisons of these
> "big" servers vs multiple "small" servers such as the supermicro microcloud
> chassis or the Ambedded Mars 200 ARM cluster (which looks very
> interesting).
The latter at least avoids the usual issue of underpowered and high latency
networking that these kinds of designs (one from Supermicro comes to
mind) tend to have, but 2GB RAM and CPU feel... weak.
Also you will have to buy an SSD for each in case you want/need journals
(or fast WAL/DB with bluestore).
Spendy and massively annoying if anything fails with these things (no
hot-swap).
> However, cost is not the sole consideration, so I'm hoping
> to get an idea of performance differences between the two architectures to
> help with the decision making process given the lack of test equipment
> available.
>
If you compare the above bits, they should perform within the same
ballpark when it comes to sequential operations.
But it is a lot easier to beef up a medium sized node (to a point) than
something like those high density solutions.
With a larger node you can easily go for 25Gb/s (lower latency) NICs and,
with just 12 HDDs, keep it to one NUMA node (also look at the upcoming
AMD Epyc stuff) with fast cores (lower latency again) and enough RAM to
have significant page cache effects AND, even more importantly, keep
SLAB data like inodes in RAM.
Which reminds me, I don't have the faintest idea how this (lots of RAM)
will apply to or help with Bluestore....
Christian
>
>
> >
> > > And
> > > I don't have enough hardware to set up a test cluster of any significant
> > > size to run some actual testing.
> > >
> > You may want to set up something to get a feeling for CephFS, if it's
> > right for you or if something else on top of RBD may be more suitable.
> >
> >
> I've set up a 3 node cluster, 2 OSD servers and 1 mon/mds to get a feel for
> ceph and cephFS. It looks pretty straightforward and performs well enough
> given the lack of nodes.
>
>
> Thanks,
> Nick
>
>
> > Christian
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Rakuten Communications
> >
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com