Re: pros/cons of multiple OSD's per host

Hi Christian,

 
>> Hi David,
>>
>> The planned usage for this CephFS cluster is scratch space for an image
>> processing cluster with 100+ processing nodes.

> Lots of clients.  How much data movement would you expect, and how many
> images come in per timeframe, let's say an hour?
> Typical size of an image?
>
> Does an image come in and then get processed by one processing node?
> Unlikely to be touched again, at least in the short term?
> Probably deleted after being processed?

We'd typically get up to 6TB of raw imagery per day at an average image size of 20MB.  There's a complex multi-stage processing chain: images are typically read by multiple nodes, intermediate data is generated, and that is processed again by multiple nodes.  This would generate about 30TB of intermediate data.  The end result would be around 9TB of final processed data.  Once the processing is complete and the final data has been copied off and passed QA, the entire data set is deleted.  The data sets could remain on the file system for up to 2 weeks before deletion.
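To put rough numbers on that (back-of-envelope only, using the figures above with decimal units and assuming an even spread over 24 hours; real ingest will be burstier than this):

    # 6 TB/day of 20 MB images, plus ~30 TB intermediate and ~9 TB final written per day
    echo "images/day:     $(( 6 * 10**6 / 20 ))"                 # ~300,000
    echo "images/hour:    $(( 6 * 10**6 / 20 / 24 ))"            # ~12,500
    echo "avg write MB/s: $(( (6 + 30 + 9) * 10**6 / 86400 ))"   # ~520 MB/s, with the reads on top of that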
 


>> My thinking is we'd be
>> better off with a large number (100+) of storage hosts with 1-2 OSD's each,
>> rather than 10 or so storage nodes with 10+ OSD's, to get better parallelism,
>> but I don't have any practical experience with CephFS to really judge.
> CephFS is one thing (of which I have very limited experience), but at this
> point you're talking about parallelism in Ceph (RBD).
> And that happens much more at the OSD than the host level.
>
> Which you _can_ achieve with larger nodes, if they're well designed,
> meaning CPU/RAM/internal storage bandwidth/network bandwidth being in
> "harmony".

I'm not sure what you mean about the RBD reference.  Does CephFS use RBD internally?
 

> Also, you keep talking about really huge HDDs; you could do worse than
> halving their size and doubling their numbers to achieve much more
> bandwidth and the ever-crucial IOPS (even in your use case).
>
> So something like 20x 12-HDD servers, with SSDs/NVMes for journal/BlueStore
> WAL/DB if you can afford or actually need it.
>
> CephFS metadata on an SSD pool isn't the most dramatic improvement one can
> make (or so people tell me), but given your budget it may be worthwhile.
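On the metadata pool point: as I understand it, with Luminous device classes that's just a dedicated CRUSH rule pointed at the SSDs, something along these lines (rule and pool names below are examples, not from a real cluster, and I haven't tested this):

    # create a replicated rule restricted to SSD-class OSDs, then move the metadata pool onto it
    ceph osd crush rule create-replicated replicated-ssd default host ssd
    ceph osd pool set cephfs_metadata crush_rule replicated-ssd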


Yes, I totally get the benefits of using greater numbers of smaller HDD's.  One of the requirements is to keep $/TB low, and large-capacity drives help with that.  I guess we need to look at the tradeoff of $/TB vs number of spindles for performance.
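Just to make that tradeoff concrete, some very rough spindle math (the per-disk figures of ~80 random IOPS and ~180 MB/s streaming for 7.2k drives, and the ~1.2 PB raw target, are my own ballpark assumptions, not measured numbers):

    echo "120 x 10TB: $(( 120 * 80 )) IOPS, $(( 120 * 180 / 1000 )) GB/s streaming (raw)"   # ~9,600 IOPS, ~21 GB/s
    echo "240 x  5TB: $(( 240 * 80 )) IOPS, $(( 240 * 180 / 1000 )) GB/s streaming (raw)"   # ~19,200 IOPS, ~43 GB/s

Same raw capacity, roughly double the aggregate IOPS and streaming bandwidth, at the cost of more bays, HBAs and power.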

If CephFS's parallelism happens more at the OSD level than the host level, then perhaps the 12-disk storage hosts would be fine, as long as "mon_osd_down_out_subtree_limit = host" is set and there's enough CPU/RAM/bus and network bandwidth on each host.  I'm doing some cost comparisons of these "big" servers vs multiple "small" servers such as the Supermicro MicroCloud chassis or the Ambedded Mars 200 ARM cluster (which looks very interesting).  However, cost is not the sole consideration, so I'm hoping to get an idea of the performance differences between the two architectures to help with the decision making process, given the lack of test equipment available.
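For reference, a quick way to confirm that option took effect on a running mon would be something like the following (assuming the mon id matches the short hostname, which may not be the case on every deployment):

    ceph daemon mon.$(hostname -s) config show | grep mon_osd_down_out_subtree_limit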

 

>> And
>> I don't have enough hardware to set up a test cluster of any significant
>> size to run some actual testing.
>>
> You may want to set up something to get a feeling for CephFS and see whether
> it's right for you or whether something else on top of RBD may be more suitable.


I've set up a 3-node cluster (2 OSD servers and 1 mon/mds) to get a feel for Ceph and CephFS.  It looks pretty straightforward and performs well enough given the small number of nodes.
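In case it's useful to anyone else on the list, the CephFS part of a small test setup boils down to something like this (pool names and pg counts below are example lab values only, not production sizing):

    ceph osd pool create cephfs_data 64
    ceph osd pool create cephfs_metadata 16
    ceph fs new cephfs cephfs_metadata cephfs_data
    ceph fs ls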
 

Thanks,
Nick


> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Rakuten Communications

