Ceph and my use case - is it a fit?

dev.matan@xxxxxxxxx (Matan Safriel) · Sat, 2 Aug 2014 15:44:53 +0300

Thanks John,

I really mean my files are too small for HDFS, as the majority of them will
be under 64M, which I think is (still?) the default HDFS block size, *and
also,* they will be very numerous.

As such, they would quickly consume a huge aggregate amount of RAM on the
HDFS name node, which keeps a certain number of bytes of metadata in memory
per file.
In that sense, the name node seems to have been designed to manage a
collection of huge files, not a huge collection of small files. Or at
least, judging from the documentation, it is not optimized for that.
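
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes the commonly quoted rule of thumb of about 150 bytes of name node
heap per namespace object, counting each file and each block as one object.
That constant, the 100 TiB total and the function name are my own
illustrative assumptions, not measurements from a real cluster.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not a measured value
BLOCK_SIZE = 64 * 1024**2     # the 64M block size mentioned above


def namenode_heap_bytes(num_files, file_size,
                        block_size=BLOCK_SIZE,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate name node heap used by num_files files of file_size bytes."""
    # Every file occupies at least one block, even if it is smaller than one.
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# Same (hypothetical) total volume, stored as files of different sizes.
total_bytes = 100 * 1024**4   # 100 TiB, purely illustrative
for size_mib in (20, 100, 1024):
    file_size = size_mib * 1024**2
    num_files = total_bytes // file_size
    heap = namenode_heap_bytes(num_files, file_size)
    print(f"{size_mib:>5} MiB files: {num_files:>9} files, "
          f"~{heap / 1024**2:.0f} MiB of name node heap")
```

The estimate is only as good as the assumed per-object constant, but it
shows how the heap cost follows the number of files and blocks rather than
the number of bytes stored.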

A constructive approach may suggest I'd just have to allocate a large
server instance for the HDFS name node, which may be a first step on a path
towards learning the next bottleneck of using HDFS for such files, the hard /
long way.

Yes, I am aware HDFS has a dedicated API for handling small files, and
some community wrappers for managing small files, but they seem a bit
hackish, or feel like "too many moving parts" for a simple scenario.

What do you think, and what do you think about Ceph for this scenario?

Thanks in advance!
Matan

On Thu, Jul 31, 2014 at 7:20 PM, John Spray <john.spray at redhat.com> wrote:

> On Wed, Jul 30, 2014 at 5:08 PM, Matan Safriel <dev.matan at gmail.com>
> wrote:
> > I'm looking for a distributed file system, for large JSON documents. My
> > file sizes are roughly between 20M and 100M, so they are too large for
> > couchbase, mongodb, even possibly Riak, but too small (by an order of
> > magnitude) for HDFS. Would you recommend Ceph for this kind of scenario?
>
> When you say they're too small for HDFS, do you really mean they're
> too numerous?  How many are we talking about?
>
> If your use case calls for just puts and gets of named serialized
> blobs, you may be best off with the RGW or librados object store
> interfaces to Ceph, rather than the file system per se.
>
> > Additional question - will it also install and behave gracefully as a
> > single-node cluster running on a single linux machine, in a dev scenario
> > and/or a unit test machine scenario?
>
> Yes, that's how some of the ceph tests themselves operate.
>
> Cheers,
> John
>
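
For reference, the "puts and gets of named serialized blobs" John mentions
could look roughly like the sketch below with the Python librados bindings
(the `rados` module). The pool name, object name, config file path and the
helper names put_json / get_json / demo are hypothetical. It also ignores
any per-operation size limits, so the largest documents might need to be
written in chunks rather than with a single write_full call.

```python
import json


def put_json(ioctx, name, doc):
    """Serialize doc to JSON and store it as one object called name."""
    data = json.dumps(doc).encode("utf-8")
    ioctx.write_full(name, data)   # replaces the whole object if it exists
    return len(data)


def get_json(ioctx, name):
    """Read the object called name back and parse it as JSON."""
    size, _mtime = ioctx.stat(name)            # stat returns (size, mtime)
    return json.loads(ioctx.read(name, size).decode("utf-8"))


def demo():
    """Round-trip one document against a real cluster (not run on import)."""
    import rados                               # Python bindings for librados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed path
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("json-docs")            # hypothetical pool
        try:
            put_json(ioctx, "doc-0001", {"hello": "ceph"})
            print(get_json(ioctx, "doc-0001"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling demo() needs a reachable cluster and an existing pool; the two
helpers only rely on the write_full, stat and read methods of the I/O
context.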