Hi Greg,
On 10/17/2017 11:49 PM, Gregory Farnum wrote:
On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.horky@xxxxxxxxx> wrote:
Hi list,

we are thinking of building a relatively big CEPH-based object store for our sample files - we have about 700M files ranging from very small (1-4KiB) ones to pretty big ones (several GiB). The median file size is 64KiB. Since the required space is relatively large (1PiB of usable storage), we are thinking of utilizing erasure coding for this case. On the other hand, we need to achieve at least 1200MiB/s throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).
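
[For reference, a back-of-the-envelope sketch of the figures above, in Python. Only the 1PiB usable target, the 700M file count, the 64KiB median and the 4+2 EC layout come from the text; everything else is plain arithmetic.]

# Rough sizing sketch based on the figures above (4+2 EC, 1 PiB usable, 700M files).
KIB, PIB = 1024, 1024**5

usable = 1 * PIB
k, m = 4, 2                      # EC data and coding chunks
overhead = (k + m) / k           # 1.5x raw-to-usable, i.e. the quoted 50% overhead
raw = usable * overhead

files = 700_000_000
median_size = 64 * KIB           # median file size

print(f"raw capacity needed                 : {raw / PIB:.2f} PiB")
print(f"space if every file were median-size: {files * median_size / PIB:.3f} PiB")

[Roughly: 1.5 PiB raw, and the median-sized files account for only about 0.04 PiB of it - so the small files contribute little of the capacity but most of the object count, which is why their per-object IOPS cost matters so much.]
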
Since the EC is per-object, the small objects will be striped into even smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single object in this scenario -> the number of required IOPS when using EC is relatively high. Some vendors (such as Hitachi, but I believe EMC as well) do offline, predefined-chunk-size EC instead. The idea is to first write objects with a replication factor of 3, wait for enough objects to fill 4x 64MiB chunks, and only then do EC on that. This not only makes the EC less computationally intensive and repairs much faster, it also allows reading the majority of the small objects directly by reading just part of one of the chunks (assuming a non-degraded state) - one chunk actually contains the whole object.
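
[To make the IOPS argument concrete, a small sketch. The 1200MiB/s read target and the 64KiB median come from the text; the "one partial read per small object" figure for the packed case is the vendor approach being described, not something Ceph does today.]

# Very rough IOPS estimate for reads of median-sized (64 KiB) objects.
KIB, MIB = 1024, 1024**2

target_throughput = 1200 * MIB                        # required read throughput, bytes/s
median_size = 64 * KIB

objects_per_sec = target_throughput / median_size     # ~19,200 object reads/s

# Per-object 4+2 EC: each read touches (at least) the 4 data chunks.
iops_per_object_ec = 4 * objects_per_sec              # ~76,800 IOPS

# Packed/offline EC as described above: a small object sits entirely inside
# one 64 MiB chunk, so a non-degraded read is a single partial read.
iops_packed = 1 * objects_per_sec                     # ~19,200 IOPS

print(f"object reads/s      : {objects_per_sec:,.0f}")
print(f"IOPS, per-object EC : {iops_per_object_ec:,.0f}")
print(f"IOPS, packed EC     : {iops_packed:,.0f}")
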
I wonder if something similar is already possible with CEPH and/or is planned. For our use case of very small objects, it would mean a near 3-4x improvement in terms of required IOPS.
Another way out of this situation would be the ability to specify different storage pools/policies based on file size - i.e. to do 3x replication of the very small files and only use EC for bigger files, where the performance hit of 4x the IOPS won't be that painful. But I am afraid this is not possible...
Unfortunately, any logic like this would need to be handled in your application layer. Raw RADOS does not do object sharding or aggregation on its own.
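
[To illustrate what "handled in your application layer" could look like for the size-based routing idea above: a minimal client-side sketch with the Python rados bindings, assuming two pre-created pools. The pool names and the 1MiB cut-off are made up for the example.]

import rados

SMALL_OBJECT_LIMIT = 1 * 1024 * 1024    # assumed cut-off: <= 1 MiB goes to the replicated pool
REPLICATED_POOL = "samples-small-rep3"  # hypothetical 3x replicated pool
EC_POOL = "samples-big-ec42"            # hypothetical 4+2 EC pool

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def store_sample(name, data):
    """Write small objects to the replicated pool, big ones to the EC pool."""
    pool = REPLICATED_POOL if len(data) <= SMALL_OBJECT_LIMIT else EC_POOL
    ioctx = cluster.open_ioctx(pool)
    try:
        ioctx.write_full(name, data)
    finally:
        ioctx.close()
    return pool   # the caller must remember (or re-derive) where the object went

[The obvious catch is that the application then has to know, or record somewhere, which pool each object landed in before it can read it back.]
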
CERN did contribute libradosstriper, which will break down your multi-gigabyte objects into more typical sizes, but a generic system for packing many small objects into larger ones is tough; the choices depend so much on likely access patterns and such.
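
[Purely as an illustration of the "packing many small objects into larger ones" idea: one shape such an application-layer packer might take, assuming a single writer, an in-memory index, and a hypothetical EC pool named samples-bulk-ec42. A real one would need a durable index, concurrency control, and attention to EC append alignment, all of which are glossed over here.]

import rados

BULK_SIZE = 64 * 1024 * 1024   # assumed bulk-object target size, matching the 64 MiB chunks above
index = {}                     # sample name -> (bulk object, offset, length); in-memory only here

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("samples-bulk-ec42")   # hypothetical pool holding the bulk objects

current_bulk, current_off = "bulk-0000000", 0

def put_small(name, data):
    """Append a small sample to the current bulk object and record where it landed."""
    global current_bulk, current_off
    if current_off + len(data) > BULK_SIZE:    # start a new bulk object when the current one is full
        current_bulk = "bulk-%07d" % (int(current_bulk.split("-")[1]) + 1)
        current_off = 0
    ioctx.append(current_bulk, data)
    index[name] = (current_bulk, current_off, len(data))
    current_off += len(data)

def get_small(name):
    """Read one small sample back with a single partial read of its bulk object."""
    bulk, off, length = index[name]
    return ioctx.read(bulk, length, off)
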
I would definitely recommend working out something like that, though!
-Greg
this is unfortunate. I believe that for storage of small objects, this would be a deal breaker. Hitachi claims they can do 20+6 erasure coding when using predefined-size EC, which is hardly imaginable with the current CEPH implementation. Actually, I am afraid that for us the lack of this feature means we would buy an object store instead of building it on open source technology :-/
From the technical side, I don't see why the access pattern of such objects would change the storage strategy. If you left the bulk block size configurable, that should be enough, shouldn't it?
Regards
Jiri Horky
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com