Understanding EC properties for CephFS / small files.

Hi List.

I'm trying to understand the nuts and bolts of EC / CephFS.
We're running an EC 4+2 pool on top of 72 x 7.2K rpm 10TB drives - pretty
slow bulk / archive storage.

# getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
getfattr: Removing leading '/' from absolute path names
# file: mnt/home/cluster/mysqlbackup
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
pool=cephfs_data_ec42"

This configuration is taken directly from the online documentation
(which may be where it all went wrong from our perspective):

http://docs.ceph.com/docs/master/cephfs/file-layouts/

OK, does this mean that a 16MB file will be split into 4 chunks of 4MB
each, plus 2 erasure-coding chunks? I don't really understand the
stripe_count element.
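
To make my reading concrete, here is the back-of-the-envelope arithmetic
I'm working from - a sketch of my assumption (that the object_size pieces
map 1:1 onto the k data chunks), not of verified behaviour:

MB = 1024 * 1024
object_size = 4 * MB      # from ceph.dir.layout above
k, m = 4, 2               # the 4+2 profile: 4 data + 2 coding chunks

file_size = 16 * MB
data_chunks = file_size // object_size        # 4 chunks of 4MB, in my reading
raw_used = (data_chunks + m) * object_size    # (4 + 2) * 4MB
print(data_chunks, raw_used // MB)            # -> 4 24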

And since erasure coding works at the object level, striping individual
objects across - here 4 data chunks - it'll end up filling 16MB? Or
is there an internal optimization causing this not to be the case?

Additionally, when reading the file, all 4 chunks need to be read to
assemble the object, causing (at a minimum) 4 IOPS per file.

Now, my common file size is < 8MB, and 512KB files are typical on
this pool.

Will that cause a 512KB file to be padded to 4MB, with 3 empty chunks
to fill out the erasure-coding profile and then 2 coding chunks on top -
24MB in total for storing 512KB?
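
In other words, under that padding assumption:

KB = 1024
MB = 1024 * KB
object_size = 4 * MB
k, m = 4, 2

file_size = 512 * KB
# Assumption: the file pads out all k data chunks to a full object_size,
# and the m coding chunks are the same size again.
raw_used = (k + m) * object_size           # 24MB
amplification = raw_used / file_size       # 48x
replica_used = 3 * file_size               # 1.5MB under 3x replication
print(raw_used // MB, amplification, replica_used // KB)   # -> 24 48.0 1536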

And when reading it, will I hit 4 random IOs to read 512KB, or can
it optimize around not reading the "empty" chunks?

If this is true, then I would be way better off with 3x replication,
both performance-wise and space/cost-wise.

Or is it less bad than my reasoning here suggests?

If the math holds, then we can begin to calculate the chunk sizes and
EC profiles at which EC begins to deliver benefits.
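
If the padding assumption above is right, the space break-even against
3x replication falls out directly (the result is only as good as the
assumption):

MB = 1024 * 1024
k, m, object_size = 4, 2, 4 * MB
# Under the padding assumption, EC only wins on space once
# 3 * file_size exceeds the fixed (k + m) * object_size footprint:
break_even = (k + m) * object_size / 3
print(break_even / MB)    # -> 8.0 (MB)

If that 8MB figure holds, my common file size of < 8MB puts practically
everything on this pool on the wrong side of the line.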

In terms of IO, it seems like I'll always suffer a 4x penalty on read
IOPS on a 4+2 EC pool, compared to 3x replication.
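
As a rough sketch of what that would mean for the tape run in the
side-note below - assuming a ballpark of ~100 random read IOPS per
7.2K rpm spindle, which is my guess rather than a measured number:

drives = 72
iops_per_drive = 100    # assumed ballpark for a 7.2K rpm drive, not measured
k = 4                   # reads needed per file/object on the 4+2 pool
files_per_second = drives * iops_per_drive / k
print(files_per_second) # -> 1800.0, vs ~7200 if a single read sufficed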

Side-note: I'm trying to get bacula (tape backup) to read off my archive
to tape at a reasonable time/speed.

Thanks in advance.

-- 
Jesper
