File layouts in CephFS and erasure coding are unrelated; they happen on completely different layers. CephFS splits files into multiple 4 MB RADOS objects by default, and this is completely independent of how RADOS then stores these 4 MB (or smaller) objects.

For your examples:

16 MB file -> 4x 4 MB objects -> 4x 4x 1 MB data chunks, 4x 2x 1 MB coding chunks
512 kB file -> 1x 512 kB object -> 4x 128 kB data chunks, 2x 128 kB coding chunks

You'll run into different problems once the erasure-coded chunks end up smaller than 64 kB each, due to BlueStore's minimum allocation size and general metadata overhead, which makes erasure coding a bad fit for very small files. (A rough sketch of the chunk arithmetic follows after the quoted message below.)

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Feb 17, 2019 at 8:11 AM <jesper@xxxxxxxx> wrote:
>
> Hi List.
>
> I'm trying to understand the nuts and bolts of EC / CephFS.
> We're running an EC 4+2 pool on top of 72 x 7.2K rpm 10 TB drives. Pretty
> slow bulk / archive storage.
>
> # getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/home/cluster/mysqlbackup
> ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
> pool=cephfs_data_ec42"
>
> This configuration is taken directly out of the online documentation
> (which may be where it all went wrong from our perspective):
>
> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> OK, this means that a 16 MB file will be split into 4 chunks of 4 MB each,
> with 2 erasure coding chunks? I don't really understand the stripe_count
> element.
>
> And since erasure coding works at the object level, striping individual
> objects across - here 4 replicas - it'll end up filling 16 MB? Or
> is there an internal optimization causing this not to be the case?
>
> Additionally, when reading the file, all 4 chunks need to be read to
> assemble the object, causing (at a minimum) 4 IOPS per file.
>
> Now, my common file size is < 8 MB, and 512 KB files are common on
> this pool.
>
> Will that cause a 512 KB file to be padded to 4 MB with 3 empty chunks
> to fill the erasure coding profile, and then 2 coding chunks on top?
> In total 24 MB for storing 512 KB?
>
> And when reading it, will I hit 4 random IOs to read 512 KB, or can
> it optimize around not reading "empty" chunks?
>
> If this is true, then I would be both performance- and space/cost-wise
> way better off with 3x replication.
>
> Or is it less bad than what I arrive at here?
>
> If the math holds, then we can begin to calculate chunk sizes and
> EC profiles for when EC begins to deliver benefits.
>
> In terms of IO, it seems like I'll always suffer a 1:4 ratio on IOPS in
> a reading scenario on a 4+2 EC pool, compared to 3x replication.
>
> Side note: I'm trying to get Bacula (tape backup) to read off my archive
> to tape at a "reasonable time/speed".
>
> Thanks in advance.
>
> --
> Jesper
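
To make the chunk arithmetic above concrete, here is a rough back-of-the-envelope sketch in plain Python (not anything from Ceph itself; it just assumes the default 4 MiB object size and the k=4, m=2 profile from the examples) that reproduces both cases:

import math

MiB = 1024 * 1024

def ec_layout(file_size, object_size=4 * MiB, k=4, m=2):
    """Map a file onto RADOS objects, then each object onto k data + m coding chunks."""
    objects = max(1, math.ceil(file_size / object_size))
    result = []
    for i in range(objects):
        # The last (or only) object may be smaller than the full object size.
        obj_bytes = min(object_size, file_size - i * object_size)
        chunk = math.ceil(obj_bytes / k)      # size of each data/coding chunk
        result.append({
            "object_bytes": obj_bytes,
            "chunk_bytes": chunk,
            "raw_bytes": chunk * (k + m),     # before min_alloc_size rounding
        })
    return result

for size in (16 * MiB, 512 * 1024):
    print(size, ec_layout(size))

For a 16 MiB file this gives four 4 MiB objects, each stored as six 1 MiB chunks; for a 512 KiB file it gives a single object stored as six 128 KiB chunks. The raw_bytes figure ignores the BlueStore minimum allocation size mentioned above, which is what makes very small files disproportionately expensive on an EC pool.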