Re: cephfs file layouts, empty objects in first data pool

I was also confused by this topic and had intended to post a question this week.  The documentation I recall reading said something like 'if you want to use erasure coding on a CephFS, you should use a small replicated data pool as the first pool, and your erasure-coded pool as the second.'  I did not see any obvious indication of how this would 'auto-magically' put the small files in the replicated pool and the large files in the erasure-coded pool, although that sounds like desirable behavior.  Instead I found the notes on 'file layouts', which do not seem to be able to use size as a criterion.
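
For what it's worth, my current reading is that the replicated first pool is not meant to hold small files at all; it is just the default data pool (which, per the thread below, holds some per-file metadata), and the actual file data gets pointed at the erasure-coded pool with a layout on the root directory.  A rough sketch of that setup, where the pool, filesystem, and mount-point names are just placeholders of mine:

    # assumes the metadata pool 'cephfs_metadata' already exists
    ceph osd pool create cephfs_data_rep 64
    ceph osd pool create cephfs_data_ec 64 64 erasure
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true
    ceph fs new myfs cephfs_metadata cephfs_data_rep
    ceph fs add_data_pool myfs cephfs_data_ec
    # send all file data below the mount root to the EC pool
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs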

Does anybody have anything further to add that would help clarify this?

Thanks.

-Dave

Dave Hall
Binghamton University

On 2/10/20 1:26 PM, Gregory Farnum wrote:
On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson <f96hajo@xxxxxxxxxxx> wrote:

On Mon, 10 Feb 2020, Gregory Farnum wrote:

On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson <f96hajo@xxxxxxxxxxx> wrote:

       Hi,

       running 14.2.6, debian buster (backports).

       Have set up a cephfs with 3 data pools and one metadata pool:
       myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.

       The data of all files is stored, via ceph.dir.layout.pool, in either
       the myfs_data_hdd or the myfs_data_ssd pool.  This has also been
       checked by dumping the ceph.file.layout.pool attribute of all files.
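
       For reference, the layouts were set and checked roughly like this
       (the directory and file paths here are just examples):

         setfattr -n ceph.dir.layout.pool -v myfs_data_ssd /mnt/myfs/fast
         setfattr -n ceph.dir.layout.pool -v myfs_data_hdd /mnt/myfs/bulk
         getfattr -n ceph.file.layout.pool /mnt/myfs/fast/somefile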

       The filesystem has 1617949 files and 36042 directories.

       There are, however, approximately as many objects in the first pool
       created for the cephfs, myfs_data, as there are files.  Their number
       also grows and shrinks as files are created or deleted (so they cannot
       be leftovers from earlier exercises).  Note how the USED size is
       reported as 0 bytes, correctly reflecting that no file data is stored
       in them.

       POOL_NAME        USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD   WR_OPS      WR USED COMPR UNDER COMPR
       myfs_data         0 B 1618229      0 4854687                  0       0        0 2263590 129 GiB 23312479 124 GiB        0 B         0 B
       myfs_data_hdd 831 GiB  136309      0  408927                  0       0        0  106046 200 GiB   269084 277 GiB        0 B         0 B
       myfs_data_ssd  43 GiB 1552412      0 4657236                  0       0        0  181468 2.3 GiB  4661935  12 GiB        0 B         0 B
       myfs_metadata 1.2 GiB   36096      0  108288                  0       0        0 4828623  82 GiB  1355102 143 GiB        0 B         0 B

       Is this expected?

       I was assuming that in this scenario all objects, both their data and
       any keys, would be either in the metadata pool or in the two pools
       where the file data is stored.

       Are some additional metadata keys stored in the first data pool
       created for the cephfs?  That would not be so nice if the OSD
       selection rules for that pool use worse disks than those for the data
       itself...


https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-to-the-mds notes there is “a small amount of metadata” kept in the primary pool.
Thanks!  I managed to miss this, probably because it was at the bottom of
the page.  If one wants to use layouts to separate fast (likely many small)
files from slow (likely large) ones, it then sounds as though the primary
pool should be of the fast kind too, due to the large number of objects.
This should therefore be highlighted earlier in that documentation.

That’s not terribly clear; what is actually stored is a per-file location backtrace (its location in the directory tree) used for hardlink lookups and disaster recovery
scenarios.
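
If you are curious, you can pull one of those backtraces off an object and decode it; something like the following should work (the object name is just an example, take a real one from 'rados -p myfs_data ls'):

  rados -p myfs_data getxattr 10000000000.00000000 parent > parent.bin
  ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json
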
It would be good to add this info to the manual page; it is helpful to know
what kind of information is stored there.
Yeah, PRs welcome. :p
Just to be clear though, that shouldn't be performance-critical. It's
lazily updated by the MDS when the directory location changes, but not
otherwise.

Again thanks for the clarification!

       Btw: is there any tool to see the amount of key-value data associated
       with a pool?  'ceph osd df' gives omap and meta per OSD, but not
       broken down per pool.


I think this is in the newest master code, but I’m not certain which release it’s in...
Would it then (when available) also be in the 'rados df' command?
I really don't remember how everything is shared out but I think so?
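
For what it's worth, I believe the per-pool breakdown should also end up in the 'ceph df detail' output, which on new enough code splits each pool's STORED and USED figures into data and omap portions:

  # per-pool data vs. omap usage, on releases that have the new accounting
  ceph df detail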

Best regards,
Håkan


-Greg



       Best regards,
       Håkan
       _______________________________________________
       ceph-users mailing list -- ceph-users@xxxxxxx
       To unsubscribe send an email to ceph-users-leave@xxxxxxx



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



