Re: Spurious empty files in CephFS root pool when multiple pools associated

On Tue, Jul 3, 2018 at 11:53 AM Jesus Cea <jcea@xxxxxxx> wrote:
>
> Hi there.
>
> I have an issue with cephfs and multiple data pools inside it. I have six
> data pools inside the cephfs, and I control where files are stored using
> xattrs on the directories.
>
> The "root" directory only contains directories with "xattrs" requesting
> new objects to be stored in different pools. So good so far. The problem
> is that "root" datapool has a ghost file per file inside the cephfs,
> even when the object is actually stored in a different datapool. Taking
> no space at all, but counting against the number of objects in the cephfs.
>
> The files are empty (0 bytes) but they have xattrs saying in what pool
> the object is actually stored.
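
For anyone following along: that per-directory placement is done with the
CephFS file layout xattrs.  A minimal sketch, assuming a client mount at
/mnt/cephfs and a directory meant to use the black_1 pool (the paths here
are just illustrative):

    # send new files created under this directory to the black_1 data pool
    setfattr -n ceph.dir.layout.pool -v black_1 /mnt/cephfs/black1
    # check the layout that new files will inherit
    getfattr -n ceph.dir.layout /mnt/cephfs/black1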

Right: as you've noticed, they're not spurious, they're where we keep
a "backtrace" xattr for a file.

Backtraces are lazily updated paths that enable CephFS to map an
inode number back to a file's metadata, which is needed when resolving
hard links or NFS file handles.  The trouble with putting the backtrace
only in the individual data pools is that the MDS would have to scan
through all the pools to find it, so instead every file gets a backtrace
in the root data pool.
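
If you want to see one of these backtraces for yourself, it lives in the
"parent" xattr of an object named after the file's inode number (in hex)
followed by ".00000000".  A rough sketch, assuming a mounted client and
your root data pool name; the file path and inode value are illustrative:

    # inode number of a file whose data lives in one of the black_* pools
    stat -c %i /mnt/cephfs/black1/somefile      # e.g. 1099511627776
    printf '%x\n' 1099511627776                 # -> 10000000000
    # fetch and decode the backtrace from the otherwise-empty object
    rados -p cephfsROOT_data getxattr 10000000000.00000000 parent > /tmp/parent
    ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json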

> Should this data be stored in the metadata pool?

Yes, probably.  As you say, it's ugly how we end up with these extra
objects in a multi-data-pool setup.  However, the metadata pool isn't
always the more efficient location: when there is only a single data pool,
"piggy-backing" the backtrace onto the existing data object keeps the
overall object count down.

The backtrace in the data pool has a secondary purpose: disaster
recovery.  If the metadata pool is damaged or lost entirely, the
backtrace enables cephfs-data-scan to do a pretty good job of
reconstructing an approximate filesystem tree that links back to the
files.
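
For reference, that recovery path is the cephfs-data-scan tool.  A very
rough sketch of the documented sequence, assuming the filesystem has
already been taken offline and using your root data pool name (read the
disaster recovery docs before running any of this against a real cluster):

    # rebuild metadata from the objects and backtraces in the data pool
    cephfs-data-scan init
    cephfs-data-scan scan_extents cephfsROOT_data
    cephfs-data-scan scan_inodes cephfsROOT_data
    cephfs-data-scan scan_links

With several data pools the scans need to cover each of them; see the
disaster recovery documentation for the exact invocation.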

One option would be to do both by default: write the backtrace to the
metadata pool for its ordinary functional lookup purpose, but also
write it back to the data pool as an intentionally redundant
resilience measure.  The extra write to the data pool could be
disabled by anyone who wants to save the IOPS at the cost of some
resilience.

John

> By comparison, my metadata pool is 244 MB in size, and it was basically
> the same size when I had no objects as it is now, with 1.3 million
> objects: ~250 MB.
>
>     NAME                    ID      USED    %USED   MAX AVAIL   OBJECTS
>     cephfsROOT_data         70         0        0        354G   1332929
>     cephfsROOT_metadata     71      244M     0.07        354G      1625
>     black_1                 72      944G    52.58        851G    241736
>     black_2                 73      944G    52.59        851G    241744
>     black_3                 74      953G    52.82        851G    243990
>     black_4                 75      934G    52.33        851G    239243
>     black_5                 76      944G    52.59        851G    241814
>     black_6                 77      531G    38.44        851G    136081
>
> The black_* pools are associated with the cephfs via "ceph fs add_data_pool XXX".
> For each object created inside a black_* pool, a ghost empty object is
> created in "cephfsROOT_data".
>
> That huge number of files is causing other issues, like the warning "1
> pools have many more objects per pg than average".
>
> About that warning: I changed the value of "mon_pg_warn_max_object_skew"
> and the warning is still there. Asking the monitors about it, they show
> the new value, but asking from a client, it still shows "10":
>
> root@jcea:/var/run/ceph# ceph --admin-daemon ceph-client.jcea.asok --id
> jcea config show|grep skew
>     "mon_timecheck_skew_interval": "30",
>     "mon_pg_warn_max_object_skew": "10",
>
> I don't know how to proceed.
>
> --
> Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
> jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
> Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
> jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
> "Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
> "My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



