Re: Spurious empty files in CephFS root pool when multiple pools associated

On Tue, Jul 3, 2018 at 12:24 PM Jesus Cea <jcea@xxxxxxx> wrote:
>
> On 03/07/18 13:08, John Spray wrote:
> > Right: as you've noticed, they're not spurious, they're where we keep
> > a "backtrace" xattr for a file.
> >
> > Backtraces are lazily updated paths that enable CephFS to map an
> > inode number to a file's metadata, which is needed when resolving hard
> > links or NFS file handles.  The trouble with keeping the backtrace in
> > the individual data pools is that the MDS would have to scan through
> > the pools to find it, so instead all the files get a backtrace in the
> > root pool.
>
> Given this, the warning "1 pools have many more objects per pg than
> average" will happen ALWAYS. Is there any plan to add some kind of
> special case for this situation, or will a "mon_pg_warn_max_object_skew"
> override be needed forever?

The "more objects per pg than average" warning is based on the idea
that there is some approximate ratio of objects to PGs that is
desirable, but Ceph doesn't know what that ratio is.  So it assumes
that you've got roughly the right ratio overall, and flags any pool
that is 10x denser than that average as a problem.
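
Roughly, the check works like this (an illustrative sketch, not the
actual health-check code -- the pool names and numbers are made up):

    # Illustrative sketch of the objects-per-PG skew check.
    pools = {
        # pool name: (object count, pg_num) -- made-up numbers
        "cephfs_metadata": (100000, 64),
        "cephfs_data_big": (500000, 256),
        "cephfs_data_root": (2000000, 16),  # one backtrace object per file
    }

    skew = 10.0  # default mon_pg_warn_max_object_skew

    total_objects = sum(o for o, _ in pools.values())
    total_pgs = sum(p for _, p in pools.values())
    average = total_objects / float(total_pgs)

    for name, (objects, pgs) in pools.items():
        per_pg = objects / float(pgs)
        if per_pg > skew * average:
            print("%s: %.0f objects/PG vs cluster average %.0f -- warn"
                  % (name, per_pg, average))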

To directly address that warning rather than silencing it, you'd
increase the number of PGs in your primary data pool.
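
On the CLI that's "ceph osd pool set <pool> pg_num <n>" (and then the
same for pgp_num; also remember that pg_num can't be reduced again, so
pick the target carefully).  If you'd rather script it, here's an
untested sketch using the python-rados mon_command interface --
"cephfs_data" and 256 are placeholders for your pool and target:

    import json
    import rados

    # Untested sketch: raise pg_num, then pgp_num, on the primary CephFS
    # data pool.  "cephfs_data" and 256 are placeholders.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        for var in ('pg_num', 'pgp_num'):
            cmd = json.dumps({'prefix': 'osd pool set',
                              'pool': 'cephfs_data',
                              'var': var,
                              'val': '256'})
            # The pgp_num step may be refused until the new PGs finish
            # creating, so check the return code and message.
            ret, outbuf, outs = cluster.mon_command(cmd, b'')
            print(var, ret, outs)
    finally:
        cluster.shutdown()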

There's a conflict here between pools with lots of data (where the MB
per PG might be the main concern, not the object count) and
metadata-ish pools (where the object count per PG is the main
concern).  Maybe it doesn't really make sense to group them all
together when calculating the average objects-per-PG figure that's used
in this health warning -- I'll bring that up over on ceph-devel in a
moment.
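
As an aside, if anyone wants to see what those zero-byte objects
actually contain, something like this (untested, python-rados) will
pull the backtrace xattr off one of them.  "cephfs_data" here is a
placeholder for the root/default data pool, and the value is a
binary-encoded backtrace rather than anything human-readable (I think
ceph-dencoder can decode it as inode_backtrace_t if you're curious):

    import rados

    # Untested sketch: fetch the backtrace xattr ("parent") from one of
    # the zero-byte ".00000000" objects in the root data pool.  Listing
    # every object can be slow on a big pool; this stops at the first hit.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('cephfs_data')
    try:
        for obj in ioctx.list_objects():
            if not obj.key.endswith('.00000000'):
                continue
            try:
                backtrace = ioctx.get_xattr(obj.key, 'parent')
            except rados.NoData:
                continue
            print('%s: %d bytes of encoded backtrace'
                  % (obj.key, len(backtrace)))
            break
    finally:
        ioctx.close()
        cluster.shutdown()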

John

>
> >> Should this data be stored in the metadata pool?
> >
> > Yes, probably.  As you say, it's ugly how we end up with these extra
> [...]
> > One option would be to do both by default: write the backtrace to the
> > metadata pool for its ordinary functional lookup purpose, but also
> > write it back to the data pool as an intentionally redundant
> > resilience measure.  The extra write to the data pool could be
> > disabled by anyone who wants to save the IOPS at the cost of some
> > resilience.
> This would be nice. Another option would be to simply use fewer
> objects, but I guess that could be a major change.
>
> Actually, my main issue here is the warning "1 pools have many more
> objects per pg than average". My cluster is permanently in WARNING
> state, with the known consequences, and my "mon_pg_warn_max_object_skew"
> override is not working, for some reason. I am using Ceph 12.2.5.
>
>
> Thanks for your time and expertise!
>
> --
> Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
> jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
> Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
> jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
> "Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
> "My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
> "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



