mon_pg_warn_max_object_skew logic (was: [ceph-users] Spurious empty files in CephFS root pool when multiple pools associated)

On Tue, Jul 3, 2018 at 12:46 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Tue, Jul 3, 2018 at 12:24 PM Jesus Cea <jcea@xxxxxxx> wrote:
> >
> > On 03/07/18 13:08, John Spray wrote:
> > > Right: as you've noticed, they're not spurious, they're where we keep
> > > a "backtrace" xattr for a file.
> > >
> > > Backtraces are lazily updated paths, that enable CephFS to map an
> > > inode number to a file's metadata, which is needed when resolving hard
> > > links or NFS file handles.  The trouble with the backtrace in the
> > > individual data pools is that the MDS would have to scan through the
> > > pools to find it, so instead all the files get a backtrace in the root
> > > pool.
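> > >
> > > As a concrete illustration: the backtrace is stored in a "parent"
> > > xattr on the file's first object in the root data pool. Assuming a
> > > root data pool called "cephfs_data" and a file with inode number
> > > 0x10000000000 (both just examples), something like this should dump
> > > the stored ancestor path:
> > >
> > >   rados -p cephfs_data getxattr 10000000000.00000000 parent > bt.bin
> > >   ceph-dencoder type inode_backtrace_t import bt.bin decode dump_json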
> >
> > Given this, the warning "1 pools have many more objects per pg than
> > average" will ALWAYS happen. Is there any plan to special-case this
> > situation, or will a "mon_pg_warn_max_object_skew" override be needed
> > forever?
>
> The "more objects per pg than average" warning is based on the idea
> that there is some approximate ratio of objects to PGs that is
> desirable, but Ceph doesn't know what that ratio is, so Ceph is
> assuming that you've got roughly the right ratio overall, and any pool
> 10x denser than that is a problem.
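>
> (To see where your pools actually stand, the inputs are visible from
> the CLI -- object counts per pool from "ceph df", PG counts from
> "ceph osd pool get"; the pool name below is just an example:
>
>   ceph df                               # OBJECTS column per pool
>   ceph osd pool get cephfs_data pg_num
>
> The warning fires when one pool's objects-per-PG ratio exceeds
> mon_pg_warn_max_object_skew -- default 10 -- times the overall
> average.)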
>
> To directly address that warning rather than silencing it, you'd
> increase the number of PGs in your primary data pool.
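>
> For example, assuming the primary data pool is called "cephfs_data"
> and 256 is a sensible target for the cluster in question:
>
>   ceph osd pool set cephfs_data pg_num 256
>   ceph osd pool set cephfs_data pgp_num 256
>
> (Bearing in mind that pg_num can be increased but not decreased on
> current releases, so pick the target carefully.)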
>
> There's a conflict here between pools with lots of data (where the MB
> per PG might be the main concern, not the object count) and
> metadata-ish pools (where the object count per PG is the main
> concern).  Maybe it doesn't really make sense to group them all
> together when calculating the average objects-per-PG figure used in
> this health warning -- I'll bring that up over on ceph-devel in a
> moment.

Migrating this topic from ceph-users for more input -- am I talking sense?

It seems wrong that we would look at the average object count per PG
of pools containing big objects, and use it to validate the object
count per PG of pools containing tiny objects.
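
For reference, my understanding of the current logic, as a Python
sketch (not the actual code -- the real check lives in the mon/mgr
C++):

  # pools: name -> (object_count, pg_count)
  def object_skew_warnings(pools, max_object_skew=10.0):
      total_objects = sum(objs for objs, _ in pools.values())
      total_pgs = sum(pgs for _, pgs in pools.values())
      if not total_objects or not total_pgs:
          return []
      avg_objects_per_pg = total_objects / total_pgs
      return [name for name, (objs, pgs) in pools.items()
              if pgs and objs / pgs > max_object_skew * avg_objects_per_pg]

  # One object-dense pool trips the warning even though its byte usage
  # may be tiny, because the average is dominated by the other pool:
  pools = {"cephfs_data": (1000000, 1024),   # big objects, many PGs
           "cephfs_metadata": (500000, 16)}  # tiny objects, few PGs
  print(object_skew_warnings(pools))         # -> ['cephfs_metadata']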

John

>
> John
>
> >
> > >> Should this data be stored in the metadata pool?
> > >
> > > Yes, probably.  As you say, it's ugly how we end up with these extra
> > [...]
> > > One option would be to do both by default: write the backtrace to the
> > > metadata pool for its ordinary functional lookup purpose, but also
> > > write it back to the data pool as an intentionally redundant
> > > resilience measure.  The extra write to the data pool could be
> > > disabled by anyone who wants to save the IOPS at the cost of some
> > > resilience.
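> > > (The off-switch could be a hypothetical MDS config option, e.g.
> > > something like "mds store backtrace in data pool = false" in the
> > > [mds] section -- name invented here purely for illustration.)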
> > This would be nice. Another option would be simply to use fewer
> > objects, but I guess that could be a major change.
> >
> > Actually, my main issue here is the warning "1 pools have many more
> > objects per pg than average". My cluster is permanently in WARNING
> > state, with the known consequences, and my "mon_pg_warn_max_object_skew"
> > override is not working, for some reason. I am using Ceph 12.2.5.
> >
> >
> > Thanks for your time and expertise!
> >
> > --
> > Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
> > jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
> > Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
> > jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
> > "Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
> > "My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
> > "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
> >