Re: mon_pg_warn_max_object_skew logic (was: [ceph-users] Spurious empty files in CephFS root pool when multiple pools associated)

On Tue, 3 Jul 2018, John Spray wrote:
> On Tue, Jul 3, 2018 at 12:46 PM John Spray <jspray@xxxxxxxxxx> wrote:
> >
> > On Tue, Jul 3, 2018 at 12:24 PM Jesus Cea <jcea@xxxxxxx> wrote:
> > >
> > > On 03/07/18 13:08, John Spray wrote:
> > > > Right: as you've noticed, they're not spurious, they're where we keep
> > > > a "backtrace" xattr for a file.
> > > >
> > > > Backtraces are lazily updated paths that enable CephFS to map an
> > > > inode number to a file's metadata, which is needed when resolving hard
> > > > links or NFS file handles.  The trouble with the backtrace in the
> > > > individual data pools is that the MDS would have to scan through the
> > > > pools to find it, so instead all the files get a backtrace in the root
> > > > pool.
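
For anyone who wants to poke at one of these objects directly, here is a
minimal python-rados sketch (python3). The pool name and inode number are
invented, and as far as I can tell the backtrace lives in the object's
"parent" xattr as an encoded inode_backtrace_t:

    import rados

    # Backtrace objects are named <inode in hex>.00000000 and carry no data;
    # the interesting bit is the xattr, not the object payload.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('cephfs_root_data')   # invented pool name
        try:
            raw = ioctx.get_xattr('10000000000.00000000', 'parent')
            print('backtrace xattr: %d bytes of encoded inode_backtrace_t'
                  % len(raw))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
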
> > >
> > > Given this, the warning "1 pools have many more objects per pg than
> > > average" will ALWAYS happen. Is there any plan to special-case this
> > > situation, or will a "mon_pg_warn_max_object_skew" override be needed
> > > forever?
> >
> > The "more objects per pg than average" warning is based on the idea
> > that there is some approximate ratio of objects to PGs that is
> > desirable, but Ceph doesn't know what that ratio is, so it assumes that
> > you've got roughly the right ratio overall and treats any pool 10x denser
> > than that as a problem.
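
Put another way, the check as described boils down to roughly the following
(a sketch of the idea, not the actual mon/mgr code; pool names and numbers
are invented, python3):

    def pools_with_object_skew(pools, max_object_skew=10.0):
        # pools: list of (name, num_objects, pg_num) tuples.
        # Flag any pool whose objects-per-PG exceeds max_object_skew times
        # the cluster-wide average objects-per-PG.
        total_objects = sum(o for _, o, _ in pools)
        total_pgs = sum(p for _, _, p in pools)
        if total_objects == 0 or total_pgs == 0:
            return []
        avg = total_objects / total_pgs
        return [name for name, o, p in pools
                if p and o / p > max_object_skew * avg]

    # A data pool full of big objects drags the average down, so a pool of
    # tiny backtrace objects with a modest pg_num gets flagged even though
    # nothing is wrong with it:
    print(pools_with_object_skew([('big_data', 1000000, 1024),
                                  ('cephfs_root_data', 1000000, 32)]))
    # -> ['cephfs_root_data']
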
> >
> > To directly address that warning rather than silencing it, you'd
> > increase the number of PGs in your primary data pool.
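
Concretely that is just the usual pg_num bump, e.g. (pool name invented; on
current releases pg_num can only be increased, since PG merging is still in
the works, and you have to raise pgp_num yourself as well):

    ceph osd pool set cephfs_root_data pg_num 256
    ceph osd pool set cephfs_root_data pgp_num 256
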
> >
> > There's a conflict here between pools with lots of data (where the MB
> > per PG might be the main concern, not the object count), vs.
> > metadata-ish pools (where the object count per PG is the main
> > concern).  Maybe it doesn't really make sense to group them all
> > together when calculating the average object-per-pg count that's used
> > in this health warning -- I'll bring that up over on ceph-devel in a
> > moment.
> 
> Migrating this topic from ceph-users for more input -- am I talking sense?
> 
> It seems wrong that we would look at the average object count per PG
> of pools containing big objects, and use it to validate the object
> count per PG of pools containing tiny objects.

Yeah, it's a pretty lame set of criteria for this warning most ways you 
look at it, I think.  This came up about a month ago on another ceph-users 
thread: https://www.spinics.net/lists/ceph-devel/msg41418.html

I wonder if we should either (1) increase the default value here by a big 
factor (5? 10?), or (2) remove this warning entirely since we're about to 
start auto-tuning PG counts anyway.
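
In the meantime, for anyone who just wants to quiet the warning (and per the
report upthread that the override did not seem to take effect on 12.2.5): as
far as I recall the check is evaluated by ceph-mgr on luminous, so the option
has to be visible to the mgr, e.g. something like this in ceph.conf on the
mgr node(s), followed by a restart of the active mgr:

    [mgr]
    # raise the threshold (default 10), or set it to 0, which I believe
    # disables this particular check entirely
    mon_pg_warn_max_object_skew = 50
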

sage



> 
> John
> 
> >
> > John
> >
> > >
> > > >> Should this data be stored in the metadata pool?
> > > >
> > > > Yes, probably.  As you say, it's ugly how we end up with these extra
> > > [...]
> > > > One option would be to do both by default: write the backtrace to the
> > > > metadata pool for its ordinary functional lookup purpose, but also
> > > > write it back to the data pool as an intentionally redundant
> > > > resilience measure.  The extra write to the data pool could be
> > > > disabled by anyone who wants to save the IOPS at the cost of some
> > > > resilience.
> > > This would be nice. Another option would be to simply use fewer objects,
> > > but I guess that could be a major change.
> > >
> > > Actually, my main issue here is the warning "1 pools have many more
> > > objects per pg than average". My cluster is permanently in WARNING
> > > state, with the known consequences, and my "mon_pg_warn_max_object_skew"
> > > override is not working, for some reason. I am using Ceph 12.2.5.
> > >
> > >
> > > Thanks for your time and expertise!
> > >
> > > --
> > > Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
> > > jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
> > > Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
> > > jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
> > > "Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
> > > "My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
> > > "El amor es poner tu felicidad en la felicidad de otro" - Leibniz
> > >
