I think that code was broken by ea723fbb88c69bd00fefd32a3ee94bf5ce53569c and
should be fixed like so:

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 8376a40668..12f468636f 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -1006,7 +1006,8 @@ void OSDMonitor::prime_pg_temp(
   int next_up_primary, next_acting_primary;
   next.pg_to_up_acting_osds(pgid, &next_up, &next_up_primary,
                             &next_acting, &next_acting_primary);
-  if (acting == next_acting && next_up != next_acting)
+  if (acting == next_acting &&
+      !(up != acting && next_up == next_acting))
     return;  // no change since last epoch
 
   if (acting.empty())

The original intent was to clear out pg_temps during priming, but as written
it would set a new_pg_temp item clearing the pg_temp even if one didn't
already exist. Adding the up != acting condition makes us take that path only
if there is an existing pg_temp entry to remove. (A standalone sketch of this
check is included at the end of the thread.)

Xie, does that sound right?

sage

On Thu, 3 Jan 2019, Sergey Dolgov wrote:
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
>
> I mean that all pgs have "up == acting && next_up == next_acting", but at
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009 the
> condition "next_up != next_acting" is false, so we clear acting for all pgs
> at https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1018
> and after that all pgs fall into the inc_osdmap.
> I think https://github.com/ceph/ceph/pull/25724 changes the behavior back to
> correct (as it was before commit
> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c)
> for pgs with up == acting && next_up == next_acting.
>
> On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> > On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
> >
> >> We investigated the issue and set debug_mon up to 20 during a small
> >> osdmap change; we get many messages like these for all pgs of each pool
> >> (for the whole cluster):
> >>
> >>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841],
> >>> priming []
> >>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
> >>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
> >>
> >> though no pg_temps are created as a result (not a single backfill).
> >>
> >> We suppose this behavior changed in commit
> >> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
> >> because earlier the function *OSDMonitor::prime_pg_temp* would return at
> >> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> >> like in jewel:
> >> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
> >>
> >> I accept that we may be mistaken.
> >
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
> >
> > What would probably be most helpful is if you can dump out one of those
> > over-large incremental osdmaps and see what's using up all the space. (You
> > may be able to do it through the normal Ceph CLI by querying the monitor?
> > Otherwise if it's something very weird you may need to get the
> > ceph-dencoder tool and look at it with that.)
> > -Greg
> >
> >> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum <gfarnum@xxxxxxxxxx>
> >> wrote:
> >>
> >>> Hmm that does seem odd. How are you looking at those sizes?
> >>>
> >>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
> >>>
> >>>> Greg, for example for our cluster of ~1000 osds:
> >>>>
> >>>> size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
> >>>> modified 2018-12-12 04:00:17.661731)
> >>>> size osdmap.1357882__0_F7FE772D__none = 363KB
> >>>> size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
> >>>> modified 2018-12-12 04:00:27.385702)
> >>>> size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
> >>>>
> >>>> The difference between epochs 1357881 and 1357883: the crush weight of
> >>>> one osd was increased by 0.01, so we get 5 new pg_temp entries in
> >>>> osdmap.1357883, but the size of inc_osdmap is huge.
> >>>>
> >>>> On Thu, Dec 6, 2018 at 06:20 Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>>> >
> >>>> > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov <palza00@xxxxxxxxx>
> >>>> > wrote:
> >>>> >>
> >>>> >> Hi guys
> >>>> >>
> >>>> >> I faced strange behavior on a crushmap change. When I change the
> >>>> >> crush weight of an osd I sometimes get an incremental osdmap (1.2MB)
> >>>> >> whose size is significantly bigger than the full osdmap (0.4MB).
> >>>> >
> >>>> > This is probably because when CRUSH changes, the new primary OSDs for
> >>>> > a PG will tend to set a "pg temp" value (in the OSDMap) that
> >>>> > temporarily reassigns it to the old acting set, so the data can be
> >>>> > accessed while the new OSDs get backfilled. Depending on the size of
> >>>> > your cluster, the number of PGs on it, and the size of the CRUSH
> >>>> > change, this can easily be larger than the rest of the map because it
> >>>> > is data with size linear in the number of PGs affected, instead of
> >>>> > being more normally proportional to the number of OSDs.
> >>>> > -Greg
> >>>> >
> >>>> >> I use luminous 12.2.8. The cluster was installed long ago; I suppose
> >>>> >> it was initially firefly.
> >>>> >> How can I view the content of an incremental osdmap, or can you give
> >>>> >> me your opinion on this problem? I think the spikes of traffic right
> >>>> >> after a crushmap change relate to this behavior.
> >>>> >> _______________________________________________
> >>>> >> ceph-users mailing list
> >>>> >> ceph-users@xxxxxxxxxxxxxx
> >>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>> --
> >>>> Best regards, Sergey Dolgov
> >>>
> >>
> >> --
> >> Best regards, Sergey Dolgov
> >
>
> --
> Best regards, Sergey Dolgov
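
For reference, below is a minimal standalone sketch of the check Sage's diff
introduces. should_prime_pg_temp is a hypothetical helper, not the actual
OSDMonitor code (which operates on the real OSDMap and pg_t types); it only
illustrates when the priming path is taken or skipped under the fixed
condition.

#include <iostream>
#include <vector>

// Hypothetical, simplified stand-in for the fixed check in
// OSDMonitor::prime_pg_temp; returns true if priming should touch
// pg_temp for this PG (prime a new entry or clear an existing one).
static bool should_prime_pg_temp(const std::vector<int>& up,
                                 const std::vector<int>& acting,
                                 const std::vector<int>& next_up,
                                 const std::vector<int>& next_acting)
{
  // Skip when the acting set is unchanged, unless an existing pg_temp
  // (up != acting) is made redundant by the next map
  // (next_up == next_acting) and should therefore be cleared.
  if (acting == next_acting &&
      !(up != acting && next_up == next_acting))
    return false;  // no change since last epoch
  return true;
}

int main()
{
  std::vector<int> a = {97, 812, 841};  // OSD ids borrowed from the log above
  std::vector<int> b = {5, 812, 841};

  // up == acting and next_up == next_acting: nothing to prime or clear.
  std::cout << should_prime_pg_temp(a, a, a, a) << "\n";  // prints 0

  // A pg_temp exists (up != acting) and becomes unnecessary: clear it.
  std::cout << should_prime_pg_temp(b, a, a, a) << "\n";  // prints 1
  return 0;
}

With the pre-fix condition (acting == next_acting && next_up != next_acting),
the first case would fall through and emit a clearing new_pg_temp item for
every unchanged pg, which matches the inc_osdmap bloat described earlier in
the thread.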