Re: size of inc_osdmap vs osdmap

Sergey Dolgov <palza00@xxxxxxxxx> · Thu, 3 Jan 2019 03:47:43 +0300

Thanks Greg,
I dumped the inc_osdmap to a file:
ceph-dencoder type OSDMap::Incremental import ./inc\\uosdmap.1378266__0_B7F36FFA__none decode dump_json  > inc_osdmap.txt
The 'new_pg_temp' structure contains 52330 PGs (the cluster has 52332 PGs), and for all of them the osds list is empty. For example, a short excerpt:

  {
    "osds": [],
    "pgid": "3.0"
  },
  {
    "osds": [],
    "pgid": "3.1"
  },
  {
    "osds": [],
    "pgid": "3.2"
  },
  {
    "osds": [],
    "pgid": "3.3"
  },
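
A minimal sketch for counting these entries in the JSON dump, rather than eyeballing it. It assumes the dump is a single JSON object with a top-level "new_pg_temp" array whose items carry the "pgid" and "osds" keys shown in the excerpt; the helper name summarize_new_pg_temp is made up for illustration.

```python
import json
from collections import Counter

def summarize_new_pg_temp(inc):
    """Return (total entries, entries with empty 'osds', empty entries per pool)."""
    entries = inc.get("new_pg_temp", [])
    empty = [e for e in entries if not e["osds"]]
    # A pgid looks like "<pool>.<seed>", e.g. "3.0"; group the empty ones by pool.
    per_pool = Counter(e["pgid"].split(".")[0] for e in empty)
    return len(entries), len(empty), dict(per_pool)

# Inline sample mirroring the excerpt above. For the real dump, load the file instead:
#   with open("inc_osdmap.txt") as f:
#       inc = json.load(f)
sample = json.loads(
    '{"new_pg_temp": ['
    '{"osds": [], "pgid": "3.0"},'
    '{"osds": [], "pgid": "3.1"},'
    '{"osds": [5, 7], "pgid": "4.0"}]}'
)
print(summarize_new_pg_temp(sample))
```

If nearly every entry comes back with an empty list, the incremental is mostly made of "clear pg_temp" records and not real temporary mappings.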

On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
We investigated the issue and set debug_mon to 20. During a small osdmap change we get many messages like these, for all PGs of every pool (across the whole cluster):
2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming []
2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
However, no pg_temps are created as a result (not a single backfill).
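
A rough sketch for tallying such lines from a monitor log. The regular expression is written against the sample lines above only. The group names (up, acting, next_up, next_acting) are my assumption about what the bracket pairs mean and are not taken from the Ceph source.

```python
import re

# Matches lines such as:
#   ... prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
# The "prime_pg_tempnext_up === ..." lines have no space after the function
# name, so they are skipped on purpose.
LINE_RE = re.compile(
    r"prime_pg_temp (?P<pgid>\d+\.[0-9a-f]+) "
    r"\[(?P<up>[\d,]*)\]/\[(?P<acting>[\d,]*)\] -> "
    r"\[(?P<next_up>[\d,]*)\]/\[(?P<next_acting>[\d,]*)\], "
    r"priming \[(?P<priming>[\d,]*)\]"
)

def count_empty_priming(lines):
    """Return (matching prime_pg_temp lines, those priming an empty list)."""
    total = empty = 0
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        total += 1
        if m.group("priming") == "":
            empty += 1
    return total, empty

sample_log = [
    "2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp",
    "2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming []",
    "2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []",
]
print(count_empty_priming(sample_log))
```

Comparing the second number against the total PG count shows whether the monitor is priming an empty pg_temp for essentially every PG.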

We suppose this behavior changed in commit https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c because previously the function OSDMonitor::prime_pg_temp should have returned at https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009, as it does in jewel at https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214

I accept that we may be mistaken.

Well those commits made some changes, but I'm not sure what about them you're saying is wrong?

What would probably be most helpful is if you can dump out one of those over-large incremental osdmaps and see what's using up all the space. (You may be able to do it through the normal Ceph CLI by querying the monitor? Otherwise if it's something very weird you may need to get the ceph-dencoder tool and look at it with that.)
-Greg

On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Hmm that does seem odd. How are you looking at those sizes?

On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
Greg, for example, for our cluster of ~1000 OSDs:

size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860, modified 2018-12-12 04:00:17.661731)
size osdmap.1357882__0_F7FE772D__none = 363KB
size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861, modified 2018-12-12 04:00:27.385702)
size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB

The difference between epochs 1357881 and 1357883: the crush weight of one OSD was increased by 0.01, so we get 5 new pg_temp entries in osdmap.1357883, yet the inc_osdmap is this large.
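
A back-of-the-envelope check of whether per-PG pg_temp entries alone could account for an incremental of this size. The byte counts are assumptions about the wire encoding (a 17-byte encoded PG id plus a 4-byte list length per entry), not figures stated in this thread. The count of 52330 entries comes from a different epoch's incremental reported elsewhere in the thread, so this only shows the order of magnitude.

```python
# Assumed encoding sizes; treat these as illustrative, not authoritative.
PG_ID_BYTES = 1 + 8 + 4 + 4   # version byte, 64-bit pool, 32-bit seed, 32-bit legacy field
LIST_LEN_BYTES = 4            # 32-bit element count in front of the OSD list
OSD_ID_BYTES = 4              # each OSD id stored as a 32-bit integer

def pg_temp_bytes(num_entries, osds_per_entry=0):
    """Estimate the encoded size of num_entries pg_temp records."""
    per_entry = PG_ID_BYTES + LIST_LEN_BYTES + osds_per_entry * OSD_ID_BYTES
    return num_entries * per_entry

estimate = pg_temp_bytes(52330)          # every entry has an empty OSD list
print(estimate, round(estimate / 1e6, 2))
```

Under these assumptions the empty entries come to about 1.1 MB. That is in the same range as the 1.2MB incremental and about three times the 363KB full map.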

On Thu, Dec 6, 2018 at 06:20, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

>
> On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
>>
>> Hi guys
>>
>> I faced strange behavior when changing the crushmap. When I change the crush
>> weight of an OSD, I sometimes get an incremental osdmap (1.2MB) whose size is
>> significantly bigger than the size of the osdmap (0.4MB)
>
> This is probably because when CRUSH changes, the new primary OSDs for a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily reassigns it to the old acting set, so the data can be accessed while the new OSDs get backfilled. Depending on the size of your cluster, the number of PGs on it, and the size of the CRUSH change, this can easily be larger than the rest of the map because it is data with size linear in the number of PGs affected, instead of being more normally proportional to the number of OSDs.
> -Greg
>
>>
>> I use luminous 12.2.8. The cluster was installed long ago; I suppose
>> that initially it was firefly.
>> How can I view the content of an incremental osdmap, or can you give me your opinion
>> on this problem? I think that the spikes of traffic right after a change of the
>> crushmap are related to this crushmap behavior.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Best regards, Sergey Dolgov
