Re: size of inc_osdmap vs osdmap

Sergey Dolgov <palza00@xxxxxxxxx> · Fri, 28 Dec 2018 00:20:12 +0300

We investigated the issue and set debug_mon to 20. During a small osdmap change we get many messages like the following for every PG of every pool (across the whole cluster):
2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_tempnext_up === next_acting now, clear pg_temp
2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming []
2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
although no pg_temps are created as a result (not a single backfill).
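
To check how many PGs go through priming and how many of them end up with an empty "priming []" list, the debug output can be tallied with a small script. This is only a rough sketch: the regular expression is written against the log lines quoted above, and the function name summarize_priming is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming []
# The "prime_pg_tempnext_up === ..." lines have no space after the function
# name, so they are deliberately not matched here.
PRIME_RE = re.compile(
    r"prime_pg_temp (?P<pgid>\d+\.[0-9a-f]+) "
    r"\[(?P<cur_up>[\d,]*)\]/\[(?P<cur_acting>[\d,]*)\] -> "
    r"\[(?P<next_up>[\d,]*)\]/\[(?P<next_acting>[\d,]*)\], "
    r"priming \[(?P<priming>[\d,]*)\]"
)


def summarize_priming(lines):
    """Count primed PGs per pool and split them by empty/non-empty priming list."""
    per_pool = Counter()
    empty = 0
    non_empty = 0
    for line in lines:
        match = PRIME_RE.search(line)
        if match is None:
            continue  # not a per-PG priming line
        pool = match.group("pgid").split(".")[0]
        per_pool[pool] += 1
        if match.group("priming"):
            non_empty += 1
        else:
            empty += 1
    return {"per_pool": dict(per_pool), "empty": empty, "non_empty": non_empty}
```

It can be fed the monitor log directly, e.g. summarize_priming(open(path)) with path pointing at the mon log file. If the "empty" count is close to the total number of PGs in the cluster, that supports the observation that every PG is being processed on each change.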

We suspect this behavior changed in commit https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c because before that change the function OSDMonitor::prime_pg_temp should have returned at https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009 as it does in jewel at https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214

I accept that we may be mistaken.

On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Hmm that does seem odd. How are you looking at those sizes?

On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
Greg, for example, for our cluster of ~1000 OSDs:

size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860, modified 2018-12-12 04:00:17.661731)
size osdmap.1357882__0_F7FE772D__none = 363KB
size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861, modified 2018-12-12 04:00:27.385702)
size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB

The difference between epochs 1357881 and 1357883: the crush weight of one OSD was increased by 0.01, so we get 5 new pg_temp entries in osdmap.1357883, yet the inc_osdmap is this large.
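
One way to see what takes up the space is to decode the incremental map to JSON and rank its top-level fields by size. The sketch below assumes the map has already been dumped to JSON, for example with something like "ceph-dencoder type OSDMap::Incremental import <file> decode dump_json" (please check the exact invocation for your release). It makes no assumption about field names, and the function names are made up for illustration. JSON size is only a rough proxy for the encoded size on disk.

```python
import json


def rank_fields_by_size(dump):
    """Return (field, json_size_in_chars) pairs, largest first.

    `dump` is the decoded incremental map as a dict. The size is the length
    of the compact JSON serialization of each top-level value, which is only
    an approximation of the binary encoding.
    """
    sizes = {
        key: len(json.dumps(value, separators=(",", ":")))
        for key, value in dump.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


def rank_file(path):
    """Load a JSON dump from `path` and rank its top-level fields by size."""
    with open(path) as handle:
        return rank_fields_by_size(json.load(handle))
```

If a single field dominates the output of rank_file() for the 1.2MB incremental, that field is the first place to look.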

On Thu, Dec 6, 2018 at 06:20, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

>
> On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov <palza00@xxxxxxxxx> wrote:
>>
>> Hi guys
>>
>> I faced strange behavior on crushmap changes. When I change the crush
>> weight of an osd I sometimes get an incremental osdmap (1.2MB) whose size is
>> significantly bigger than the size of the osdmap (0.4MB)
>
> This is probably because when CRUSH changes, the new primary OSDs for a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily reassigns it to the old acting set, so the data can be accessed while the new OSDs get backfilled. Depending on the size of your cluster, the number of PGs on it, and the size of the CRUSH change, this can easily be larger than the rest of the map because it is data with size linear in the number of PGs affected, instead of being more normally proportional to the number of OSDs.
> -Greg
>
>>
>> I use luminous 12.2.8. The cluster was installed long ago; I suppose
>> that initially it was firefly.
>> How can I view the content of an incremental osdmap, or can you give me your opinion
>> on this problem? I think that the spikes of traffic right after a change of
>> the crushmap relate to this crushmap behavior
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

Best regards, Sergey Dolgov
