Any idea as to what could be the issue? Since the last post 2 days ago the memory footprint has increased by 1.2 GB to 9424592 KB, with now 788561 hot objects (up from 651556). The number of StoreEntry-pool objects (which I assume is the number of real objects in the memory cache) has even decreased slightly (from 173746 to 173624).

I created a full dump of mgr:vm_objects and there I find 789431 KEYs - so in principle close to the number of hot objects. The question is: can we infer some information from this output? Here are some statistics on those objects.

Distribution of the 1st line after KEY:

Count   line1
718698  STORE_OK IN_MEMORY SWAPOUT_DONE PING_DONE
 63156  STORE_OK IN_MEMORY SWAPOUT_DONE PING_NONE
  7516  STORE_OK IN_MEMORY SWAPOUT_NONE PING_DONE
    51  STORE_OK IN_MEMORY SWAPOUT_NONE PING_NONE
     6  STORE_OK NOT_IN_MEMORY SWAPOUT_NONE PING_DONE
     3  STORE_PENDING NOT_IN_MEMORY SWAPOUT_NONE PING_DONE
     1  STORE_PENDING NOT_IN_MEMORY SWAPOUT_NONE PING_NONE

Distribution of the 2nd line after KEY:

Count   line2
515372  REVALIDATE,CACHABLE,DISPATCHED,VALIDATED
237538  CACHABLE,DISPATCHED,VALIDATED
 28944  CACHABLE,VALIDATED
  7048  CACHABLE,DISPATCHED,NEGCACHED,VALIDATED
   468  REVALIDATE,CACHABLE,DISPATCHED,NEGCACHED,VALIDATED
    51  SPECIAL,CACHABLE,VALIDATED
     5  RELEASE_REQUEST,DISPATCHED,PRIVATE,VALIDATED
     2  REVALIDATE,RELEASE_REQUEST,DISPATCHED,PRIVATE,VALIDATED
     2  CACHABLE,DISPATCHED,PRIVATE,FWD_HDR_WAIT,VALIDATED
     1  DELAY_SENDING,RELEASE_REQUEST,PRIVATE,VALIDATED

And here the count of objects that share the same URL; an occurrence count > 1 indicates that a Vary policy is in place and we hold multiple objects for that URL:

Obj_count  URL occurrences
   720711  1
    23276  2
     2216  3
     1134  4
      588  5
      283  6
      214  7
      111  8
       72  9
       70  10
       81  11
       37  12
       30  13
       21  14
        4  15
        2  16
        3  17
        5  18
       10  19
        1  20
        1  21
        1  22
        2  25
        2  28

Objects with vary_headers in the object: 40591.

If I sum up the "inmem_hi:" values I get 2918369522 bytes, so 2.9 GB. So it seems as if there must be some major overhead for those in-memory objects...
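For reference, the statistics above were produced with a quick reduction script along the following lines. This is an untested sketch: it assumes every entry in the dump starts with a "KEY ..." line, that the two lines directly after it are the status and flags lines, and that fields such as "inmem_hi:" and "vary_headers:" each sit on a line of their own.

# reduce a full mgr:vm_objects dump to the statistics shown above
from collections import Counter
import sys

line1 = Counter()    # 1st line after each KEY (store/mem/swapout/ping state)
line2 = Counter()    # 2nd line after each KEY (entry flags)
per_url = Counter()  # number of objects per URL
keys = vary = inmem_hi_sum = 0
after_key = 0

with open(sys.argv[1]) as dump:
    for raw in dump:
        line = raw.strip()
        if line.startswith("KEY "):
            keys += 1
            after_key = 1
        elif after_key == 1:
            line1[line] += 1
            after_key = 2
        elif after_key == 2:
            line2[line] += 1
            after_key = 0
        elif line.startswith(("GET ", "HEAD ", "POST ")):
            per_url[line.split(None, 1)[1]] += 1
        elif line.startswith("vary_headers:"):
            vary += 1
        elif line.startswith("inmem_hi:"):
            inmem_hi_sum += int(line.split(":", 1)[1])

print(keys, "KEYs,", vary, "with vary_headers,", inmem_hi_sum, "bytes inmem_hi")
for value, count in line1.most_common():
    print(count, value)
# the per-URL table is just the distribution of the per-URL counters:
for n, c in sorted(Counter(per_url.values()).items()):
    print(c, "URLs with", n, "object(s)")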
If I look at the "* locks, * clients, * refs" lines, and there specifically at the refs value, I get the following distribution:

Obj_count  ref_val
    12240  0
   592487  1
    78355  2
    25285  3
    12901  4
     8173  5
     5787  6
     4100  7
     3143  8
     2541  9
     2318  10
     1859  11
     1725  12
     1470  13
     1275  14
     1231  15
     1042  16
      867  17
      853  18
      723  19
      643  20
      669  21
      631  22
      574  23
      496  24
      469  25
      431  26
      423  27
      464  28
      394  29
      357  30
      368  31
      350  32
      315  33
      330  34
      280  35
      299  36
      239  37
      264  38
      218  39
      ...
        1  65000
        1  65017
        1  65028
        1  65074
        1  65089
        1  65183
        1  65248
        1  65299
        1  65364
        1  65521

As for expiry times, here is the expiry time in days relative to "now" (negative values are in the past), with the number of objects in cache for each:

Obj_count  EXP (days)
    42511  -16267
    12585  -6199
        1  -209
        1  -172
        1  -171
        2  -169
        1  -157
        1  -149
     1635  -85
     2233  -84
      701  -83
      388  -82
      336  -81
      234  -80
      175  -79
      139  -78
       88  -77
       85  -76
       63  -75
       82  -74
       58  -73
       48  -72
       49  -71
       49  -70
       32  -69
       50  -68
       20  -67
       25  -66
       32  -65
       49  -64
       22  -63
       39  -62
       32  -61
       32  -60
       19  -59
       13  -58
        9  -57
       10  -56
       24  -55
       14  -54
       47  -53
       24  -52
       27  -51
       24  -50
       17  -49
       36  -48
       75  -47
       38  -46
       58  -45
       61  -44
       14  -43
       55  -42
       23  -41
       27  -40
       42  -39
       53  -38
       46  -37
       68  -36
      101  -35
       52  -34
       52  -33
       35  -32
       88  -31
       39  -30
       39  -29
       58  -28
       86  -27
       77  -26
       83  -25
       83  -24
       77  -23
       79  -22
      123  -21
      123  -20
      176  -19
      128  -18
      170  -17
      141  -16
      153  -15
      144  -14
      101  -13
      122  -12
      342  -11
      220  -10
      177  -9
      212  -8
    27001  -7
    61767  -6
    71550  -5
    79084  -4
    82293  -3
    91091  -2
   113077  -1
   102432  -0
    79068  0
    13197  1
        1  8
      286  168
      121  169
       57  170
       30  171
       17  172
      114  173
      610  174
      656  175
      325  176
      169  177
      245  178
      198  179
       55  180
       30  233
        7  234
        3  246
        3  269
        1  288
        1  317
        1  331
        1  336
        1  340
        1  343
        3  349
        3  350
        1  352
        1  353
        1  355
        4  358
        2  360
        1  361
        2  362
        3  363
        2  364
        7  365
        3  3376
        1  3650

(Note that -16267 days before July 2014 is around January 1970, i.e. a zero timestamp, so those presumably are unset values.)

So of all the 789431 objects, 694199 have an EX: value in the past - that is 88% of all objects! And 65904 of those have an expiry date older than the start of the squid process.

The LastUpdated values show the following distribution:

Obj_count  last updated (days)
      195  0
     6056  -0
    10468  -1
     6085  -2
     4321  -3
     8896  -4
     5172  -5
     7925  -6
    13158  -7
     9479  -8
     1368  -9
      826  -10
      681  -11
      376  -12
     2489  -13
       63  -14
     9305  -15
     1912  -16
     1804  -18
      630  -19
     2982  -20
       11  -21
     1171  -22
     4629  -23
        1  -24
     7194  -25
      275  -27
        4  -28
    12798  -29
     3024  -30
     5054  -32
      ...
        1  -3288
        1  -3290
        1  -3307
        6  -3327
        4  -3374
       33  -3375
        3  -3381
       25  -3390
        2  -4547
   164525  -16267

And last referenced:

Obj_count  days ago
    36823  0
   140468  -0
   127974  -1
   104453  -2
    86550  -3
    87259  -4
    79582  -5
    77286  -6
    49036  -7

In summary, the way I interpret this is:
* the memory overhead per cached object seems to be quite high
* a lot of objects have expired but have never been evicted from the cache
* possibly eviction does not happen because the calculated cache size is only 2.9 GB, while 4 GB is configured as maximum memory...

So the question is: why do we underestimate memory_object sizes by a factor of approximately 2?
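As a back-of-envelope check of that factor (using only the numbers quoted above; attributing one mem_node data page to every hot object is my assumption, not something squid reports directly):

hot_objects = 788561  # KEYs in the vm_objects dump
payload = 2918369522  # sum of all "inmem_hi:" values, in bytes (GB = 10^9 below)

avg = payload / hot_objects
print(f"average payload per object: {avg:.0f} bytes (~{avg / 1024:.1f} KB)")
# -> ~3701 bytes, i.e. slightly less than one 4 KB mem_node page

pages = hot_objects * 4096  # at least one 4 KB data page per object
print(f"mem_node pages alone: {pages / 1e9:.1f} GB "
      f"vs {payload / 1e9:.1f} GB of accounted payload")
# MemObject, HttpReply, HttpHeaderEntry and string allocations come on top

If that is roughly right, the real cost of the 2.9 GB of accounted payload would be somewhere around twice that, which would explain why the accounted size never reaches the configured 4 GB and nothing gets evicted.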
Does this help with the analysis?

Thanks,
Martin

-----Original Message-----
From: Martin Sperl
Sent: Monday, 14 July 2014 13:37
To: Amos Jeffries; squid-users@xxxxxxxxxxxxxxx
Subject: RE: squid: Memory utilization higher than expected since moving from 3.3 to 3.4 and Vary: working

Hi!

I did a bit of an analysis of the data gathered so far.

Current status: 8236072 KB of memory allocated by squid since its restart on the 8th, so over about 5-6 days.

The following memory pools show the biggest increase over the last 2 days (>100 KB):

Type                  KB-20140712  KB-20140714  KB-Delta  Cnt-20140712  Cnt-20140714  Cnt-Delta
Total                     5629096      7494319   1865223      26439770      33704210    7264440
mem_node                  2375038      3192370    817332        588017        790374     202357
4K Buffer                 1138460      1499996    361536        284615        374999      90384
Short Strings              456219       606107    149888      11679188      15516319    3837131
16K Buffer                 213120       323120    110000         13320         20195       6875
HttpHeaderEntry            312495       415162    102667       5714194       7591516    1877322
2K Buffer                  249876       351226    101350        124938        175613      50675
8K Buffer                  135944       182360     46416         16993         22795       5802
HttpReply                  133991       178174     44183        490023        651607     161584
MemObject                  114845       152713     37868        490004        651575     161571
Medium Strings              90893       120859     29966        727141        966866     239725
cbdata BodyPipe (39)        65367        88238     22871        440363        594443     154080
HttpHdrCc                   41486        55327     13841        442515        590153     147638
32K Buffer                  23584        35360     11776           737          1105        368
cbdata MemBuf (13)          30627        40726     10099        490026        651615     161589
LRU policy node             46068        49871      3803       1965553       2127797     162244
64K Buffer                   1664         2240       576            26            35          9
Long Strings                 1444         2007       563          2888          4014       1126
StoreEntry                 173530       173746       216       1480781       1482628       1847

All of those show linear increases. They also show similar "wavy" behavior: when one has a "bump", some of the others have a bump as well.

So there are several "groups":
* pools that stay constant (wordlist, ...)
* pools that show variability matching our traffic curves (Comm::Connections)
* pools that increase minimally - starting at 80% of the current KB 2 days ago (ip_cache_entry, LRU_policy_node)
* a pool that increases a bit - starting at 35% of the current KB 2 days ago (fqdncache_entry)
* pools that increase a lot - starting at below 20% of the current KB 2 days ago - sorted from biggest to smallest KB footprint:
** mem_node
** 4K Buffer
** Short Strings
** HttpHeaderEntry
** 2K Buffer
** 16K Buffer
** 8K Buffer
** HttpReply
** MemObject
** Medium Strings
** cbdata BodyPipe (39)
** HttpHdrCc
** cbdata MemBuf (13)
** 32K Buffer
** Long Strings

So there must be something that links all the pools in this last group together. If you look at the hour-to-hour percentage deltas, most of these pools again show a "traffic-curve" pattern in their increase (which is the wavy part I was talking about earlier). All the pools in this specific group show (again) similar behavior with similar ratios.

So it seems to me as if we are keeping too much information in our cache which never gets evicted - as I said earlier, my guess would be the extra info needed to manage "Vary", possibly combined with some cleanup process not evicting all the "related" objects in the cache...
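By the way, the pool deltas above come from diffing two mgr:mem snapshots with a small script roughly like the following. It is a sketch only: the file names are made up, and it assumes the snapshots were pre-reduced to lines of "<pool name> <allocated KB> <object count>" (the real mgr:mem output has many more columns):

# print the pools that grew by more than 100 KB between two snapshots
import sys

def load(fname):
    pools = {}
    with open(fname) as f:
        for line in f:
            if not line.strip():
                continue
            # pool names may contain spaces, so split from the right
            name, kb, cnt = line.rstrip().rsplit(None, 2)
            pools[name] = (int(kb), int(cnt))
    return pools

old, new = load(sys.argv[1]), load(sys.argv[2])

rows = []
for name, (kb2, cnt2) in new.items():
    kb1, cnt1 = old.get(name, (0, 0))
    if kb2 - kb1 > 100:  # only pools that grew by more than 100 KB
        rows.append((kb2 - kb1, name, kb1, kb2, cnt1, cnt2, cnt2 - cnt1))

print(f"{'Type':22} {'KB-old':>10} {'KB-new':>10} {'KB-Delta':>9} "
      f"{'Cnt-old':>12} {'Cnt-new':>12} {'Cnt-Delta':>10}")
for d, name, kb1, kb2, cnt1, cnt2, cd in sorted(rows, reverse=True):
    print(f"{name:22} {kb1:>10} {kb2:>10} {d:>9} {cnt1:>12} {cnt2:>12} {cd:>10}")

Invoked for example as: python pooldiff.py mem-20140712.txt mem-20140714.txt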
This is when I started to look at the variation reported in some other values. Here are the values of "StoreEntries" for the last few days:

20140709-020001: 1472007 StoreEntries
20140710-020001: 1475545 StoreEntries
20140711-020001: 1478025 StoreEntries
20140712-020001: 1480771 StoreEntries
20140713-020001: 1481721 StoreEntries
20140714-020001: 1482608 StoreEntries

These stayed almost constant. But looking at "StoreEntries with MemObjects" the picture is totally different:

20140709-020001: 128542 StoreEntries with MemObjects
20140710-020001: 275923 StoreEntries with MemObjects
20140711-020001: 387990 StoreEntries with MemObjects
20140712-020001: 489994 StoreEntries with MemObjects
20140713-020001: 571872 StoreEntries with MemObjects
20140714-020001: 651560 StoreEntries with MemObjects

And "on-disk objects" (constant again):

20140709-020001: 1470163 on-disk objects
20140710-020001: 1472215 on-disk objects
20140711-020001: 1473671 on-disk objects
20140712-020001: 1475614 on-disk objects
20140713-020001: 1475933 on-disk objects
20140714-020001: 1476291 on-disk objects

And "Hot Object Cache Items":

20140709-020001: 128532 Hot Object Cache Items
20140710-020001: 275907 Hot Object Cache Items
20140711-020001: 387985 Hot Object Cache Items
20140712-020001: 489989 Hot Object Cache Items
20140713-020001: 571862 Hot Object Cache Items
20140714-020001: 651556 Hot Object Cache Items

So if you look at the finer details, we again see the traffic pattern for:
* StoreEntries with MemObjects
* Hot Object Cache Items

and these show behavior similar to the pools mentioned above. The other two counters stay fairly constant and at times even decrease in count.

So maybe all this gives additional evidence as to which objects are using so much more memory.

Did this give any hints? Do you want to see any other data gathered?

Martin