The result of the test with vm.pmap.pg_ps_enabled set to 1
is ... different from what I expected.
The values of vm.pmap.pde.p_failures and vm.pmap.pde.demotions
indicate that the page daemon has problems creating and
maintaining superpages. Maybe the load is too high
for superpages to be created, or the page daemon's algorithm
needs to be tuned.
Perhaps the FreeBSD list can offer more assistance
on the question of superpages and explain the high
values of vm.pmap.pde.p_failures and
vm.pmap.pde.demotions...
But one last try from me: the machine has 24 GB and Squid has
19 GB. My guess is that on the first fork the OS demotes many
superpages because it needs to map the child process into
virtual memory, and since superpages cannot be swapped out they
get demoted. The second fork then demotes even more superpages...
To make the first fork fast, Squid must stay below 10 GB, so that
Squid and its child together fit within physical memory
(roughly 2 x 10 GB on a 24 GB machine). You will also need a low
value for sleep_after_fork so that the exec happens before the
next fork is executed.
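To make that concrete, something like this in squid.conf (the
numbers are only an illustration, and check the unit of
sleep_after_fork in your Squid release; I believe it is
microseconds):

  cache_mem 8 GB          # keep the process well below half of RAM
  sleep_after_fork 1000   # short pause after each fork so the exec can run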
There are alternative solutions to the problem:
1. Redesign the URL rewriter as a multithreaded application that
accepts multiple requests from Squid simultaneously (use
url_rewrite_concurrency in squid.conf; see the config sketch
after this list).
This way there is only one child process and only one fork on
'squid -k rotate'.
2. Redesign the URL rewriter so that it rereads
its configuration/database on receipt of a signal, and
send this signal every 24 hours instead of doing the 'squid -k rotate'.
This way Squid does not fork at all
(though you may still want a 'squid -k rotate' once per week).
3. Modify the URL rewriter to accept multiple requests
simultaneously (use url_rewrite_concurrency 4) and reduce the
number of rewriters to 4. The URL rewriter is not made
multithreaded; it simply queues the requests.
This way Squid does only 4 forks instead of 15.
4. Use fewer URL rewriters. You might get an occasional
'not enough rewriters' warning from Squid, in which case the
redirector may be bypassed (use url_rewrite_bypass).
Depending on what exactly the URL rewriter does, this
might be acceptable.
This way Squid does fewer forks.
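For options 1, 3 and 4 the relevant squid.conf directives would
look roughly like this (the program path and the numbers are only
examples):

  url_rewrite_program /usr/local/bin/rewriter  # example path
  url_rewrite_children 4      # fewer helpers = fewer forks on rotate
  url_rewrite_concurrency 4   # needs a helper that speaks the concurrent protocol
  url_rewrite_bypass on       # option 4: skip rewriting when no helper is free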
Option 1 requires a lot of work and is probably the best solution.
Option 2 requires a lot of work and is less robust than option 1,
because during the reconfiguration it cannot serve requests from Squid.
Option 3 is most likely very simple, but it is unknown how much it helps.
Option 4 is simple, but depending on the functionality
of the rewriter it may or may not be acceptable. You need to experiment
to see if it helps.
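As a very rough sketch of option 3, a helper that speaks the
concurrent protocol but handles one request at a time could look
something like the Python below. The request format
('<channel-id> <URL> ...') and the reply format
('<channel-id> <rewritten-URL or empty>') are my recollection of
the concurrent redirector protocol, and rewrite() is only a
placeholder, so check the documentation of your Squid version
before relying on this:

  #!/usr/bin/env python
  # Sketch of a URL rewriter for url_rewrite_concurrency > 0.
  # Squid is assumed to prefix every request line with a channel id
  # and to expect that same id back at the start of the reply line.
  import sys

  def rewrite(url):
      # Placeholder for the real lookup/rewrite logic.
      return ""   # an empty reply means "leave the URL unchanged"

  while True:
      line = sys.stdin.readline()
      if not line:
          break
      parts = line.split()
      if not parts:
          continue
      channel_id = parts[0]
      url = parts[1] if len(parts) > 1 else ""
      # Echo the channel id so Squid can match the reply to the request.
      sys.stdout.write("%s %s\n" % (channel_id, rewrite(url)))
      sys.stdout.flush()   # replies must not sit in stdio buffers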
Marcus
Linda Messerschmidt wrote:
On Tue, Nov 24, 2009 at 9:52 PM, Marcus Kool
<marcus.kool@xxxxxxxxxxxxxxx> wrote:
Linda started this thread with huge performance problems
when Squid with a size of 12 GB forks 15 times.
I'm still not convinced that our memory usage is correct. Here's
where the cache_mem 10GB process is at right now:
Maximum Resident Size: 19376000 KB
Total accounted: 9735764 KB
Storage Mem size: 9296872 KB
338229 StoreEntries
338229 StoreEntries with MemObjects
So there is (9735764 - 9296872) 438,892 KB "overhead" (memory
accounted for that is not used for storing objects) and (19376000 -
9735764) 9,640,236 KB "unaccounted."
I will put the whole memory report at the bottom of this message
though it is very hard to read. :-)
Linda emailed me that she is doing a test with
vm.pmap.pg_ps_enabled set to 1 (the kernel will
transparently promote 4K pages into superpages),
which greatly relieves TLB management overhead
and will most likely reduce the performance problems with
forks of very large processes.
I apologize; I meant to send that to the whole list.
In any case, it did not help. This is from top while it was doing
daily rotation a few moments ago:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1205 squid 1 118 0 18650M 18675M CPU6 1 57:22 97.85% squid
74418 squid 1 118 0 16259M 24K CPU2 2 0:05 97.66% squid
It doesn't look like enabling superpages had much effect:
$ sysctl vm.pmap
vm.pmap.pmap_collect_active: 0
vm.pmap.pmap_collect_inactive: 0
vm.pmap.pv_entry_spare: 6673
vm.pmap.pv_entry_allocs: 105465793
vm.pmap.pv_entry_frees: 100913786
vm.pmap.pc_chunk_tryfail: 0
vm.pmap.pc_chunk_frees: 610560
vm.pmap.pc_chunk_allocs: 637695
vm.pmap.pc_chunk_count: 27135
vm.pmap.pv_entry_count: 4552007 <--------
vm.pmap.pde.promotions: 12657 <------
vm.pmap.pde.p_failures: 222291
vm.pmap.pde.mappings: 0
vm.pmap.pde.demotions: 11968 <-------
vm.pmap.shpgperproc: 200
vm.pmap.pv_entry_max: 7330186
vm.pmap.pg_ps_enabled: 1
Here are similar readings from the 16GB machine, where the squid
process has VSZ of 8599M:
vm.pmap.pmap_collect_active: 0
vm.pmap.pmap_collect_inactive: 0
vm.pmap.pv_entry_spare: 16299
vm.pmap.pv_entry_allocs: 2196230824
vm.pmap.pv_entry_frees: 2193907723
vm.pmap.pc_chunk_tryfail: 0
vm.pmap.pc_chunk_frees: 13548719
vm.pmap.pc_chunk_allocs: 13562644
vm.pmap.pc_chunk_count: 13925
vm.pmap.pv_entry_count: 2323101 <------------
vm.pmap.pde.promotions: 0
vm.pmap.pde.p_failures: 0
vm.pmap.pde.mappings: 0
vm.pmap.pde.demotions: 0
vm.pmap.shpgperproc: 200
vm.pmap.pv_entry_max: 4276871
vm.pmap.pg_ps_enabled: 0
Squid uses almost all of the memory in use on both systems, so a
rough comparison of pv entries (mapped pages) per MB of Squid
address space gives:
No superpages:  2323101 / 8599  = 270 per MB
Yes superpages: 4552007 / 18650 = 244 per MB
That's not quite fair, since it compares the whole system's page
table against squid's address space alone, but these machines are
all about squid, so it's close enough.
I also added up the VSZ of every process on the system and divided
it by the current number of pv entries; the system average is just
4.554 KB per entry, so it seems like there must be barely any
superpages in use. That is borne out by the (promotions - demotions)
figure of (12657 - 11968) = 689 active superpages, assuming that
means what I think it means.
Can we modify Squid to fork at a slower pace?
I think there is a config option (sleep_after_fork) for that already.
But it would not help as the processes are already forked one at a
time and the page table duplication happens before the exec. Even if
it did help, sleep_after_fork would have to be several seconds, and
squid would still be nonresponsive for sleep_after_fork * 16.
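(To put a number on it: with a hypothetical sleep_after_fork
equivalent to 3 seconds and 16 helpers, a rotate would leave Squid
unresponsive for roughly 3 * 16 = 48 seconds.)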
It sounds like this problem is a bit more serious and complicated than
I thought. I really hoped it was a config error on our part. :-(
Here is the full memory report as the process VSZ crossed over to
19,001 MB, which shows a total of 9,317,346 KB in use:
Current memory usage:
Pool columns (each row lists, in order):
  Obj Size:          (bytes)
  Allocated:         (#) (KB) high (KB) high (hrs) impact (%total)
  In Use:            (#) (KB) high (KB) portion (%alloc)
  Idle:              (#) (KB) high (KB)
  Allocations Saved: (number) (%num) (%vol)
  Hit Rate:          (%num) (number)
2K Buffer 2048 96 192 248 0.52 0 91 182 248 95
5 10 94 34958835 4.42 18.12 99.96 34972340
4K Buffer 4096 199 796 3904 11.98 0 194 776 3904
97 5 20 1720 7225790 0.91 7.49 98.41 7342427
8K Buffer 8192 21 168 272 12.05 0 17 136 272 81
4 32 176 1880965 0.24 3.90 98.40 1911548
16K Buffer 16384 3 48 64 13.39 0 0 0 64 0 3 48
64 3600 0.00 0.01 98.41 3658
32K Buffer 32768 2 64 128 14.56 0 0 0 128 0 2
64 128 369 0.00 0.00 98.14 376
64K Buffer 65536 2 128 192 12.81 0 0 0 192 0 2
128 128 88 0.00 0.00 95.65 92
Store Client Buffer 4096 22 88 1052 12.05 0 11 44 1052 50
11 44 812 30568572 3.87 31.69 99.74 30646811
acl 64 11 1 1 15.35 0 11 1 1 100 0 0
1 11 0.00 0.00 50.00 22
acl_ip_data 24 7 1 1 15.35 0 7 1 1 100 0 0 1
7 0.00 0.00 50.00 14
acl_list 24 18 1 1 15.35 0 18 1 1 100 0 0
1 18 0.00 0.00 50.00 36
dlink_node 24 337 8 8 0.52 0 0 0 8 0 337 8 8
1088 0.00 0.00 76.35 1425
FwdServer 24 41 1 6 12.05 0 9 1 6 22 32 1 5
3875752 0.49 0.02 99.87 3880603
HttpReply 168 338927 55606 55606 0.00 1 338922
55605 55606 100 5 1 23 10817354 1.37 0.46 96.17 11248509
HttpHeaderEntry 40 2602377 101656 101656 0.00 1 2602282
101652 101656 100 95 4 101 200511069 25.36 2.03 98.06
204487269
HttpHdrCc 40 27921 1091 1091 0.01 0 27914 1091
1091 100 7 1 3 7294400 0.92 0.07 99.17 7355123
HttpHdrRangeSpec 16 96 2 2 13.43 0 0 0 2 0 96 2 2
25596 0.00 0.00 99.31 25775
HttpHdrRange 16 2 1 1 12.07 0 0 0 1 0 2 1 1
21759 0.00 0.00 99.44 21881
HttpHdrContRange 24 5 1 1 13.46 0 0 0 1 0 5 1 1
10713 0.00 0.00 97.21 11021
intlist 16 3 1 1 15.35 0 3 1 1 100 0 0 1
3 0.00 0.00 50.00 6
MemObject 272 338920 90026 90026 0.00 1 338915
90025 90026 100 5 2 39 5022931 0.64 0.35 93.21 5388889
mem_node 4112 2338883 9392078 9392078 0.00 96
2338871 9392029 9392078 100 12 49 3908 15349368 1.94 15.98
81.20 18902877
relist 48 8 1 1 15.35 0 8 1 1 100 0 0 1
8 0.00 0.00 50.00 16
request_t 1360 341 453 893 12.05 0 326 433 893
96 15 20 496 9456631 1.20 3.26 98.60 9590792
StoreEntry 88 338920 29126 29126 0.00 0 338915
29126 29126 100 5 1 13 5025544 0.64 0.11 93.26 5388889
wordlist 16 14 1 1 15.35 0 11 1 1 79 3 1 1
88 0.00 0.00 86.27 102
ClientInfo 344 28096 9439 14657 12.40 0 15714 5279
14657 56 12382 4160 4852 653528 0.08 0.06 93.63 697962
MD5 digest 16 338920 5296 5296 0.00 0 338915 5296
5296 100 5 1 3 13112283 1.66 0.05 97.40 13462778
helper_request 40 353 14 14 0.52 0 1 1 14 0 352
14 14 4805587 0.61 0.05 99.99 4805971
storeSwapTLV 24 5 1 1 15.32 0 0 0 1 0 5 1 1
47627493 6.02 0.29 100.00 47627498
Short Strings 36 2939320 103336 103336 0.00 1 2939241
103333 103336 100 79 3 75 219162796 27.72 2.00 98.15
223283726
Medium Strings 128 15166 1896 1951 0.52 0 15126 1891
1951 100 40 5 89 50029245 6.33 1.62 99.39 50334410
Long Strings 512 5927 2964 3002 0.52 0 5911 2956
3002 100 16 8 143 9251916 1.17 1.20 99.22 9325056
cbdata acl_access (1) 56 13 1 1 15.35 0 13 1 1 100 0 0
1 13 0.00 0.00 50.00 26
cbdata aclCheck_t (2) 312
1 1 1 15.32 0 0 0 1 0 1 1 1 22575783 2.86 1.78 100.00 22575832
cbdata clientHttpRequest (3) 1096 21 23 481 0.52 0 15 17 481
71 6 7 387 4735510 0.60 1.31 98.21 4821935
cbdata ConnStateData (4) 320 194 61 299 11.98 0 190 60 299
98 4 2 137 2604058 0.33 0.21 97.41 2673373
cbdata ErrorState (5) 160 200 32 32 0.06 0 199 32 32 100
1 1 1 16266 0.00 0.00 98.79 16466
cbdata FwdState (6) 96 31 3 24 12.05 0 9 1 24 29 22 3
20 3871284 0.49 0.09 99.76 3880603
cbdata generic_cbdata (7) 32 30 1 4 15.32 0 0 0 4 0 30
1 4 14750620 1.87 0.12 100.00 14750938
cbdata helper (8) 136 1 1 1 15.35 0 1 1 1 100 0 0 0
0 0.00 0.00 0.00 1
cbdata helper_server (9) 152 16 3 3 15.35 0 16 3 3 100 0
0 3 32 0.00 0.00 66.67 48
cbdata HttpStateData (12) 152 38 6 37 12.05 0 9 2 37 24
29 5 32 3874204 0.49 0.15 99.85 3880194
cbdata ps_state (14) 200 1 1 1 15.32 0 0 0 1 0 1 1 1
3880602 0.49 0.20 100.00 3880603
cbdata RemovalPolicy (15) 104 1 1 1 15.35 0 1 1 1 100 0
0 0 0 0.00 0.00 0.00 1
cbdata store_client (18) 128 167 21 45 12.05 0 133 17 45
80 34 5 29 8643422 1.09 0.28 99.89 8652735
event 48 34 2 3 12.38 0 31 2 3 91 3 1 1
8007262 1.01 0.10 99.96 8010458
cbdata http_port_list (19) 128 2 1 1 14.75 0 2 1 1 100 0
0 0 0 0.00 0.00 0.00 2
cbdata body_size (20) 64 1 1 1 15.35 0 1 1 1 100 0 0 1
1 0.00 0.00 50.00 2
cbdata ConnectStateData (21) 96 16 2 11 15.32 0 0 0 11 0
16 2 11 3826157 0.48 0.09 99.96 3827561
close_handler 24 236 6 31 12.05 0 224 6 31 95 12
1 13 14200692 1.80 0.09 99.57 14261946
ipcache_entry 128 978 123 137 13.32 0 951 119 137
97 27 4 22 611237 0.08 0.02 98.86 618298
fqdncache_entry 160 20 4 4 15.35 0 20 4 4 100 0 0
4 24 0.00 0.00 54.55 44
cbdata idns_query (22) 8680 2 17 941 15.32 0 0 0 941 0 2
17 941 616884 0.08 1.36 99.78 618218
cbdata redirectStateData (23) 72 353 25 25 0.52 0 1 1 25
0 352 25 25 4805595 0.61 0.09 99.99 4805971
cbdata Logfile (26) 1120 3 4 4 15.35 0 3 4 4 100 0 0 4
3 0.00 0.00 50.00 6
pconn_data 32 6 1 1 13.34 0 1 1 1 17 5 1 1
61876 0.01 0.00 99.69 62069
pconn_fds 32 6 1 1 13.34 0 1 1 1 17 5 1 1
61889 0.01 0.00 99.71 62069
cbdata AddVaryState (29) 160 1 1 2 12.38 0 0 0 2 0 1 1
2 305954 0.04 0.01 99.84 306435
cbdata LocateVaryState (30) 136 3 1 1 12.38 0 0 0 1 0 3
1 1 571615 0.07 0.02 99.99 571661
VaryData 32 6 1 1 13.62 0 1 1 1 17 5 1 1
570046 0.07 0.00 99.72 571661
cbdata StoreUpdateState (31) 4152 2 9 41 12.34 0 0 0 41 0
2 9 41 1044654 0.13 1.10 99.71 1047729
Total - 9317346 9794810 9794810 0.00 100
9303255 9790117 9790185 100 14091 4693 5120 776329091 98.19
95.79 98.19 790614788
Cumulative allocated volume: 395.07 GB
Current overhead: 301304 bytes (0.003%)
Idle pool limit: 5.00 MB
memPoolAlloc calls: 790614788
memPoolFree calls: 781311532
String Pool Impact
(%strings) (%volume)
Short Strings 99 96
Medium Strings 1 2
Long Strings 0 3
Other Strings 0 0
Large buffers: 0 (0 KB)
Are we 100% certain this is normal / correct? Cutting the process
size in half would double the process spawn rate. :)
Thanks!