I'm running pgBadger over the log, and will get some queries to explain analyze them.
Daniel Cristian Cruz
クルズ クリスチアン ダニエル
The 9.2 cluster is running in the same server as the production, so I will try to compare some critical explains and publish on explain.depesz.com.
2013/4/21 Adrian Klaver <adrian.klaver@xxxxxxxxx>
-- On 04/21/2013 09:29 AM, Daniel Cristian Cruz wrote:
2013/4/21 Adrian Klaver <adrian.klaver@xxxxxxxxx
<mailto:adrian.klaver@gmail.com>>
1)
Major upgrade from 9.1.4 to 9.2.4.
Used pg_upgrade
Tested on VM with 9.2.4 and no problems.
Same machine used for production server 9.1.4 and 9.2.4
When complex queries where run on production server under 9.2.4
memory usage climbed out of control.
Unanswered questions:
a) Data set sizes between test and production machines, how do they
differ?
It's the same on both; we do a dump/restore every day to the development
/ issue team work.
Which begs the question, what is different about your test setup that makes it not act up?
test environment:
virtual server
8 cores
12 GB RAM
4GB SWAP
max_connections = 200
shared_buffers = 800MB
temp_buffers = 32MB
work_mem = 256MB
maintenance_work_mem = 768MB
(no max_stack_depth set)
shared_preload_libraries = '$libdir/plpgsql,$libdir/plpython2,$libdir/pgxml,$libdir/pg_stat_statements'
wal_level = minimal
checkpoint_segments = 15
archive_mode = off
max_wal_senders = 0
(no effective_cache_size set)
constraint_exclusion = partition
log_min_duration_statement = 1000
(no log_temp_files set)
(no statement_timeout set)
max_locks_per_transaction = 1024
production:
true server
24 cores
96GB RAM
50GB SWAP
max_connections = 1000
shared_buffers = 6GB
temp_buffers = 24MB
work_mem = 24MB
maintenance_work_mem = 128MB
max_stack_depth = 8MB
(no shared_preload_libraries, disabled after problems because pg_stat_statements was a new module)
wal_level = hot_standby
checkpoint_segments = 20
archive_mode = on
(archive_command set)
max_wal_senders = 1
effective_cache_size = 32GB
(no constraint_exclusion set)
log_min_duration_statement = 5000
log_temp_files = 0
statement_timeout = 300000
(no max_locks_per_transaction set)
We know that the test servers are running on VMs with fewer resources than the production server.
So:
Are the VMs running the same OS and OS version as the production server?
test: Red Hat Enterprise Linux Server release 5.5, Linux 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:16 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
production: CentOS release 5.5, Linux 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
What are 'hardware differences' between the test VMs and the physical server?
above.
Are the Postgres configurations different for the test vs production servers?
Yes, some of them, shown above.
I would guess the usage pattern is different, but in what way?
Number of connections/sessions?
300 connections in production, 50 in test.
INSERT/UPDATE/DELETE pattern?
test: just test cases, development cases and issue cases
production: for 1 minute (00:01:00.076343), 583 inserts, 306 updates and 13 deletes and 3300 transactions (xacts_commits)
Client software using the database?
Sites using PHP with and without connection pool and with and without Doctrine; a huge system with Java and Hibernate using the JBoss pooler, Java aplication is the main user.
For a lack of anything else pick one and try it on the various servers to see if something stands out.
b) What are the EXPLAIN/ANALYZE results for a query on 9.1.4, 9.2.4
test and 9.2.4 production?
Since there is no single query causing the problem, I don't know if it could help
Yes , here it is:
query1:
EXPLAIN ANALYZE SELECT ced.id_evento, ced.inicio, ced.termino, ced.evento, ced.id_eventos IS NOT NULL AS aula_dividida, ac.titulo, ced.id_tipo_evento, ced.tipo_evento, ac.media_referencia, p.nome, ef.nome AS nomeEspacoFisico, ( SELECT count ( pre2.presente ) > 0 FROM turma.presenca pre2 WHERE pre2.id_aula = ac.id_aula AND pre2.id_evento = ac.id_evento AND pre2.id_diario = '64469' ) AS presenca, ced.id_aula FROM recurso.consulta_evento_diario ced LEFT JOIN recurso.evento e USING ( id_evento ) LEFT JOIN recurso.espaco_fisico ef USING ( id_espaco_fisico ) LEFT JOIN turma.aula_calendario ac USING ( id_aula, id_evento ) LEFT JOIN recurso.evento_participante ep USING ( id_evento ) LEFT JOIN senai.pessoa p USING ( id_pessoa ) WHERE id_diario = '64469' AND ced.id_evento NOT IN ( SELECT ec.id_evento_sobreposto FROM recurso.evento_conflito ec WHERE ec.id_evento_sobreposto = ced.id_evento AND ec.ignorado IS NULL ) AND ced.inicio BETWEEN '2013-04-14 00:00:00' AND '2013-04-20 23:59:59.999999' ORDER BY inicio;
server 9.1:
server 9.2:
After run this one on server 9.2, RES memory reached 6.5GB, VIRT 15GB.
query 2:
EXPLAIN ANALYZE SELECT count ( consulta_diario.id_diario ) FROM turma.consulta_diario LEFT JOIN turma.turma_cancelamento ON consulta_diario.id_turma = turma_cancelamento.id_turma WHERE turma_cancelamento.id_turma IS NULL AND consulta_diario.id_unidade_curricular_tipo IN ( 1, 6, 7, 8 ) AND ( 8365 = ANY ( consulta_diario.id_colaborador_coordenadores ) OR 2252 = ANY ( consulta_diario.id_docentes ) ) AND consulta_diario.id_unidade_execucao IN ( 33, 33, 46, 46, 53, 53 ) AND consulta_diario.situacao_diario LIKE 'Em Andamento' LIMIT '2';
server 9.1:
server 9.2:
No change in RES memory after this one (stayed at 6.5GB).
query 3:
EXPLAIN ANALYZE WITH justificativas AS ( SELECT justificativa_falta_aula.id_matricula, justificativa_falta_aula.id_diario, justificativa_falta_aula.id_aula, justificativa_falta_aula.id_evento FROM turma.presenca JOIN recurso.evento ON evento.id_evento = presenca.id_evento LEFT JOIN matricula.justificativa_falta_aula JOIN matricula.justificativa_falta ON justificativa_falta.id_justificativa_falta = justificativa_falta_aula.id_justificativa_falta JOIN matricula.justificativa_falta_tipo ON justificativa_falta_tipo.id_justificativa_falta_tipo = justificativa_falta.id_justificativa_falta_tipo LEFT JOIN matricula.parecer ON parecer.id_parecer = justificativa_falta.id_parecer ON justificativa_falta_tipo.tipo = 'Abono' AND justificativa_falta_aula.id_matricula = presenca.id_matricula AND justificativa_falta_aula.id_diario = presenca.id_diario AND justificativa_falta_aula.id_aula = presenca.id_aula AND justificativa_falta_aula.id_evento = presenca.id_evento AND ( NOT justificativa_falta.encaminhar OR parecer.aceito ) WHERE justificativa_falta_aula.id_matricula = 147124 ) SELECT id_diario, to_char ( ( contagem.carga_uc - COALESCE ( contagem.carga_ausencias, 0 ) ) / contagem.carga_uc * 100, '990D99' ) AS frequencia FROM ( SELECT estudante.id_matricula, estudante.id_diario, extract ( EPOCH FROM SUM ( evento.termino - evento.inicio ) ) AS carga_uc, extract ( EPOCH FROM SUM ( CASE WHEN aula_confirmacao.confirmada AND evento.inicio <= CURRENT_DATE AND NOT presenca.presente AND justificativas.id_evento IS NULL THEN evento.termino - evento.inicio END ) ) AS carga_ausencias FROM turma.estudante JOIN turma.presenca ON presenca.id_diario = estudante.id_diario AND presenca.id_matricula = estudante.id_matricula JOIN recurso.evento ON evento.id_evento = presenca.id_evento LEFT JOIN turma.aula_confirmacao ON aula_confirmacao.id_evento = presenca.id_evento AND aula_confirmacao.id_aula = presenca.id_aula LEFT JOIN matricula.matricula_cancelamento ON matricula_cancelamento.id_matricula = estudante.id_matricula LEFT JOIN justificativas ON justificativas.id_matricula = presenca.id_matricula AND justificativas.id_diario = presenca.id_diario AND justificativas.id_aula = presenca.id_aula AND justificativas.id_evento = presenca.id_evento WHERE matricula_cancelamento.id_matricula IS NULL AND estudante.id_matricula = 147124 GROUP BY estudante.id_matricula, estudante.id_diario ) AS contagem;
server 9.1:
server 9.2:
100MB more in RES memory after this one.
Thanks,
Daniel Cristian Cruz
クルズ クリスチアン ダニエル