Re: VERY slow queries at random

Gunther Mayer <gunther.mayer@xxxxxxxxxxxxxx> · Thu, 07 Jun 2007 17:32:52 +0200

Scott Marlowe wrote:
Gunther Mayer wrote:
Hi there,

We run a small ISP with a FreeBSD/freeradius/postgresql 8.2.4 backend
and 200+ users. Authentication happens via UAM/hotspot and I see a lot
of authorisation and accounting packets that are handled via PL/PGSQL
functions directly in the database.

Everything seems to work 100% except that a few times a day I see

Jun  6 10:41:31 caligula postgres[57347]: [4-1] radiususer: LOG:
duration: 19929.291 ms  statement: SELECT fn_accounting_start(...)

in my logs. I'm logging slow queries with log_min_duration_statement =
500 in my postgresql.conf. Sometimes another query runs equally slow or
even slower (I've seen 139 seconds!!!) a few minutes before or after as
well, but then everything is back to normal.

Even though I haven't yet indexed my data I know that the system is
performant because my largest table (the accounting one) only has 5000+
rows, the entire database is only a few MB's and I have plenty of memory
(2GB), shared_buffers = 100MB and max_fsm_pages = 179200.  Also from
briefly enabling

log_parser_stats = on
log_planner_stats = on
log_executor_stats = on

I saw that most queries are 100% satisfied from cache so the disk
doesn't even get hit. Finally, the problem seems unrelated to load
because it happens at 4am just as likely as at peak traffic time.

What the heck could cause such erratic behaviour? I suspect some type of
resource problem but what and how could I dig deeper? 

Maybe your hard drive is set to spin down after a certain period of 
idle, and since most all your data is coming from memory, then it 
might be that on the rare occasion when it needs to hit the drive it's 
not spun up anymore.
I doubt that as a serious amount of logging is taking place on the box 
all the time which goes straight to disk. Also, no disk in the world 
would take more than a minute to spin up...
Maybe some other process is cranking up (cron jobs???) that are 
chewing up all your I/O bandwidth?
Hmm, I investigated that too but if that was the case the queries would 
run slow always at the same time of the day.
Hard to say.  Anything in the system logs that would give you a hint?  
Try correlating them by the time of the slow pgsql queries.
Nothing relevant in the system logs at the time of the slow query 
appearing. I have in the mean time tweaked syslog-ng.conf such that as 
soon as it detects a "duration: <greater than 500>ms" log message it 
spawns top and top -m io and redirects the output to file. At least in 
that way I can check what's keeping the system busy immediately *after* 
a slow query has occured. Of course now Murphy's law has it that since 
I've done that (30 hours ago) not a single slow query has fired, but 
hey, I'll look at the results once I have them.

On another note, autovacuum couldn't cause such issues, could it? I do 
have autovacuum enabled (autovacuum=on as well as 
stats_start_collector=on, stats_block_level = on and stats_row_level = 
on), is there any possibility that autovacuum is not as resource 
friendly as advertised?

Gunther