Also, to give some perspective on the equivalent writes to CLOG, I used
the following script, which runs for 10 seconds and tracks all writes to
the clog directory, keyed by pathname and by the userland buffer address
(arg1 of write(2); it prints as a signed 64-bit value, hence the negative
numbers below). Here is what it came up with. (This is with 500 users
running.)
# cat write.d
#!/usr/sbin/dtrace -s
/* Count writes by postgres backends to files under pg_clog,
   keyed by pathname and userland buffer address (arg1). */
syscall::write:entry
/execname=="postgres" &&
dirname(fds[arg0].fi_pathname)=="/export/home0/igen/pgdata/pg_clog"/
{
	@write[fds[arg0].fi_pathname, arg1] = count();
}
/* Stop after 10 seconds. */
tick-10sec
{
	exit(0);
}
# ./write.d
dtrace: script './write.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3   1026                       :tick-10sec

  /export/home0/igen/pgdata/pg_clog/001E              -2753028277088        1
#
I modified read.d to stop after 5 seconds instead of waiting for Ctrl-C.
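The modified script itself isn't shown here; a plausible reconstruction,
assuming the only changes are an added BEGIN probe and a tick-5sec exit
(which would account for the three matched probes reported below):

#!/usr/sbin/dtrace -s
/* Hypothetical reconstruction of the modified read.d, not the original. */
BEGIN
{
	/* No setup needed; this clause only produces the :BEGIN line. */
}

syscall::read:entry
/execname=="postgres"/
{
	@read[fds[arg0].fi_pathname, arg1] = count();
}

/* Stop after 5 seconds instead of waiting for Ctrl-C. */
tick-5sec
{
	exit(0);
}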
# ./read.d
dtrace: script './read.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0      1                            :BEGIN
  0   1027                        :tick-5sec

  /export/home0/igen/pgdata/pg_clog/001F              -2753028268896        1
  /export/home0/igen/pgdata/pg_clog/001F              -2753028252512        1
  /export/home0/igen/pgdata/pg_clog/001F              -2753028285280        2
  /export/home0/igen/pgdata/pg_clog/001F              -2753028277088        3
  /export/home0/igen/pgdata/pg_clog/001F              -2753028236128        3
  /export/home0/igen/pgdata/pg_clog/001E              -2753028285280        5
  /export/home0/igen/pgdata/pg_clog/001E              -2753028236128        9
  /export/home0/igen/pgdata/pg_clog/001E              -2753028277088       13
  /export/home0/igen/pgdata/pg_clog/001E              -2753028268896       15
  /export/home0/igen/pgdata/pg_clog/001E              -2753028252512       27
#
So the ratio of reads to writes on the clog files is pretty huge: the
5-second read trace caught 79 reads, while the 10-second write trace
caught a single write.
-Jignesh
Jignesh K. Shah wrote:
Tom,
Here is what I did:
I started aggregating all the read information.
First I also grouped by pid (arg0, arg1, pid), and the counts all came
back as 1.
Then I grouped by filename and location only (arg0, arg1 of the reads),
and the counts came back as:
# cat read.d
#!/usr/sbin/dtrace -s
/* Count reads by postgres backends, keyed by pathname and
   userland buffer address (arg1); Ctrl-C prints the totals. */
syscall::read:entry
/execname=="postgres"/
{
	@read[fds[arg0].fi_pathname, arg1] = count();
}
# ./read.d
dtrace: script './read.d' matched 1 probe
^C

  /export/home0/igen/pgdata/pg_clog/0014              -2753028293472        1
  /export/home0/igen/pgdata/pg_clog/0014              -2753028277088        1
  /export/home0/igen/pgdata/pg_clog/0015              -2753028244320        2
  /export/home0/igen/pgdata/pg_clog/0015              -2753028268896       14
  /export/home0/igen/pgdata/pg_clog/0015              -2753028260704       25
  /export/home0/igen/pgdata/pg_clog/0015              -2753028252512       27
  /export/home0/igen/pgdata/pg_clog/0015              -2753028277088       28
  /export/home0/igen/pgdata/pg_clog/0015              -2753028293472       37
FYI I pressed Ctrl-C within less than a second.
So to me this looks like multiple processes (different pids) are reading
the same page. (This was with about 600 users active.)
Apparently we do have a problem in that we keep re-reading the same
buffer address: the page is either not cached anywhere, or not being
found in any cache.
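To double-check that, a small variation on read.d (my sketch, not from
the original runs) that keys the clog reads by pid as well as pathname
should show the same file being pulled in by many different backends:

#!/usr/sbin/dtrace -s
/* Sketch only: count clog reads per (pid, pathname). Many distinct
   pids re-reading the same file would confirm that each backend is
   pulling the page in itself rather than finding it in a cache. */
syscall::read:entry
/execname=="postgres" &&
dirname(fds[arg0].fi_pathname)=="/export/home0/igen/pgdata/pg_clog"/
{
	@read[pid, fds[arg0].fi_pathname] = count();
}

tick-5sec
{
	exit(0);
}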
I reran the lock wait script on a couple of processes and did not see
CLogControlLock as a problem:
# ./83_lwlock_wait.d 14341

             Lock Id            Mode           Count
       WALInsertLock       Exclusive               1
       ProcArrayLock       Exclusive              16

             Lock Id   Combined Time (ns)
       WALInsertLock               383109
       ProcArrayLock            198866236

# ./83_lwlock_wait.d 14607

             Lock Id            Mode           Count
       WALInsertLock       Exclusive               2
       ProcArrayLock       Exclusive              15

             Lock Id   Combined Time (ns)
       WALInsertLock                55243
       ProcArrayLock             69700140
#
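83_lwlock_wait.d itself doesn't appear in the thread; a rough sketch of
what such a script could look like, assuming the 8.3-era userland probes
lwlock-startwait(lockid, mode) and lwlock-endwait (probe names differ
across PostgreSQL versions, so check your build), and printing numeric
lock ids where the real script evidently maps them to names like
WALInsertLock:

#!/usr/sbin/dtrace -s
/* Sketch of an lwlock wait tracer for one backend ($1 = pid).
   Assumes this build exposes lwlock-startwait/lwlock-endwait;
   arg0 is the lock id and arg1 the mode, both numeric here. */
postgresql$1:::lwlock-startwait
{
	self->ts = timestamp;
}

postgresql$1:::lwlock-endwait
/self->ts/
{
	@counts[arg0, arg1] = count();
	@times[arg0] = sum(timestamp - self->ts);
	self->ts = 0;
}

Run it as "./lwlock_wait.d <pid>" and press Ctrl-C to dump the aggregates.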
What information would help you find out why it is reading the same page
again?
-Jignesh
Jignesh K. Shah wrote:
I agree with Tom: somehow I think increasing NUM_CLOG_BUFFERS just
pushes the symptom out to a later point. I promise to look more into it
before making any recommendation to increase NUM_CLOG_BUFFERS.
Because though "iGen" showed improvements in that area by increasing
num_clog_buffers , EAStress had shown no improvements.. Plus the
reason I think this is not the problem in 8.3beta1 since the Lock
Output clearly does not show CLOGControlFile as to be the issue which
I had seen in earlier case. So I dont think that increasing
NUM_CLOG_BUFFERS will change thing here.
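For reference, NUM_CLOG_BUFFERS was a compile-time constant in that era,
so trying a different value means rebuilding; from memory (verify
against your own source tree) the definition was along these lines:

/* src/include/access/clog.h (8.3-era; value from memory) */
#define NUM_CLOG_BUFFERS	8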
Now, I don't understand the code well yet, but I see three hotspots, and
I am not sure whether they are related to each other (a measurement
sketch follows the list):
 * ProcArrayLock waits - causing waits, as reported by the
   83_lwlock_wait.d script
 * SimpleLruReadPage - causing read I/Os, as reported by
   iostat/rsnoop.d
 * GetSnapshotData - causing CPU utilization, as reported by hotuser
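One cheap way to put numbers on the last two hotspots (my sketch, not
from the mail; it uses the pid provider on an unstripped postgres
binary, where GetSnapshotData and SimpleLruReadPage are the actual
function names):

#!/usr/sbin/dtrace -s
/* Sketch: count calls into the suspected hot functions for the
   backend whose pid is passed as $1, over a 10-second window. */
pid$1::GetSnapshotData:entry,
pid$1::SimpleLruReadPage:entry
{
	@calls[probefunc] = count();
}

tick-10sec
{
	exit(0);
}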
But I will shut up and do more testing.
Regards,
Jignesh
Tom Lane wrote:
Josh Berkus <josh@xxxxxxxxxxxx> writes:
Actually, 32 made a significant difference as I recall ... do you
still have the figures for that, Jignesh?
I'd want to see a new set of test runs backing up any call for a change
in NUM_CLOG_BUFFERS --- we've changed enough stuff around this area that
benchmarks using code from a few months back shouldn't carry a lot of
weight.
regards, tom lane