Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

Pietro Pugni <pietro.pugni@xxxxxxxxx> · Fri, 3 Apr 2015 15:43:02 +0200

Hi didier,
thank you for your time.
I forgot to display before the output of free. I’ve looked into it before and I found difficult to fully understand if there was something wrong.

Before starting Postgres:
             total       used       free     shared    buffers     cached
Mem:          125G         9G       115G        15M       362M       8.1G
-/+ buffers/cache:       1.5G       124G
Swap:         127G         0B       127G

Here’s an example of free output when queries B_1 and B_2 are running (they’re part of the same transaction). Generally values remains the same. For what I can understand, RAM isn’t used at all (there’s a lot of unused RAM).

             total       used       free     shared    buffers     cached
Mem:          125G        13G       112G       3.1G       362M        11G
-/+ buffers/cache:       1.9G       123G
Swap:         127G         0B       127G

With Postgres running after transaction has been executed:
             total       used       free     shared    buffers     cached
Mem:          125G        13G       112G       3.1G       362M        11G
-/+ buffers/cache:       1.5G       124G
Swap:         127G         0B       127G

there's also huge page
/sys/kernel/mm/transparent_hugepage/enabled
can you try to disable it?
It was enabled and after disabling it nothing changed: time execution is practically the same (131s for the same transaction tested in previous emails, which is composed by queries B_1 and B_2).

Also test on the dell:
select tmp.cf, tmp.dt from grep_studi.tmp;
and
select tmp.cf, tmp.dt from grep_studi.tmp order by tmp.cf;
in Query B_2
the sort is 9 time slower on the dell, you have to find why…
Here’s the output for the two queries:

select tmp.cf, tmp.dt from grep_studi.tmp;
"Seq Scan on grep_studi.tmp  (cost=0.00..11007.74 rows=1346868 width=72) (actual time=0.082..618.709 rows=2951191 loops=1)"
"  Output: cf, dt"
"  Buffers: shared hit=512 read=7802 dirtied=8314"
"Planning time: 0.087 ms"
"Execution time: 745.505 ms"

select tmp.cf, tmp.dt from grep_studi.tmp;
"Sort  (cost=38431.55..39104.99 rows=1346868 width=72) (actual time=3146.548..3306.179 rows=2951191 loops=1)"
"  Output: cf, dt"
"  Sort Key: tmp.cf"
"  Sort Method: quicksort  Memory: 328866kB"
"  Buffers: shared hit=8317"
"  ->  Seq Scan on grep_studi.tmp  (cost=0.00..11007.74 rows=1346868 width=72) (actual time=0.012..373.346 rows=2951191 loops=1)"
"        Output: cf, dt"
"        Buffers: shared hit=8314"
"Planning time: 0.034 ms"
"Execution time: 3459.065 ms"

32 GB for buffers is too high for the queries in your test but it
doesn't matter.
I’ve set shared_buffers to be 1/4 of the total RAM. I’ve changed kernel values to accomodate this value. Lowering to smaller values doesn’t improve the transaction results. Here’s the results with 1 run for each level of shared_buffers:

32GB:	131s
16GB:	132s
8GB:	133s
4GB:	132s
2GB:	143s
1GB:	148s
512MB:	183s
256MB:	192s

Probably I can keep 4GB but I make use of several partitions with tens of millions of records each. This is why I keep shared_buffers high. My applications is also similar to a DWH solution with one user. Like you said, big values of shared_buffers shouldn’t be a issue.

I’ve done some tests with sysbench on Dell T420 and MacMini.

T420 - RAM READ - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: read
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 (3643025.32 ops/sec)

16384.00 MB transferred (3643025.32 MB/sec)

Test execution summary:
    total time:                          0.0045s
    total number of events:              16384
    total time taken by event execution: 0.0031
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.02ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   0.0031/0.00

MacMini - RAM READ - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored

Threads started!

Operations performed: 16384 ( 5484.50 ops/sec)

16384.00 MB transferred (5484.50 MB/sec)

General statistics:
    total time:                          2.9873s
    total number of events:              16384
    total time taken by event execution: 2.9836s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.24ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9836/0.00

T420 - RAM WRITE - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 ( 8298.97 ops/sec)

16384.00 MB transferred (8298.97 MB/sec)

Test execution summary:
    total time:                          1.9742s
    total number of events:              16384
    total time taken by event execution: 1.9723
    per-request statistics:
         min:                                  0.12ms
         avg:                                  0.12ms
         max:                                  0.25ms
         approx.  95 percentile:               0.12ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   1.9723/0.00

MacMini - RAM WRITE - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored

Threads started!

Operations performed: 16384 ( 5472.90 ops/sec)

16384.00 MB transferred (5472.90 MB/sec)

General statistics:
    total time:                          2.9937s
    total number of events:              16384
    total time taken by event execution: 2.9890s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.32ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9890/0.00

T420 - CPU
sh-4.3# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000

Test execution summary:
    total time:                          13.0683s
    total number of events:              10000
    total time taken by event execution: 13.0674
    per-request statistics:
         min:                                  1.30ms
         avg:                                  1.31ms
         max:                                  1.44ms
         approx.  95 percentile:               1.35ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   13.0674/0.00

MacMini - CPU
server:sysbench Pietro$ ./sysbench --test=cpu run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored

Primer numbers limit: 10000

Threads started!

General statistics:
    total time:                          11.5728s
    total number of events:              10000
    total time taken by event execution: 11.5703s
    response time:
         min:                                  1.15ms
         avg:                                  1.16ms
         max:                                  2.17ms
         approx.  95 percentile:               1.17ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   11.5703/0.00

I’ve done these tests because someone else on this discussion asked me to investigate on memory bandwidth and because I found this interesting article about Intel Xeon vs Intel i5 with different Postgres versions: http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-pgbench
Hope this helps to better understand the problem.

Thank you very much.
Best regards,
 Pietro