Re: Question about caching on full table scans

"Albe Laurenz" <laurenz.albe@xxxxxxxxxx> · Thu, 30 Aug 2012 11:31:24 +0200

Markus Innerebner wrote:
> I am doing some runtime experiments in my implementation, which is
computing multi-modal range queries
> for a query point (if you want to know details check the website:
www.isochrones.inf.unibz.it).
> The network is explored using Dijkstra Shortest Path algorithm that
starts from the query point and
> starts to explore all connected vertices until the time is exceeded.
> The network is stored on a postgres (postgis) database consisting of
vertices and edges.
> relation: edge(id int, source int, target int, length double, segment
geometry,..)
> 
> I have different implementations how the network is loaded in main
memory:
> approach 1: loads initially the entire edge table (full table scan) in
main memory and then starts to
> expand the network and doing some computation.
> approach 2: loads only the adjacent edges of the current expanded
vertex
> approach 3: loads junks using the euclidean distance upper bound
> 
> I have different datasets: 6000 tuples (small), 4,000,000 tuples
(large)
> 
> I repeat each experiment at least 10 times.
> When observing the runtime I realized following:
> - in the first iteration approach 1 takes long time, and its runtime
starts to perform better after
> each iteration:
>     e.g. with large dataset
> 	- iteration 1:   60.0s
> 	- iteration 2:   40.7s
> 	- iteration 3:   40,s
> 	- iteration 4:   39.7s
> 	- iteration 5:   39.5s
> 	- iteration 6:   39.3s
> 	- iteration 7:   40.0s
> 	- iteration 8:   34.8s
> 	- iteration 9:   39.1s
> 	- iteration 10: 38.0s
> 
> In the other approaches I do not see that big difference.
> 
> I know that postgres (and OS) is caching that dataset. But is there a
way to force the database to
> remove that values from the cache?
> I also tried to perform after each iteration a scan on a dummy table
(executing it at least 10 times
> to force the optimized to keep that dummy data in main memory).
> But  I do not see any difference.
> 
> I thing the comparison is not right fair, if the caching in the main
memory approach brings that big
> advantage.
> 
> What can you as experts suggest me?

In your approach 1 to 3, what do you mean with "load into main memory"?
Do you
a) make sure that the data you talk about are in the PostgreSQL buffer
cache
or
b) retrieve the data from PostgreSQL and store it somewhere in your
application?

To clear PostgreSQL's cache, restart the server.
That should be a fast operation.
Since version 8.3, PostgreSQL is smart enough not to evict the
whole cache for a large sequential scan.

To flush the filesystem cache (from Linux 2.6.16 on), use
sync; echo 3 > /proc/sys/vm/drop_caches

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance