On Wed, Mar 21, 2012 at 12:57 PM, Andy Colson <andy@xxxxxxxxxxxxxxx> wrote:
Don't get me wrong. I DO use R for all kinds of stats and graphics stuff. I just tested avg() right now to test the claim made above.
I did some garbage collection and reran the aggregate() on half the data and this time it only took 80 seconds. Much faster, though still an order of magnitude slower than running it inside Postgres. The 700,000 row data frame in an an "unclean" R session swamped the 6 GB RAM on my desktop. Maybe the original test was unfair. On the other hand, it does directly speak to John's comment that "avg() in the database is going to be a lot faster than copying the data into memory for an application to process."
Interesting:
COPY statement (same table, 700,000 rows): 7761 ms
SFTP file copy from Postgres server to local machine: 21 seconds (didn't time it and my FTP client didn't report it)
read.csv (R command): 62 seconds
Total ~ 91 seconds
R commands:
library(RPostgreSQL)
<declare connection object>
dfSomeTable = dbGetQuery(conn, "SELECT * FROM some_table")
~ 85 seconds
So whether you export to CSV first or use RPostgreSQL, it's roughly equivalent, though I would tend to favor RPostgreSQL just because it's one step instead of three.
On the other hand:
sql = "SELECT state, avg(col1), avg(col2), [...] avg(col231) FROM some_table GROUP BY state;"
dfGrouped = dbGetQuery(conn, sql)
~ 5 seconds
So aggregating inside Posgres, then the smaller data set via RPostgreSQL wins on two counts, speed of aggregation and speed of transfer.
This is run in R. Data frame (dfSomeTable in this code snippet) is already in R, so no trips to Postgres are taking up time in this process. The real reason for the inflated time is pointed out above.
--Lee
--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/
On 3/21/2012 11:45 AM, Lee Hachadoorian wrote:
<mailto:pierce@xxxxxxxxxxxx>> wrote:
> avg() in the database is going to be a lot faster than copying
the data into
> memory for an application to process.
As an example, I ran average on a 700,000 row table with 231 census
variables reported by state. Running average on all 231 columns grouping
by state inside Postgres beat running it by R by a factor of 130 NOT
COUNTING an additional minute or so to pull the table from Postgres to
R. To be fair, these numbers are not strictly comparable, because it's
running on different hardware. But the setup is not atypical: Postgres
is running on a heavy hitting server while R is running on my desktop.
SELECT state, avg(col1), avg(col2), [...] avg(col231)
FROM some_table
GROUP BY state;
5741 ms
aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
TRUE)
754746 ms
--Lee
avg() might not be a good example though. If you just want average, great, but what about when you want to do lots of different stats'y functions that PG does not have? I'll bet R (not pl-R) can read a text file (a csv probably) pretty fast.
Don't get me wrong. I DO use R for all kinds of stats and graphics stuff. I just tested avg() right now to test the claim made above.
I did some garbage collection and reran the aggregate() on half the data and this time it only took 80 seconds. Much faster, though still an order of magnitude slower than running it inside Postgres. The 700,000 row data frame in an an "unclean" R session swamped the 6 GB RAM on my desktop. Maybe the original test was unfair. On the other hand, it does directly speak to John's comment that "avg() in the database is going to be a lot faster than copying the data into memory for an application to process."
If you were doing a really complex analysis I bet dumping a dataset out to csv first and then running R scripts over it would be fast and useful.
Interesting:
COPY statement (same table, 700,000 rows): 7761 ms
SFTP file copy from Postgres server to local machine: 21 seconds (didn't time it and my FTP client didn't report it)
read.csv (R command): 62 seconds
Total ~ 91 seconds
R commands:
library(RPostgreSQL)
<declare connection object>
dfSomeTable = dbGetQuery(conn, "SELECT * FROM some_table")
~ 85 seconds
So whether you export to CSV first or use RPostgreSQL, it's roughly equivalent, though I would tend to favor RPostgreSQL just because it's one step instead of three.
On the other hand:
sql = "SELECT state, avg(col1), avg(col2), [...] avg(col231) FROM some_table GROUP BY state;"
dfGrouped = dbGetQuery(conn, sql)
~ 5 seconds
So aggregating inside Posgres, then the smaller data set via RPostgreSQL wins on two counts, speed of aggregation and speed of transfer.
Is this a Postgres stored proc (pl-R)? Or is that R itself? If its plR then I wonder if its stepping through the recordset twice. Depending on how the function is written, I can see the function firing off a query, PG collects the entire recordset, and hands it back to the function, which then iterates thru it again. (vs. having the function called for each row as PG steps thru the recordset only once).
> aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
> TRUE)
This is run in R. Data frame (dfSomeTable in this code snippet) is already in R, so no trips to Postgres are taking up time in this process. The real reason for the inflated time is pointed out above.
--Lee
--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/