On 03/20/2012 04:27 PM, Jim Green wrote:
Greetings list!
I am pretty new to postgresql from mysql
Welcome.
I have daily minute stock price data from 2005 on and each day with
columns timestamp, open,high,low,close,volume and a few more. each
day's data is about 1.2million rows. I want import all the data to
postgresql and analyze using R with the help of Rpostgresql.
right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow.
I'm not sure I understand - please expand and clarify.
If you have 7000 stocks, 1.2 million rows/day is only 171
observations/stock/day or a little under 3-hours of data at 60
samples/hour. Are there holes in the data or am I completely missing the
setup?
For the NYSE you have 52-weeks/year, 5 trading-days/week (except for up
to 8 trading holidays), 6.5 trading-hours/day giving 100,000
rows/stock/year give or take depending on holiday schedule. Over your
time range, that will be less than 1-million rows per stock.
Without holes in the data, it seems you will be adding 2.7 million rows
per trading day.
Is perl doing individual record inserts? Short of reconnecting for each
insert, this is the slowest option. Each insert is its own transaction
and, unless you are using a raid card with writeback enabled (and
hopefully with battery-backed cache if it is) you will hit some
limitations imposed by your disk's rotational speed. If you can ensure
the data is sorted by stock, you can start a transaction, write all the
records for that stock, then commit the transaction which should speed
the operation.
Only because you are bulk adding historical data and can rebuild your
database if it is destroyed, you could consider turning off fsync while
you are importing historical data. Dropping indexes while you play
catch-up can help as well.
I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.
I would seek advise on the following idea:
store everything in a big table, partition by month(this gives a
reasonable number of partitions) and do bulk loading on the daily
file. my queries would consist mostly select on a particular symbol on
a particular day.
If you already have daily files, you might want to work out the process
of importing a day and apply that to your backlog. There are lots of
options - you might try experimenting.
Given the type of queries you plan, simple partitioning by time period
as you described is a reasonable approach.
You could import a day's data as a bulk copy then select from that table
into the various individual stock tables. BTW, you have plenty of RAM
that that daily bulk file will be cached and queries on each individual
stock will probably be quite fast. But you will need to wrap things in a
transaction or otherwise ensure that you can recover if things fail
part-way through that distribution of data.
You could partition your data by stock symbol and use a trigger on the
parent to put the data into the correct table on insert. I am unsure how
fast this will run - try it and test. One benefit is that your daily
import will fully succeed or fully fail. But you will need to update
your partitioning and triggers to deal with new stocks. You can, of
course, have the trigger choose which table to use based on the table
name. This eliminates the need to alter the trigger code but it is still
recommended to make the child-table in advance.
my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
Have you started with basic tuning. It is unlikely that whatever stock
PostgreSQL you have installed is suboptimal. (What PG version and OS are
you using?)
Cheers,
Steve
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general