On 20 March 2012 20:19, Steve Crawford <scrawford@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/20/2012 04:27 PM, Jim Green wrote:
>>
>> Greetings list!
>> I am pretty new to postgresql, coming from mysql.
>
> Welcome.
>
>> I have daily minute-level stock price data from 2005 on, each day with
>> columns timestamp, open, high, low, close, volume and a few more. Each
>> day's data is about 1.2 million rows. I want to import all the data into
>> postgresql and analyze it using R with the help of RPostgreSQL.
>>
>> Right now I have about 7000 tables, one per individual stock, and I use
>> perl to do the inserts; it's very slow.
>
> I'm not sure I understand - please expand and clarify.
>
> If you have 7000 stocks, 1.2 million rows/day is only 171
> observations/stock/day, or a little under 3 hours of data at 60 samples/hour.
> Are there holes in the data or am I completely missing the setup?

Hi: stocks are being delisted and added, so I don't have 7000 stocks every
day; 7000 is just a rough number. Also, a lot of them are not liquid, so
holes are possible.

> For the NYSE you have 52 weeks/year, 5 trading days/week (except for up to 8
> trading holidays) and 6.5 trading hours/day, giving 100,000 rows/stock/year
> give or take depending on the holiday schedule. Over your time range, that
> will be less than 1 million rows per stock.
>
> Without holes in the data, it seems you will be adding 2.7 million rows per
> trading day.

I spot-checked 0302's data; it has 1.2 million rows, and other dates would be
similar.

> Is perl doing individual record inserts? Short of reconnecting for each
> insert, this is the slowest option. Each insert is its own transaction and,
> unless you are using a raid card with writeback enabled (and hopefully with
> battery-backed cache if it is), you will hit some limitations imposed by your
> disk's rotational speed. If you can ensure the data is sorted by stock, you
> can start a transaction, write all the records for that stock, then commit
> the transaction, which should speed the operation.

I use Perl DBI and a prepared statement to insert into a table per symbol.

> Only because you are bulk adding historical data and can rebuild your
> database if it is destroyed, you could consider turning off fsync while you
> are importing historical data. Dropping indexes while you play catch-up can
> help as well.

I have already turned fsync off, but autovacuum uses a lot of I/O and my
8-core CPU is really busy.

>> I would like to use copy or another bulk loading tool to load the daily
>> raw gz data, but I would need to split the file into per-stock files
>> first before I do the bulk loading. I consider this a bit messy.
>>
>> I would like advice on the following idea:
>> store everything in one big table, partitioned by month (this gives a
>> reasonable number of partitions), and bulk load the daily file. My queries
>> would consist mostly of selects on a particular symbol on a particular day.
>
> If you already have daily files, you might want to work out the process of
> importing a day and apply that to your backlog. There are lots of options -
> you might try experimenting.
>
> Given the type of queries you plan, simple partitioning by time period as
> you described is a reasonable approach.
>
> You could import a day's data as a bulk copy then select from that table
> into the various individual stock tables. BTW, you have enough RAM that the
> daily bulk file will be cached and queries on each individual stock will
> probably be quite fast. But you will need to wrap things in a transaction
> or otherwise ensure that you can recover if things fail part-way through
> that distribution of data.
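Just to make sure I understand the staging idea, something like this is what
I would try (a rough sketch only - the table, column and file names here are
made up for illustration, not my real schema):

```sql
-- Rough sketch of the "bulk copy, then distribute" idea. Names and the file
-- path are placeholders. Server-side COPY needs superuser; otherwise psql's
-- \copy does the same thing from the client side.
BEGIN;

CREATE TEMP TABLE prices_staging (
    symbol  text,
    ts      timestamp,
    open    numeric,
    high    numeric,
    low     numeric,
    close   numeric,
    volume  bigint
);

-- Load the whole decompressed daily file in one shot (column order must
-- match the file).
COPY prices_staging FROM '/data/minute_2012-03-02.csv' WITH (FORMAT csv);

-- Fan the rows out to the per-stock tables.
INSERT INTO prices_aapl (ts, open, high, low, close, volume)
SELECT ts, open, high, low, close, volume
FROM prices_staging
WHERE symbol = 'AAPL';
-- ...one INSERT ... SELECT per symbol present in the file...

-- Everything succeeds or everything rolls back.
COMMIT;
```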
> You could partition your data by stock symbol and use a trigger on the
> parent to put the data into the correct table on insert.

I've read that thousands of partitions might be too many.

> I am unsure how fast this will run - try it and test. One benefit is that
> your daily import will fully succeed or fully fail. But you will need to
> update your partitioning and triggers to deal with new stocks. You can, of
> course, have the trigger derive the target table name from the incoming
> row. This eliminates the need to alter the trigger code, but it is still
> recommended to create the child tables in advance.
>
>> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>
> Have you started with basic tuning? It is unlikely that whatever stock
> PostgreSQL configuration you have installed is optimal. (What PG version
> and OS are you using?)

Yes I have - I posted it in my last reply. I run Debian squeeze 64-bit with
PG 9.1.3.

The alternatives look kind of complex to me. Right now my approach (Perl DBI
and prepared inserts) takes about 8-9 minutes to insert a day's data, so I
think I'll probably just stick with it and wait. The autovacuum processes do
a lot of I/O and make my PC unusable while I do the inserts, and turning
autovacuum off didn't help much because the workers are still launched to
prevent transaction ID wraparound.

Thanks!
Jim.

> Cheers,
> Steve
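For future reference, here is my rough understanding of what the
partition-by-month plan would look like on 9.1, which only has
inheritance-based partitioning plus a routing trigger of the kind Steve
described. The table and column names are invented for illustration:

```sql
-- Rough sketch only: names are illustrative, not my real schema.
CREATE TABLE prices (
    symbol  text      NOT NULL,
    ts      timestamp NOT NULL,
    open    numeric,
    high    numeric,
    low     numeric,
    close   numeric,
    volume  bigint
);

-- One child table per month; the CHECK constraint lets the planner skip
-- irrelevant months via constraint exclusion when queries filter on ts.
CREATE TABLE prices_2012_03 (
    CHECK (ts >= DATE '2012-03-01' AND ts < DATE '2012-04-01')
) INHERITS (prices);

-- My queries are mostly "one symbol on one day", so index the children on that.
CREATE INDEX ON prices_2012_03 (symbol, ts);

-- Routing trigger: build the child table name from the row's month, so a new
-- month only needs a new child table, not new trigger code.
CREATE OR REPLACE FUNCTION prices_insert_trigger() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO prices_' || to_char(NEW.ts, 'YYYY_MM')
            || ' SELECT ($1).*' USING NEW;
    RETURN NULL;  -- the row went to the child, so skip the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER prices_partition_insert
    BEFORE INSERT ON prices
    FOR EACH ROW EXECUTE PROCEDURE prices_insert_trigger();
```

With this in place the daily file could be COPYed straight into the parent
and the trigger would route each row, though firing a trigger per row will be
slower than copying directly into the correct child table.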