On 20 March 2012 20:19, Steve Crawford <scrawford@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/20/2012 04:27 PM, Jim Green wrote:
>>
>> Greetings list!
>> I am pretty new to postgresql, coming from mysql.
>
> Welcome.
>
>> I have daily minute-level stock price data from 2005 on, each day with
>> columns timestamp, open, high, low, close, volume and a few more. Each
>> day's data is about 1.2 million rows. I want to import all the data into
>> postgresql and analyze it using R with the help of RPostgreSQL.
>>
>> Right now I have about 7000 tables, one per individual stock, and I use
>> perl to do the inserts; it's very slow.
>
> I'm not sure I understand - please expand and clarify.
>
> If you have 7000 stocks, 1.2 million rows/day is only 171
> observations/stock/day, or a little under 3 hours of data at 60 samples/hour.
> Are there holes in the data or am I completely missing the setup?

Hi: stocks are being delisted and added, so I don't have 7000 stocks every
day; 7000 is just a rough number. Also, a lot of them are not liquid, so
holes are possible.

> For the NYSE you have 52 weeks/year, 5 trading days/week (except for up to 8
> trading holidays) and 6.5 trading hours/day, giving 100,000 rows/stock/year
> give or take depending on the holiday schedule. Over your time range, that
> will be less than 1 million rows per stock.
>
> Without holes in the data, it seems you will be adding 2.7 million rows per
> trading day.

I spot-checked 0302's data; it has 1.2 million rows, and other dates would be
similar.

> Is perl doing individual record inserts? Short of reconnecting for each
> insert, this is the slowest option. Each insert is its own transaction and,
> unless you are using a raid card with writeback enabled (and hopefully with
> battery-backed cache if it is), you will hit some limitations imposed by your
> disk's rotational speed. If you can ensure the data is sorted by stock, you
> can start a transaction, write all the records for that stock, then commit
> the transaction, which should speed the operation.

I use Perl DBI and a prepared statement to insert into a table per symbol.

> Only because you are bulk adding historical data and can rebuild your
> database if it is destroyed, you could consider turning off fsync while you
> are importing historical data. Dropping indexes while you play catch-up can
> help as well.

I have already turned fsync off, but autovacuum uses a lot of I/O and my
8-core CPU is really busy.

>> I would like to use copy or another bulk loading tool to load the daily
>> raw gz data, but I would need to split the file into per-stock files
>> first before I do the bulk loading. I consider this a bit messy.
>>
>> I would like advice on the following idea:
>> store everything in one big table, partitioned by month (this gives a
>> reasonable number of partitions), and bulk load the daily file. My queries
>> would consist mostly of selects on a particular symbol on a particular day.
>
> If you already have daily files, you might want to work out the process of
> importing a day and apply that to your backlog. There are lots of options -
> you might try experimenting.
>
> Given the type of queries you plan, simple partitioning by time period as
> you described is a reasonable approach.
>
> You could import a day's data as a bulk copy then select from that table
> into the various individual stock tables. BTW, you have enough RAM that the
> daily bulk file will be cached and queries on each individual stock will
> probably be quite fast. But you will need to wrap things in a transaction
> or otherwise ensure that you can recover if things fail part-way through
> that distribution of data.
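Just to make sure I understand the staging idea, something like this is what
I would try (a rough sketch only - the table, column and file names here are
made up for illustration, not my real schema):

```sql
-- Rough sketch of the "bulk copy, then distribute" idea. Names and the file
-- path are placeholders. Server-side COPY needs superuser; otherwise psql's
-- \copy does the same thing from the client side.
BEGIN;

CREATE TEMP TABLE prices_staging (
    symbol  text,
    ts      timestamp,
    open    numeric,
    high    numeric,
    low     numeric,
    close   numeric,
    volume  bigint
);

-- Load the whole decompressed daily file in one shot (column order must
-- match the file).
COPY prices_staging FROM '/data/minute_2012-03-02.csv' WITH (FORMAT csv);

-- Fan the rows out to the per-stock tables.
INSERT INTO prices_aapl (ts, open, high, low, close, volume)
SELECT ts, open, high, low, close, volume
FROM prices_staging
WHERE symbol = 'AAPL';
-- ...one INSERT ... SELECT per symbol present in the file...

-- Everything succeeds or everything rolls back.
COMMIT;
```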
> You could partition your data by stock symbol and use a trigger on the
> parent to put the data into the correct table on insert.

I've read that thousands of partitions might be too many.

> I am unsure how fast this will run - try it and test. One benefit is that
> your daily import will fully succeed or fully fail. But you will need to
> update your partitioning and triggers to deal with new stocks. You can, of
> course, have the trigger derive the target table name from the incoming
> row. This eliminates the need to alter the trigger code, but it is still
> recommended to create the child tables in advance.
>
>> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>
> Have you started with basic tuning? It is unlikely that whatever stock
> PostgreSQL configuration you have installed is optimal. (What PG version
> and OS are you using?)

Yes I have - I posted it in my last reply. I run Debian squeeze 64-bit with
PG 9.1.3.

The alternatives look kind of complex to me. Right now my approach (Perl DBI
and prepared inserts) takes about 8-9 minutes to insert a day's data, so I
think I'll probably just stick with it and wait. The autovacuum processes do
a lot of I/O and make my PC unusable while I do the inserts, and turning
autovacuum off didn't help much because the workers are still launched to
prevent transaction ID wraparound.

Thanks!
Jim.

> Cheers,
> Steve
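For future reference, here is my rough understanding of what the
partition-by-month plan would look like on 9.1, which only has
inheritance-based partitioning plus a routing trigger of the kind Steve
described. The table and column names are invented for illustration:

```sql
-- Rough sketch only: names are illustrative, not my real schema.
CREATE TABLE prices (
    symbol  text      NOT NULL,
    ts      timestamp NOT NULL,
    open    numeric,
    high    numeric,
    low     numeric,
    close   numeric,
    volume  bigint
);

-- One child table per month; the CHECK constraint lets the planner skip
-- irrelevant months via constraint exclusion when queries filter on ts.
CREATE TABLE prices_2012_03 (
    CHECK (ts >= DATE '2012-03-01' AND ts < DATE '2012-04-01')
) INHERITS (prices);

-- My queries are mostly "one symbol on one day", so index the children on that.
CREATE INDEX ON prices_2012_03 (symbol, ts);

-- Routing trigger: build the child table name from the row's month, so a new
-- month only needs a new child table, not new trigger code.
CREATE OR REPLACE FUNCTION prices_insert_trigger() RETURNS trigger AS $$
BEGIN
    EXECUTE 'INSERT INTO prices_' || to_char(NEW.ts, 'YYYY_MM')
            || ' SELECT ($1).*' USING NEW;
    RETURN NULL;  -- the row went to the child, so skip the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER prices_partition_insert
    BEFORE INSERT ON prices
    FOR EACH ROW EXECUTE PROCEDURE prices_insert_trigger();
```

With this in place the daily file could be COPYed straight into the parent
and the trigger would route each row, though firing a trigger per row will be
slower than copying directly into the correct child table.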