Re: 10 TB database

On Mon, 15 Jun 2009, Artur wrote:

> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and endless tweaking to get data loading every day with appropriate corrections, because that's just the first round of "fun" on a job like this.

> The problem is that we expect to have more than 250 GB of data every month. This data would be in two tables, with about 50,000,000 new rows every month. We want to have access to all the data, mostly for generating user-requested (aggregated) reports. We would have about 10TB of data in three years. Do you think it is possible to build this with PostgreSQL, and do you have any idea how to start? :)

You start by figuring out what sort of business model is going to justify this very expensive adventure in today's market where buyers of financial market products are pretty rare, but that's probably not the question you wanted an answer to.

You can certainly build a server capable of handling this job with PostgreSQL here in 2009. Get 8 cores, a stack of 24 1TB disks, and a RAID card with a write cache, and you'll have a big enough system to handle the job. Basic database design isn't too terribly difficult either. Stock data is trivial to partition into manageable pieces at the database level (each month can be its own 250GB partition), and any system capable of holding that much data is going to have a giant stack of drives spreading out the disk I/O too.
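To make the partitioning part concrete, here's a rough sketch of what the monthly partitions might look like, built the way you do it in PostgreSQL today: inheritance plus CHECK constraints. The tickdata, symbol, and price names come from the example query further down; the other column names are just placeholders:

    -- parent table that queries get written against
    CREATE TABLE tickdata (
        symbol     text      NOT NULL,
        trade_time timestamp NOT NULL,  -- placeholder column name
        price      numeric   NOT NULL,
        volume     bigint               -- placeholder column name
    );

    -- one child table per month, each holding that month's ~250GB
    CREATE TABLE tickdata_2009_06 (
        CHECK (trade_time >= DATE '2009-06-01'
           AND trade_time <  DATE '2009-07-01')
    ) INHERITS (tickdata);

    CREATE INDEX tickdata_2009_06_symbol_idx
        ON tickdata_2009_06 (symbol, trade_time);

    -- with constraint_exclusion turned on, queries that filter on
    -- trade_time only scan the partitions whose CHECK constraint
    -- can possibly match

One nice side effect is that getting rid of old data later is just a DROP TABLE on a child partition instead of a giant DELETE.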

The first level of problem you'll run into is how to keep up with loading data every day. The main way to get bulk data into PostgreSQL, COPY, isn't particularly fast, and you'll be hard pressed to keep up with 250GB of new data every month unless you write a custom data loader that keeps multiple cores busy during the load. Commercial databases have better solutions to this problem, either in the base product or easily available from third-party sources.
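Just to sketch what the loader side ends up looking like (the file path and partition name here are made up), each loader session COPYs its own slice of the incoming files straight into the right monthly partition, and you run several of those sessions at once to keep the cores busy:

    -- one loader session's share of the work; several of these run
    -- in parallel against the same partition
    COPY tickdata_2009_06 (symbol, trade_time, price, volume)
        FROM '/data/incoming/trades-2009-06-15.csv' WITH CSV;

    -- server-side COPY FROM a file needs superuser; psql's \copy is
    -- the client-side equivalent if the loader runs somewhere else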

The much, much bigger problem here is how exactly you're going to provide a user interface to this data. You can't just give people access to the whole thing and let them run queries; the first person who executes something like "select symbol,avg(price) from tickdata group by symbol" because they want to see the average price of some stock over its lifetime is going to kill the whole server. You really need to generate the aggregated reports ahead of time, using an approach like materialized views, and then only let people grab those. It's possible to manually create materialized views in PostgreSQL, but that will be yet another bit of custom development here.
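A minimal version of that pre-aggregation, assuming what people actually want out of the reports is something like daily per-symbol averages (the summary table and its columns are made up for illustration), is a summary table the nightly load job appends to:

    -- hand-rolled materialized view: reports hit this, never tickdata
    CREATE TABLE daily_symbol_summary (
        trade_date date    NOT NULL,
        symbol     text    NOT NULL,
        avg_price  numeric NOT NULL,
        trades     bigint  NOT NULL,
        PRIMARY KEY (trade_date, symbol)
    );

    -- run by the load job after each day's data is in
    INSERT INTO daily_symbol_summary
    SELECT trade_time::date, symbol, avg(price), count(*)
      FROM tickdata_2009_06
     WHERE trade_time >= DATE '2009-06-15'
       AND trade_time <  DATE '2009-06-16'
     GROUP BY trade_time::date, symbol;

User queries against that see one row per symbol per day instead of millions of raw rows, which is the difference between a report that comes back in seconds and one that kills the server.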

The third level of problem is how you scale the app up if you're actually successful. It's hard enough to get 250GB of new data loaded into a single database every month and 10TB of it stored somewhere; doing the job across a replicated set of servers, so you can spread the queries out, is even more "fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a trading calendar that includes holidays and spots where the market is only open part of the day, give up now; that's the hardest issue specific to this particular type of application to get right, and a lot of people don't realize that early enough in the design process to properly plan for it.
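For what it's worth, one common way to handle that is an explicit trading-calendar table that every report joins against, rather than assuming every weekday is a full session; roughly along these lines (all names made up):

    CREATE TABLE trading_calendar (
        trade_date    date PRIMARY KEY,
        session_open  time    NOT NULL,
        session_close time    NOT NULL,
        is_half_day   boolean NOT NULL DEFAULT false
    );

    -- reports only aggregate over days the market was actually open
    SELECT s.symbol, avg(s.avg_price)
      FROM daily_symbol_summary s
      JOIN trading_calendar c ON c.trade_date = s.trade_date
     WHERE c.trade_date BETWEEN DATE '2009-01-01' AND DATE '2009-06-30'
     GROUP BY s.symbol;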

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

