--- Andreas Strasser <kontakt@xxxxxxxxxxxxxxxxxxxx> wrote:

> Hello,
>
> I'm currently designing an application that will retrieve economic data
> (mainly time series) from different sources and distribute it to clients.
> It is supposed to manage around 20,000 different series with differing
> numbers of observations (some have only a few dozen observations, others
> several thousand), and I'm now faced with the decision of where and how
> to store the data.

If you really have such a disparity among your series, then it is a
mistake to blend them into a single table. You need to spend more time
analyzing what the data mean. If one data set comprises the daily close
prices of a suite of stocks or mutual funds, then it makes sense to
include all such series in a single table; but if some of the series are
daily close prices and others are monthly averages, then it is a mistake
to combine them in one table, and two or more tables would be warranted.
Likewise, if the data come from different data feed vendors, you have to
think very carefully about whether the data can logically be combined at
all.

> So far, I've come up with 3 possible solutions:
>
> 1) Storing the observations in one big table with fields for the series,
>    the position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
>    series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
>    putting a reference to them in the database

I don't much like any of the above. When I have had to process data for
financial consultants, I applied a few simple filters to ensure the data
were clean (e.g. tests to ensure the data had not been corrupted during
transmission, proper handling of missing values, &c.), and then
bulk-loaded the data into a suite of tables designed specifically to
match the vendor's definitions of what the data mean. Only then did we
apply specific analyses, designed in consultation with the financial
consultant's specialists: the folk best qualified to help us understand
the data, and especially how it can be combined in a meaningful way. If
the data are stored in a suite of well-defined tables, subsequent
analyses are much more easily designed, implemented and executed.

I do not know whether PostgreSQL, or any other RDBMS, includes the
ability to call on software such as "R" to do specific statistical
analysis. If I had to do some time series analysis, I would do it in a
client application that retrieves the appropriate data from the database
and either does the analysis in custom code I have written (usually in
C++, as some of my favourite analyses have not made it into commonly
available open source or commercial statistical software) or invokes the
appropriate functions from statistical software I have at my disposal.
The strategy I describe above makes the SQL required for much of this
dirt simple.

HTH

Ted
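P.S. As a rough sketch only (the table and column names below are made
up for illustration, not taken from any real feed), the kind of
per-vendor, per-frequency layout I mean might look something like this:

    -- One table per vendor and observation frequency, matching that
    -- vendor's definition of the data, instead of one big generic table.
    CREATE TABLE acme_daily_close (
        series_id   integer       NOT NULL,  -- which instrument
        obs_date    date          NOT NULL,  -- trading day
        close_price numeric(12,4) NOT NULL,  -- vendor-supplied close
        PRIMARY KEY (series_id, obs_date)
    );

    CREATE TABLE acme_monthly_average (
        series_id   integer       NOT NULL,
        obs_month   date          NOT NULL,  -- first day of the month
        avg_value   numeric(12,4) NOT NULL,  -- vendor-supplied average
        PRIMARY KEY (series_id, obs_month)
    );

    -- Cleaned files can then be bulk loaded, e.g. with COPY:
    -- COPY acme_daily_close FROM '/path/to/cleaned_daily.csv' WITH CSV;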
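And to give a feel for what I mean by "dirt simple": pulling one series
back out for a client-side analysis is then just a query along the lines
of (again, the identifiers are only placeholders):

    SELECT obs_date, close_price
      FROM acme_daily_close
     WHERE series_id = 42
       AND obs_date BETWEEN '2006-01-01' AND '2006-12-31'
     ORDER BY obs_date;

The client code (C++, R, or whatever you prefer) then only has to walk
an ordered result set; none of the frequency-mixing logic leaks into the
analysis layer.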