Ow Mun Heng wrote:
> Data which runs in the vicinity of a few million a week.
>
> What are the methods which will effectively provide the
> min/max/average/count/stdev of the weekly sample size based on different
> products/software mix etc.
>
> and still be able to answer correctly, what's the average of data_1 over
> the pass 2 months?
So, are you viewing the problem as one of avoiding having to store all
this raw data permanently while still keeping the statistical summary
results available without recalculating them each time?
>
> I can't just take the average of the 8 weekly averages.
>
> eg:
> wk avg data_1
> w1 - 2
> ...average of past 2 months = ave(w1-w8) which is statistically wrong.
One approach would be to save the count of values as well as the
average. Then your eight-week average is calculated as a weighted
average, i.e., each week's average is weighted by the corresponding
count of data values:
CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric
);
Then, for instance, the eight-week average is computed as
SELECT
SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS eight_week_avg
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();
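The arithmetic behind that query can be checked outside the database. A
minimal Python sketch (illustration only, not from the original post; the
sample values are made up) showing that the count-weighted average of the
weekly averages reproduces the true average over all the raw values:

```python
# Made-up raw data, grouped by week.
weeks = [
    [2.0, 4.0, 6.0],           # week 1
    [1.0, 3.0],                # week 2
    [5.0, 5.0, 5.0, 9.0],      # week 3
]

# Per-week summary rows, as stored in summary_data: (count, average).
summary = [(len(w), sum(w) / len(w)) for w in weeks]

# Weighted average: SUM(count * avg) / SUM(count), as in the SQL.
weighted_avg = sum(n * a for n, a in summary) / sum(n for n, _ in summary)

# True average over every raw value, for comparison.
all_values = [x for w in weeks for x in w]
true_avg = sum(all_values) / len(all_values)

print(weighted_avg, true_avg)  # the two agree
```

Note that a plain average of the three weekly averages, (4 + 2 + 6) / 3,
gives a different (wrong) answer, which is exactly the trap the original
question describes.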
>
> (getting min/max/count isn't much of an issue. Stdev is the main issue I
> believe)
You probably need to store each of those weekly summary statistic values
too, at least for min and max, i.e.,
CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric,
min_of_values numeric,
max_of_values numeric
);
Then, for instance, overall basic statistics are computed as
SELECT
SUM(num_of_values) AS total_count,
SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS overall_avg,
MIN(min_of_values) AS overall_min,
MAX(max_of_values) AS overall_max
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();
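The roll-up logic is the same regardless of language; here is a Python
sketch (illustration only, with made-up sample data) mirroring that query:
count and average combine by summing, while min and max of the weekly
extremes give the overall extremes directly.

```python
# Made-up raw data, grouped by week.
weeks = [
    [2.0, 4.0, 6.0],
    [1.0, 3.0],
    [5.0, 5.0, 5.0, 9.0],
]

# One summary row per week: (count, avg, min, max), as in summary_data.
rows = [(len(w), sum(w) / len(w), min(w), max(w)) for w in weeks]

total_count = sum(n for n, _, _, _ in rows)
overall_avg = sum(n * a for n, a, _, _ in rows) / total_count
overall_min = min(lo for _, _, lo, _ in rows)
overall_max = max(hi for _, _, _, hi in rows)

print(total_count, overall_avg, overall_min, overall_max)
```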
Extending this design to include the variance is more complicated. While
you can compute the average for the past eight weeks as a weighted
average of the eight separate weekly averages, the weekly standard
deviations alone are not enough to reconstruct the eight-week standard
deviation -- you would also need to store something like each week's sum
of squares, or else keep the raw data itself.
A different possibility would be to maintain a running average and
variance (rather than windowing those statistics over a sliding
eight-week period), and then take a weekly snapshot of the running
values to use for trending.
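One standard way to maintain such running statistics is Welford's online
algorithm. The sketch below (an illustration with made-up data, not code
from the thread; the snapshot method is my own naming) updates the mean
and variance one value at a time, so no raw history needs to be kept:

```python
class RunningStats:
    """Online mean/variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance: n - 1 in the denominator.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def snapshot(self):
        # What a weekly trending row might store.
        return (self.n, self.mean, self.variance())

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)

n, mean, var = stats.snapshot()
print(n, mean, var)
```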
>
> One such instance I've read about is..
>
> 1. Calculate sum of square of each sale
> ...
> 8. Stdev will be the square root of step 7
>
> The results are valid (verified with actual data) but I don't understand
> the logic. All the Statistical books I've read marked stdev as sqrt
> (sum(x - ave(x))^2 / (n - 1). The formula is very different, hence the
> confusion.
I think you copied the std dev formula incorrectly, but the eight-step
algorithm can be shown to reach the same result after a little
analytical manipulation. The advantage of the eight-step procedure is
that it supports an on-the-fly calculation of mean and variance, i.e.,
you can maintain a "running average" and update the statistics as you
collect more data, without having to keep the entire detail data
history, as a direct implementation of the definition would require.
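The manipulation rests on the algebraic identity
sum((x - avg)^2) = sum(x^2) - (sum(x))^2 / n, so only three running
totals -- n, sum(x), and sum(x^2) -- are needed. A Python sketch
(illustration only, with made-up data) checking that the shortcut matches
the textbook definition:

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Three running totals, each updatable one value at a time.
n = len(data)
s = sum(data)                    # sum of values
ss = sum(x * x for x in data)    # sum of squares of values

# Shortcut: needs only n, s, ss -- no raw history.
stdev_shortcut = math.sqrt((ss - s * s / n) / (n - 1))

# Textbook definition: needs every raw value.
avg = s / n
stdev_definition = math.sqrt(sum((x - avg) ** 2 for x in data) / (n - 1))

print(stdev_shortcut, stdev_definition)  # the two agree
```

One caveat worth knowing: for data with a large mean and small spread,
the shortcut can lose precision to cancellation, which is why Welford-style
updates are often preferred in practice.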