Search Postgresql Archives

Re: Q:Aggregrating Weekly Production Data. How do you do it?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ow Mun Heng wrote:
> Data which runs in the vicinity of a few million a week.
>
> What are the methods which will effectively provide the
> min/max/average/count/stdev of the weekly sample size based on different
> products/software mix etc.
>
> and still be able to answer correctly, what's the average of data_1 over
> the pass 2 months?


So, are you viewing the problem as one of figuring out how to avoid having to store all this raw data permanently but still have the statistical summary value results available without having to recalculate each time?


>
> I can't just take the average of an 8 averages of each week)
>
> eg:
> wk   avg data_1
> w1 - 2
> ...average of past 2 months = ave(w1-w8) which is statistically wrong.


One approach would be to to save the count of values as well as the average. Then your eight-week average is calculated as a weighted average, i.e., each weeks average is weighted by the corresponding count of data values:

CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric
);

Then, for instance, the eight-week average is computed as

SELECT
   AVG(num_of_values * avg_of_values)/SUM(num_of_values) AS eight_week_avg
FROM summary_data
WHERE summary_date BETWEEN (eight weeks ago) AND now();



>
> (getting min/max/count isn't much of an issue. Stdev is the main issue I
> believe)

You probably need to store each of those weekly summary statistic values too, at least for min and max, i.e.,

CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric,
min_of_values numeric,
max_of_values numeric
);

Then, for instance, overall basic statistics are computed as

SELECT
   COUNT(num_of_values),
   AVG(num_of_values * avg_of_values),
   MIN(min_of_values),
   MAX(max_of_values)
FROM summary_data
WHERE summary_date BETWEEN (eight weeks ago) AND now();


Extending this design to include the variance is a more complicated. While you can compute the average for the the past eight weeks using a weighted average of each of the separate eight weeks, I think you actually need the raw data values for the whole eight weeks to figure the standard deviation -- I don't readily see how you could without it. A different possibility would be to maintain a running average and variance (rather than windowing those statistics for the sliding eight-week period), and then taking a weekly snap shot of the running values to use for trending.

>
> One such instance I've read about is..
>
> 1. Calculate sum of square of each sale
> ...
> 8. Stdev will be the square root of step 7
>
> The results are valid (verified with actual data) but I don't understand
> the logic. All the Statistical books I've read marked stdev as sqrt
> (sum(x - ave(x))^2 / (n - 1). The formula is very different, hence the
> confusion.

I think you copied the std dev formula incorrectly, but the eight step algorithm can be shown to get you to the same point, after a little analytical manipulation. The advantage to one over the other is that the eight-step procedure can be used to implement an on-the-fly calculation of mean and variance, i.e., you can maintain a "running average" and update the statistics as you collect more data and not have to maintain the entire detail data set history, as would be required by an attempt to implement the definition directly.






---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux