Re: Variance Function

hi Richard,

your email was hard to follow, and I don't have real answers for
you but maybe my simpleton's view of the situation might offer
you new avenues of thought to consider.

Richard Lynch wrote:
> It's been 20+ years since I took a stats class...

20 years ago I was mostly riding a push bike ... and I've never
taken a stats class as such (bear this in mind :-)

> 
> I didn't enjoy that class, and doubt if I remember 1% of what was
> covered.

you'd be 1 up on me ;-)

> 


...

> 
> And the sheer number of functions in the stats package is making my
> head spin.
> 

...

> 
> Some fools have their PC clock set to, like, 1970 or whatever.  So
> let's be generous and assume their CMOS battery has died, and they
> haven't had a chance to change it.  Fine.  Deal with it.
> 
> Okay, so *NOW* the algorithm is to do this:
> 
> Take the Date: header, or Sent: header if no Date: header -> $whatdate
> 
> Parse the Received: headers for the MTA date-stamps -> $fromdates[]
> 
> Compare the values in $fromdates array with $whatdate.
> 
> If the variance is "too high", then ignore the $whatdate, and take
> the, errr, first?, average?, $fromdates[].

does it matter, so long as you're consistent in what you pick/use/calculate?

I would tend to go for the oldest date in any given array of processed dates
as this would seem to be the closest to the likely actual send date.
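
a rough sketch of what I mean by 'oldest', assuming the dates have already been
parsed into Unix timestamps. oldest_timestamp() is a name I made up for
illustration and the sample values are invented:

```php
<?php
// Hypothetical helper: return the oldest value from a list of Unix
// timestamps, or null when the list is empty.
function oldest_timestamp(array $timestamps)
{
    // min() needs at least one element, so guard the empty case.
    if (count($timestamps) === 0) {
        return null;
    }
    return min($timestamps);
}

// Invented sample values standing in for the parsed $fromdates.
$fromdates = array(1136214300, 1136214245, 1136214290);
$sent = oldest_timestamp($fromdates);
```

whichever rule you settle on (oldest, newest, average), wrapping it in one
function like this makes it easy to apply the same rule to every mail.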

> 
> No, wait, maybe I should do a variance within the $fromdates in case
> some stupid MTA server has a bad clock?

I would start by setting out a few acceptable boundaries and 'knowns'
for instance:

1. the first mail was sent no earlier than timestampX
	(so any timestamp encountered that is earlier than this is bogus.)	
2. a maximum time an email could be expected to hang out at any given MTA whilst
waiting to be moved on.
	(could be used to drop an outer timestamp [oldest or newest] from a given array of
	timestamps extracted from a mail when the difference to its 'neighbour' is
	greater than this agreed maximum period.)
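
the two boundaries could be sketched roughly as below. this is only an
illustration under my own assumptions: plausible_timestamps(), $earliest and
$maxGap are names I invented, and the timestamps are assumed to be Unix
timestamps already:

```php
<?php
// Hypothetical helper applying the two boundaries to a list of timestamps.
//   $earliest - boundary 1: nothing older than this can be genuine
//   $maxGap   - boundary 2: longest believable wait at one MTA, in seconds
function plausible_timestamps(array $timestamps, $earliest, $maxGap)
{
    // Boundary 1: discard anything older than the known floor.
    $kept = array();
    foreach ($timestamps as $ts) {
        if ($ts >= $earliest) {
            $kept[] = $ts;
        }
    }
    sort($kept);

    // Boundary 2: drop the oldest entry while it sits further than
    // $maxGap from its neighbour (only while more than two remain).
    while (count($kept) > 2 && $kept[1] - $kept[0] > $maxGap) {
        array_shift($kept);
    }

    // Same treatment for the newest entry.
    while (count($kept) > 2
        && $kept[count($kept) - 1] - $kept[count($kept) - 2] > $maxGap) {
        array_pop($kept);
    }

    return $kept;
}
```

the trimming stops once only two timestamps are left, because at that point
there is no way to tell which of the pair is the bogus one - a pair that still
disagrees is a candidate for checking by hand.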

> 
> Any advice?

1. don't forget to normalize all the dates found in a given mail's array of dates
into UTC (if that is even an issue) before doing any actual processing/analysis of
the collected dates.

2. I would tar the dates found in the Date: and/or Sent: headers with the same
brush as any dates found in the Received: headers - your explanation suggests that no one
header could be construed as being more reliable than another.

3. er there is no 3, unless you consider 'buy a bigger brain' real advice ;-)
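
for points 1 and 2, something along these lines might do. it's a sketch only:
header_dates_to_timestamps() is a name I invented, the sample strings are made
up, and it assumes the date part has already been cut out of each header (for
Received: that is the bit after the last ';'):

```php
<?php
// Hypothetical helper: turn raw header date strings into Unix timestamps.
// A Unix timestamp counts seconds from 1970-01-01 00:00:00 UTC, so once
// strtotime() has applied the offset in the string, zones stop mattering.
function header_dates_to_timestamps(array $rawDates)
{
    $timestamps = array();
    foreach ($rawDates as $raw) {
        $ts = strtotime($raw);   // honours offsets such as "-0500"
        if ($ts !== false) {     // silently skip anything unparseable
            $timestamps[] = $ts;
        }
    }
    return $timestamps;
}

// Date: and Received: values all go into the same pool.
$all = header_dates_to_timestamps(array(
    'Mon, 02 Jan 2006 10:04:05 -0500',
    'Mon, 02 Jan 2006 15:04:09 +0000',
    'Not Good',                  // unparseable on purpose
));

foreach ($all as $ts) {
    echo gmdate('Y-m-d H:i:s', $ts), " UTC\n";
}
```

the two valid samples come out four seconds apart on the same UTC afternoon
even though their offsets differ, which is the whole point of normalizing
before comparing anything.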

> 
> Anybody got a good "variance" function to do what I'm trying to do?
> 
> Am I on the entirely wrong path here?

dunno - but it's another typical Lynch problem that was just too interesting
for me to let slide :-) please do keep us posted as to your progress!

> Sheesh!
> 
> We may just ignore any obviously wrong dates, and process those by
> hand...

indeed anything that is blatantly 'dodgy' with regard to dates is probably easier
(and more accurate) to process by hand than it is to create some wizzo algo for
it - it's a matter of getting the number of 'dodgy' ones down to an acceptable level, of course.

> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php