hi Richard,

your email was hard to follow, and I don't have real answers for you, but maybe my simpleton's view of the situation will offer you some new avenues of thought.

Richard Lynch wrote:
> It's been 20+ years since I took a stats class...

20 years ago I was mostly riding a push bike ... and I've never taken a stats class as such (bear this in mind :-)

>
> I didn't enjoy that class, and doubt if I remember 1% of what was
> covered.

you'd be 1 up on me ;-)

> ...
>
> And the sheer number of functions in the stats package is making my
> head spin.
> ...
>
> Some fools have their PC clock set to, like, 1970 or whatever. So
> let's be generous and assume their CMOS battery has died, and they
> haven't had a chance to change it. Fine. Deal with it.
>
> Okay, so *NOW* the algorithm is to do this:
>
> Take the Date: header, or Sent: header if no Date: header -> $whatdate
>
> Parse the Received: headers for the MTA date-stamps -> $fromdates[]
>
> Compare the values in $fromdates array with $whatdate.
>
> If the variance is "too high", then ignore the $whatdate, and take
> the, errr, first?, average?, $fromdates[].

does it matter, so long as you're consistent in what you pick/use/calculate? I would tend to go for the oldest date in any given array of processed dates, as this would seem to be the closest to the likely actual send date.

>
> No, wait, maybe I should do a variance within the $fromdates in case
> some stupid MTA server has a bad clock?

I would start by setting out a few acceptable boundaries and 'knowns', for instance:

1. the first mail was sent no earlier than timestampX (so any timestamp encountered that is earlier than this is bogus.)

2. a maximum time an email could be expected to hang out at any given MTA whilst waiting to be moved on. (this could be used to drop an outer timestamp [oldest or newest] from the array of timestamps extracted from a mail when the gap between it and its 'neighbour' is greater than this agreed maximum period.)

>
> Any advice?

1. don't forget to normalize all the dates found in a given mail's array of dates into UTC (if that is even an issue) before doing any actual processing/analysis of the collected dates.

2. I would tar the dates found in the Date: and/or Sent: headers with the same brush as any dates found in the Received: headers - your explanation suggests that no one header can be construed as being more reliable than another.

3. er, there is no 3, unless you consider 'buy a bigger brain' real advice ;-)

>
> Anybody got a good "variance" function to do what I'm trying to do?
>
> Am I on the entirely wrong path here?

dunno - but it's another typical Lynch problem that was just too interesting for me to let slide :-) please do keep us posted as to your progress!

> Sheesh!
>
> We may just ignore any obviously wrong dates, and process those by
> hand...

indeed, anything that is blatantly 'dodgy' with regard to dates is probably easier (and more accurate) to process by hand than to create some wizzo algo for - it's a matter of getting the number of 'dodgy' mails down to an acceptable level, of course.

>

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
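
P.S. the boundaries and advice above could be sketched roughly as follows. It's in Python rather than PHP only to keep it short. The two constants (EARLIEST_PLAUSIBLE, MAX_HOP_DELAY), their values and all the function names are placeholders I made up, not anything from Richard's code. Treat it as a starting point under those assumptions, not a finished solution.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed 'knowns' - both values are invented for illustration only.
EARLIEST_PLAUSIBLE = datetime(1995, 1, 1, tzinfo=timezone.utc)  # "timestampX"
MAX_HOP_DELAY = timedelta(days=2)  # longest wait we accept at a single MTA


def to_utc(raw):
    """Parse an RFC 2822 style date string and normalise it to UTC.

    Returns None when the string cannot be parsed at all.
    """
    try:
        dt = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        # No usable zone information: assume the stamp is already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def plausible_dates(raw_dates):
    """Return the sorted UTC dates that survive both sanity checks.

    Date:/Sent: and Received: stamps are all treated alike here.
    """
    parsed = (to_utc(raw) for raw in raw_dates)
    dates = sorted(d for d in parsed if d is not None and d >= EARLIEST_PLAUSIBLE)

    # Drop the oldest stamp while it sits too far from its neighbour.
    while len(dates) > 1 and dates[1] - dates[0] > MAX_HOP_DELAY:
        dates.pop(0)
    # Likewise drop the newest stamp while it sits too far from its neighbour.
    while len(dates) > 1 and dates[-1] - dates[-2] > MAX_HOP_DELAY:
        dates.pop()
    return dates


def best_guess(raw_dates):
    """Oldest surviving date, or None if nothing plausible is left."""
    dates = plausible_dates(raw_dates)
    return dates[0] if dates else None
```

You would feed best_guess() every date string pulled from one mail. Note that when only two stamps survive and they are far apart, this cannot tell which one is wrong (it simply keeps the newer), so those mails, and any that come back as None, are good candidates for the 'process by hand' pile.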