Re: preg_match and dates

Michael A. Peters wrote:
> I have absolutely no control over the source file.
> 
> The source file is an xml file (er, sort of, it doesn't follow any
> particular DTD) and has a tag called VERBATIM_DATE in each record -
> looks to be required in their output as every record so far has it, but
> w/o a DTD hard to know - time of day, on the other hand, is not required
> and sometimes (usually) the tag is missing.
> 
> Here's the beauty - VERBATIM_DATE in the same xml file uses multiple
> different formats. IE -
> 
> 12 March 1945
> 14 Mar 1967
> Apr 1999
> 12-03-2005
> Before 1904
> Winter or Spring 1977
> 
> etc.
> 
> It does seem that if there is a day, the day is always first - but
> sometimes it has a space as a delimiter, - as delimiter, and sometimes
> it has both - IE
> 
> 10-15 Dec 1934
> 12 March-03 April 1956
> 
> What I'm trying to do is write a preg_match pattern for each case I come
> across - if it matches the preg, it then parses according to the pattern
> to get me an acceptable YYYY-MM-DD (not sure how I'll deal with the
> season case yet ... but I'm serious, that kind of thing in there several
> times)
> 
> To at least get started though, is there a wildcard defined that says
> match a month?
> 
> IE
> 
> /^([0-9]{2})[\s-](MONTH_MATCH)[\s-]([0-9]{4})$/
> 
> where MONTH_MATCH is some special magic that matches Mar March Apr April etc. ?
> 
> If you must know - it's data from a biology vertebrate museum. Thousands
> of records may match a given query. Most of them look fairly easily
> parsable and it does look like when a day is specified, it is always
> first and year is always last.
> 
> The data is needed by me, so I'm planning on having the script die if it
> comes across a date I don't have a regex to parse before it does
> anything so I can add appropriate regex as necessary, but damn - you'd
> think a vertebrate museum would have cleaned up their DB somewhat.


My first shot would be to see how far you get with strtotime() or date_create().
Whatever those can't parse looks like a job for the Mechanical Turk
(http://www.mturk.com/mturk).
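
To make that first step concrete, here is a minimal sketch that runs a value
through date_create() and reports whether PHP could make sense of it. The
function name tryNativeParse() is my own invention, not anything built in, and
the sample strings are taken from the original message. Be aware that the
native parser may quietly fill in missing parts (a day for a month-only date,
for instance), so a non-null result still deserves a sanity check.

```php
<?php
// Hypothetical helper: returns YYYY-MM-DD if PHP's native parser accepts the
// string, or null if date_create() rejects it.
function tryNativeParse(string $verbatim): ?string
{
    // date_create() returns false (no exception) when it cannot parse.
    $date = date_create(trim($verbatim));
    return $date === false ? null : $date->format('Y-m-d');
}

// Try a few of the formats quoted above and show what comes back.
foreach (['12 March 1945', '14 Mar 1967', 'Before 1904'] as $sample) {
    $result = tryNativeParse($sample);
    echo $sample, ' => ', $result ?? 'not parsed', "\n";
}
```

Anything that comes back as null would then go on to the hand-written
patterns.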

For your specific question: there is no built-in "match a month" wildcard, but
you could use an alternation such as (Jan|January|Feb|February|...) in place of
MONTH_MATCH. That won't catch typos and idiosyncrasies, though. You probably
want to make it case-insensitive too (the i modifier).
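
A rough sketch of the alternation idea, building on the pattern from the
original message. The MONTH_NUMBERS table and the parseDayMonthYear() name are
assumptions of mine, and I have loosened the day to one or two digits, which
may or may not suit the real data. It only handles the "day month year" shape;
ranges, seasons and "Before 1904" would each need their own pattern.

```php
<?php
// Hypothetical lookup table (not from the original post): lower-case month
// names and three-letter abbreviations mapped to month numbers.
const MONTH_NUMBERS = [
    'jan' => 1, 'january' => 1, 'feb' => 2, 'february' => 2,
    'mar' => 3, 'march' => 3, 'apr' => 4, 'april' => 4,
    'may' => 5, 'jun' => 6, 'june' => 6, 'jul' => 7, 'july' => 7,
    'aug' => 8, 'august' => 8, 'sep' => 9, 'september' => 9,
    'oct' => 10, 'october' => 10, 'nov' => 11, 'november' => 11,
    'dec' => 12, 'december' => 12,
];

// Parses "DD Month YYYY" with a space or hyphen as delimiter.
// Returns YYYY-MM-DD, or null if the string does not fit this one pattern.
function parseDayMonthYear(string $verbatim): ?string
{
    // Build the (jan|january|feb|...) alternation from the table keys.
    $months = implode('|', array_keys(MONTH_NUMBERS));
    // The trailing "i" makes the month match case-insensitive.
    $pattern = '/^([0-9]{1,2})[\s-](' . $months . ')[\s-]([0-9]{4})$/i';

    if (preg_match($pattern, trim($verbatim), $m) !== 1) {
        return null;
    }

    $day   = (int) $m[1];
    $month = MONTH_NUMBERS[strtolower($m[2])];
    $year  = (int) $m[3];

    // Reject impossible dates such as 31 Feb.
    if (!checkdate($month, $day, $year)) {
        return null;
    }
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
```

Keeping the names in a table means the same data drives both the regex and the
conversion to a month number, so adding a spelling the museum uses (say "Sept")
is a one-line change.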

I suspect you will end up with a bunch of records where the date cannot be
parsed sensibly. I would probably write those records to an exception file
rather than have the script die. Once you have a system that generates a
manageable number of exceptions, you can deal with those by hand.
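
Something along these lines is what I have in mind for the exception file.
Everything here is illustrative: the function names, the tab-separated layout
and the idea of keying records by some ID are my assumptions, and
fallbackParser() is only a stand-in for whatever chain of patterns you end up
with.

```php
<?php
// Stand-in parser for the sketch: returns YYYY-MM-DD or null.
// In practice this would try each hand-written pattern in turn.
function fallbackParser(string $verbatim): ?string
{
    $date = date_create(trim($verbatim));
    return $date === false ? null : $date->format('Y-m-d');
}

// Runs every record through $parser. Parsed dates are returned keyed by
// record ID; records that could not be parsed are written to $exceptionFile
// as "ID<TAB>verbatim date", one per line.
function writeDateExceptions(array $records, callable $parser, string $exceptionFile): array
{
    $handle = fopen($exceptionFile, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $exceptionFile for writing");
    }

    $parsed = [];
    foreach ($records as $id => $verbatim) {
        $iso = $parser($verbatim);
        if ($iso === null) {
            // Unparsed: log it for manual review instead of aborting.
            fwrite($handle, $id . "\t" . $verbatim . "\n");
        } else {
            $parsed[$id] = $iso;
        }
    }

    fclose($handle);
    return $parsed;
}
```

You would call it as writeDateExceptions($records, 'fallbackParser',
'date_exceptions.txt') and then watch the exception file shrink as you add
patterns.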

As for your expectation of a museum: the reputation of "dusty old rooms full of
stuff" is not entirely un-earned, so I wouldn't expect their databases to be
spotless!

-- 
Peter Ford                              phone: 01580 893333
Developer                               fax:   01580 893399
Justcroft International Ltd., Staplehurst, Kent
