Michael A. Peters wrote:
> I have absolutely no control over the source file.
>
> The source file is an xml file (er, sort of, it doesn't follow any
> particular DTD) and has a tag called VERBATIM_DATE in each record -
> looks to be required in their output as every record so far has it, but
> w/o a DTD hard to know - time of day, on the other hand, is not required
> and sometimes (usually) the tag is missing.
>
> Here's the beauty - VERBATIM_DATE in the same xml file uses multiple
> different formats. IE -
>
> 12 March 1945
> 14 Mar 1967
> Apr 1999
> 12-03-2005
> Before 1904
> Winter or Spring 1977
>
> etc.
>
> It does seem that if there is a day, the day is always first - but
> sometimes it has a space as a delimiter, - as delimiter, and sometimes
> it has both - IE
>
> 10-15 Dec 1934
> 12 March-03 April 1956
>
> What I'm trying to do is write a preg match for each case I come
> across - if it matches the preg, it then parses according to the pattern
> to get me an acceptable YYYY-MM-DD (not sure how I'll deal with the
> season case yet ... but I'm serious, that kind of thing is in there
> several times)
>
> To at least get started though, is there a wildcard defined that says
> match a month?
>
> IE
>
> /^([0-9]{2})[\s-](MONTH_MATCH)[\s-]([0-9]{4,4})$/
>
> where MONTH_MATCH is some special magic that matches Mar March Apr
> April etc. ?
>
> If you must know - it's data from a biology vertebrate museum. Thousands
> of records may match a given query. Most of them look fairly easily
> parsable and it does look like when a day is specified, it is always
> first and year is always last.
>
> The data is needed by me, so I'm planning on having the script die if it
> comes across a date I don't have a regex to parse before it does
> anything so I can add appropriate regex as necessary, but damn - you'd
> think a vertebrate museum would have cleaned up their DB somewhat.

My first shot would be to see how far I get with strtotime() or
date_create().
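Something along these lines is what I have in mind for that first pass.
It is only a rough sketch, not tested against your data:
first_pass_date() is a name I made up, and the sample values are taken
from your list. It assumes date_create() returns false for strings it
cannot parse, and it keeps those raw values aside for later inspection.

```php
<?php
// Hypothetical helper (not part of PHP): try the built-in parser first
// and return null on failure so the caller can set the raw value aside.
function first_pass_date(string $verbatim): ?string
{
    $date = date_create(trim($verbatim));
    if ($date === false) {
        return null; // the built-in parser gave up on this value
    }
    return $date->format('Y-m-d');
}

// Sample values copied from the question.
$samples = ['12 March 1945', '14 Mar 1967', 'Apr 1999', '12-03-2005', 'Before 1904'];

$parsed = [];      // raw value => YYYY-MM-DD
$unparsed = [];    // raw values the built-in parser rejected

foreach ($samples as $raw) {
    $iso = first_pass_date($raw);
    if ($iso === null) {
        $unparsed[] = $raw;
    } else {
        $parsed[$raw] = $iso;
    }
}

print_r($parsed);
print_r($unparsed);
```

Be aware that the built-in parser is lenient, so a value such as a day
range may come back as some date rather than as a failure. I would
spot-check the accepted values before trusting them.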
The rest looks like a job for the Mechanical Turk
(http://www.mturk.com/mturk).

For your specific query, you could use an alternation such as
(Jan|January|Feb|February|...), but that won't catch typos and
idiosyncrasies. You probably want to make the match case-insensitive
too.

I suspect you will end up with a bunch of records where the data cannot
be parsed sensibly. I would probably write the list of such records to
an exception file. Once you have a system that generates a manageable
number of exceptions, you can deal with those by hand.

As for your expectation of a museum: the reputation of "dusty old rooms
full of stuff" is not entirely unearned, so I wouldn't expect their
databases to be spotless!

-- 
Peter Ford                              phone: 01580 893333
Developer                               fax:   01580 893399
Justcroft International Ltd., Staplehurst, Kent

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php