i18n maybe?

I have a table like this:
artist_id | artistname  | artistname_alpha
1         | The Doors   |
2         | The The     |
3         | 100 Monkeys |
4         | 3Ä¢Â?Â¢16   |

That last artistname is not in ASCII/English...  Dunno what your email
client is showing you, but it's:

the digit 3
capital A with umlauts
US cents sign
capital A with caret
question mark
capital A with caret
US cents sign
the digit 1
the digit 6

THAT ought to get through any email client/mta okay. :-)

Now, my goal is to fill in artistname_alpha with things such as:
Doors, The
The, The
one hundred monkeys
3Ä¢Â?Â¢16 (???)

I've written a nifty function for this:

require_once 'Numbers/Words.php'; // PEAR Numbers_Words

function alpha($string) {
  //$string = utf8_decode($string);

  // Spell out dollar amounts first, then any remaining runs of digits.
  $string = preg_replace_callback('/\$([0-9.]+)/', function ($m) {
    return Numbers_Words::toCurrency($m[1]);
  }, $string);
  $string = preg_replace_callback('/[0-9]+/', function ($m) {
    return Numbers_Words::toWords($m[0]);
  }, $string);

  // Move a leading article to the end: "The Doors" becomes "Doors, The".
  if (preg_match('/^(The|An|A) (.+)$/is', $string, $m)) {
    return $m[2] . ', ' . $m[1];
  }
  return $string;
}

Now, the tricky part is that I don't really know what
'3Ä¢Â?Â¢16' is.

It looks like it might be UTF-8, but utf8_decode() had no effect on
it, which is why I've commented that out in the function.
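
One way to check the UTF-8 theory is to look at the raw bytes rather than at what a client renders. The sketch below is only an illustration: the byte sequence in $raw is an assumption loosely based on the characters listed above, not data taken from the real feed, and it needs the mbstring extension. If the name really is UTF-8 that got encoded twice, a single decode pass leaves bytes that are still valid multi-byte UTF-8, which might be why one utf8_decode() call looked like it did nothing.

```php
<?php
// Hypothetical input: this byte sequence is an assumption, not real feed data.
$raw = "3\xC3\xA2\xC2\x80\xC2\xA216";

// Dump the raw bytes so the encoding can be judged by eye.
$rawHex = bin2hex($raw);

// One UTF-8 -> ISO-8859-1 pass, which is what utf8_decode() does.
$once = mb_convert_encoding($raw, 'ISO-8859-1', 'UTF-8');
$onceHex = bin2hex($once);

// Still non-ASCII and still valid UTF-8 after one pass: probably encoded twice.
$stillUtf8 = preg_match('/[\x80-\xFF]/', $once) === 1
    && mb_check_encoding($once, 'UTF-8');

echo $rawHex, "\n", $onceHex, "\n", var_export($stillUtf8, true), "\n";
```

Under the assumed bytes, the second hex string contains e2 80 a2, the UTF-8 encoding of a bullet (U+2022). Running the same dump on the real column value would show whether that guess holds.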

So my function currently converts it to:
'threeÄ¢Â?Â¢sixteen'

That ain't right.

So, does anybody who understands this i18n stuff want to clue me in
the right direction?...

Things you should know:

I'm not trying to provide support for anything but English here,
unless it's trivial to do so.

The table has 150,000 rows.

I have no real control over fancy MySQL settings, as it's a $20 shared
host deal.

Every day, at 6 am, I get a new file of this data, and run through it
with a script that does an UPDATE or INSERT.  REPLACE is not suitable
due to the primary key field size of the source data.  Anyway, I haven't
even checked if the function as-is will be too slow, but whatever I do to
fix the i18n issue can't have too much overhead, as it will be called
150,000 times every morning at 6 am.
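
A quick way to find out whether the overhead matters is to time 150,000 calls directly. This is a rough sketch: time_calls() and $standIn are made-up names for illustration, and the stand-in closure only mimics the article-moving step, so the real alpha() would have to be swapped in to get a meaningful number.

```php
<?php
// Time $calls invocations of $fn on a sample string; returns seconds elapsed.
function time_calls(callable $fn, $sample, $calls = 150000) {
  $start = microtime(true);
  for ($i = 0; $i < $calls; $i++) {
    $fn($sample);
  }
  return microtime(true) - $start;
}

// Stand-in for alpha(); replace it with the real function to measure that.
$standIn = function ($s) {
  return preg_replace('/^(The) (.+)$/i', '$2, $1', $s);
};

$seconds = time_calls($standIn, 'The Doors');
printf("%d calls took %.3f seconds\n", 150000, $seconds);
```

Timing a few different sample names (plain, with digits, with the odd bytes) would give a better picture than a single string.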

If it helps, here is what my data-source dumps out when he encounters
this band name:
http://cdbaby.com/cd/316live

Here is the band's web-site:
http://316live.com/

And, here, possibly, is HTML source for what somebody copied/pasted
into the FORM to fill in the band name:

3&middot;16

So, possibly, this is not i18n at all, and just somebody really really
really silly copying and pasting an HTML entity 'middot' from their
website into a form input and expecting it to render...

Would '&middot;' output by a browser turn into 'Ä¢Â?Â¢' ???

If so, what can I do about it?
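
If English-only sorting is all that is needed, one cheap option might be to decode any HTML entities and then collapse every run of non-ASCII bytes to a single space before the name goes through alpha(). This is a sketch under assumptions, not a tested fix: ascii_sort_key() is a made-up helper name, and it assumes that losing the separator character is acceptable for a sort key.

```php
<?php
// Hypothetical helper: reduce a name to printable ASCII for use as a sort key.
function ascii_sort_key($name) {
  // Turn entities such as &middot; into real characters (UTF-8 bytes).
  $name = html_entity_decode($name, ENT_QUOTES, 'UTF-8');
  // Replace each run of bytes outside printable ASCII with one space.
  $name = preg_replace('/[^\x20-\x7E]+/', ' ', $name);
  // Collapse repeated spaces and trim the ends.
  return trim(preg_replace('/ {2,}/', ' ', $name));
}

echo ascii_sort_key('3&middot;16'), "\n";
echo ascii_sort_key("3\xC3\xA2\xC2\x80\xC2\xA216"), "\n"; // assumed bytes, not real data
echo ascii_sort_key('The Doors'), "\n";
```

Both odd inputs come out as "3 16", which the number-to-words step would then spell out. The work is two regex passes per name, so the cost over 150,000 rows should be small, though that is worth measuring rather than assuming.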

-- 
Like Music?
http://l-i-e.com/artists.htm
