Re: Handling illegal byte sequences in UTF-8 strings

"Richard Lynch" <ceo@xxxxxxxxx> · Sat, 22 Apr 2006 02:13:04 -0500 (CDT)

On Fri, April 21, 2006 7:16 pm, Matt Arnilo S. Baluyos (Mailing Lists)
wrote:
> We have recently upgraded our database to PostgreSQL 8.1.x which
> handles UTF-8 more strictly than previous versions. The new version
> will not allow illegal byte sequences when inserting data.
>
> This has caused some errors in our system which inputs data.
> Basically, what the system does is insert data which is copy-pasted
> from OpenOffice.org files. The content of the OpenOffice.org files are
> likewise pasted from various websites which may or may not be using
> UTF-8 encoding.
>
> After some research, I have looked at both iconv and mbstring (I might
> use iconv since it's there by default). But nonetheless, someone on
> the list may have a better way of handling this issue.
>
> What then would be the best way to handle illegal byte sequences
> before they are inserted into the database?

I guess the big question would be this:

Where do you intend to output these strings?

Are they going to end up in UTF-8 HTML output?

Or are they going to end up in Unicode (UTF-16+) documents?

Or are you stuck with Latin-1 HTML output for legacy reasons?

Going at it from the other side...

A *LOT* of MS Office (Word) users will end up copying and pasting
stuff that is NOT valid UTF-8 or Latin-1 at all.

Typically these are Word's "smart" quotes, dashes and ellipses, which
arrive as Windows-1252 bytes in the 0x80-0x9F range. Those bytes are
control codes in Latin-1 and are illegal as stand-alone bytes in UTF-8.

I suspect OpenOffice *might* be acting Word-compatible in this regard.

If you've got THOSE coming in, there are some functions in the User
Contributed Notes on the str_replace manual page that will let you
convert funky MS Word only characters into their closest moral
equivalent HTML Entity.
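
To give an idea of what that kind of replacement looks like, here is a
rough sketch. It is in Python rather than PHP purely for illustration,
and it assumes the text has already been decoded and that the
offenders are the usual curly quotes, dashes and ellipsis. The table
and the function name word_chars_to_entities are made up for this
example; they are not taken from the manual notes.

```python
# Hypothetical table: a few "smart" punctuation characters that Word
# tends to produce, mapped to the closest named HTML entity.
WORD_CHAR_ENTITIES = {
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # ellipsis
}


def word_chars_to_entities(text: str) -> str:
    """Replace each character in the table with its HTML entity."""
    for char, entity in WORD_CHAR_ENTITIES.items():
        text = text.replace(char, entity)
    return text
```

This only makes sense if the strings end up in HTML. For any other
kind of output you would want a different table.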

Or, if they ARE supposed to be valid UTF-8 characters, but there's a
bug in OpenOffice, well, obviously, you need a work-around TODAY, but
file a bug report too, so it can be fixed for tomorrow.
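
One possible shape for such a work-around, sketched in Python for
illustration only: try a strict UTF-8 decode first, and only if that
fails treat the bytes as some fallback encoding. The fallback of
Windows-1252 is a guess on my part, not something known about your
data, and to_utf8 is just a name made up for this example.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw re-encoded as valid UTF-8.

    Valid UTF-8 input passes through unchanged. Anything else is
    assumed (an assumption, not a fact) to be in the fallback
    encoding; bytes undefined there become U+FFFD.
    """
    try:
        text = raw.decode("utf-8")  # strict: raises on illegal sequences
    except UnicodeDecodeError:
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

The catch is that a wrong guess about the fallback encoding quietly
produces wrong characters, so log what was converted.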

I doubt that anybody can really advise you without seeing the actual
characters (byte for byte) you are receiving.
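
Something along these lines would get you that byte-for-byte view.
Again this is Python used only for illustration, and
find_invalid_utf8 is a made-up helper name. It leans on the strict
UTF-8 decoder to report where each illegal sequence starts and which
bytes it contains.

```python
def find_invalid_utf8(raw: bytes) -> list:
    """Return (offset, hex) pairs for each illegal UTF-8 sequence."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the input is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded
            start, end = pos + exc.start, pos + exc.end
            problems.append((start, raw[start:end].hex()))
            pos = end  # skip the bad bytes and keep scanning
    return problems


if __name__ == "__main__":
    # Word-style quotes as single bytes, which are not legal UTF-8
    for offset, hexbytes in find_invalid_utf8(b"ok \x93quoted\x94 text"):
        print(offset, hexbytes)
```

Posting the offsets and hex values from a real failing insert would
give people on the list something concrete to look at.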

And you may want to compare what the user is seeing in OpenOffice with
what you are getting and what output you want. Until you've defined
what they "see", what they give you, and what you want, you're pretty
much just guessing in the dark about what to do.
