Re: Check the byte sequence of a file to tell if it is UTF-8 without the BOM using PHP ?

Hi Adam,

I have proof that the XML advice does not work in real cases I have had.
We use XML files in our system, but when you edit an XML file in a text editor and add the UTF-8 XML declaration
<?xml version="1.0" encoding="UTF-8"?>

it DOES NOT guarantee that the text inside is actually encoded in UTF-8; in many cases it is in some other ISO-xxx encoding.

My question was about a function that scans the bytes of the file and decides WITHOUT relying on the BOM header.
I mean by checking the byte sequences in the file.

I claim that WITHOUT a BOM it might be impossible to be sure the file is UTF-8 encoded, since UTF-8 is a whole escape-sequence logic
that may convert one character into one, two, three or four bytes.
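
For illustration, the byte-sequence logic mentioned above can be checked mechanically. The following is a minimal sketch, not an existing PHP function: the name looks_like_utf8 is hypothetical, and the check is simplified (it verifies lead bytes and continuation bytes but does not reject every overlong form or surrogate code point). A true result only means the bytes are well formed as UTF-8; plain ASCII passes as well, so it cannot prove that the author intended UTF-8.

```php
<?php
/**
 * Hypothetical helper: returns true if $bytes is structurally valid UTF-8.
 * Simplified: some overlong 3/4-byte forms and surrogates are not rejected.
 */
function looks_like_utf8($bytes)
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $need = 0;                       // ASCII: one byte
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;                       // lead byte of a 2-byte sequence
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;                       // lead byte of a 3-byte sequence
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;                       // lead byte of a 4-byte sequence
        } else {
            return false;                    // stray continuation or invalid lead byte
        }
        if ($i + $need >= $len) {
            return false;                    // sequence is cut off at end of input
        }
        for ($k = 1; $k <= $need; $k++) {
            // every continuation byte must look like 10xxxxxx
            if ((ord($bytes[$i + $k]) & 0xC0) !== 0x80) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}
```

Text saved as ISO-8859-1 with accented letters usually fails this test, because a lone high byte such as 0xE9 is not followed by the continuation bytes UTF-8 requires.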

Any advice on whether I'm right about this, or on a smart file-scanning function that does it?

Eli

On 21/05/2011 20:03, Adam Richardson wrote:
On Sat, May 21, 2011 at 12:10 PM, Eli Orr (Office) <eli.orr@xxxxxxxxxxxx> wrote:


    Dear PHP Gurus,

    I have a debate about the following; please let me know what is
    true and what is false.

    I am using a PHP function, *is_UTF_8_file ($file_name)*, that I
    found as part of my PHP 5.3 installation.
    This function checks whether the file starts with the 3 UTF-8 BOM bytes.

    However, another guy told me that there is a way to detect whether
    a file is UTF-8 without having the BOM at the start of the file.
    To me it sounds impossible, since without this indication you
    have a stream of bytes for which you can never tell with 100%
    certainty whether it is UTF-8 or something else.

    Who is right here?
    If there is a magical function that can detect whether files
    without a BOM are UTF-8 or not, please share your knowledge,
    unless this is a "NULL" or impossible function, as I thought.


Here's a great write-up I've got bookmarked (he points out Windows Notepad automatically determines the encoding):
http://codesnipers.com/?q=node/68

    * If it's an XML file, the structure allows you to determine the
      encoding.
    * For other files, you can try to decode the contents as UTF-8
      and look for invalid byte sequences.
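
The second point above can be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: file_is_valid_utf8 is a made-up name, the whole file is read into memory, and mb_check_encoding is used when the mbstring extension is loaded, with preg_match and the /u modifier as a fallback. A true result means the bytes are valid UTF-8, not that no other encoding could have produced them.

```php
<?php
/**
 * Hypothetical helper: true if the file's bytes form valid UTF-8,
 * with or without a leading BOM. Returns false if the file is unreadable.
 */
function file_is_valid_utf8($path)
{
    $data = @file_get_contents($path);
    if ($data === false) {
        return false;                        // could not read the file
    }
    // Drop the optional 3-byte UTF-8 BOM (EF BB BF) before validating.
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($data, 'UTF-8');
    }
    // Fallback: with the /u modifier PCRE rejects invalid UTF-8 subjects.
    return preg_match('//u', $data) === 1;
}
```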


As far as a PHP function that already does this, I'm not aware of one, but you could make a system call to "file" if you're on Linux, as it tries to automatically determine the encoding:
http://linux.die.net/man/1/file
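
A rough sketch of such a system call from PHP might look like this. It assumes a Unix-like system where the "file" utility is installed and supports the --brief and --mime-encoding options, and that exec() is not disabled; guess_encoding_with_file is a hypothetical name. The result is a guess made by "file", not a guarantee.

```php
<?php
/**
 * Hypothetical helper: asks the "file" utility for its encoding guess.
 * Returns a string such as "utf-8" or "iso-8859-1", or null on failure.
 */
function guess_encoding_with_file($path)
{
    // escapeshellarg() protects against spaces and shell metacharacters.
    $cmd = 'file --brief --mime-encoding ' . escapeshellarg($path) . ' 2>/dev/null';
    $output = array();
    $status = 1;
    exec($cmd, $output, $status);
    if ($status !== 0 || empty($output)) {
        return null;                         // "file" missing or reported an error
    }
    return trim($output[0]);
}
```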

Adam

--
Nephtali:  A simple, flexible, fast, and security-focused PHP framework
http://nephtaliproject.com


--
Best Regards,

*Eli Orr*
CTO & Founder
*LogoDial Ltd.*
M:+972-54-7379604
O:+972-74-703-2034
F: +972-77-3379604

Plaut 10, Rehovot, Israel
Email: _Eli.Orr@xxxxxxxxxxxxx
Skype: _eliorr.com_
