On 4 November 2010 15:31, robert mena <robert.mena@xxxxxxxxx> wrote:
> Hi Richard,
> I am not top posting.  I am just explaining other symptoms that may point
> to the cause, since they may be the same and this is happening with the
> same file.  I'll try to get approval to release the file.
> Meanwhile, in your opinion what would be the safest way to read and explode
> (using \t) a text file encoded in UTF-8?
>
> On Thu, Nov 4, 2010 at 11:22 AM, Richard Quadling <rquadling@xxxxxxxxx>
> wrote:
>>
>> On 4 November 2010 15:11, robert mena <robert.mena@xxxxxxxxx> wrote:
>> > Hi,
>> > The core of the code is simply
>> > $fp = fopen('file.tab', 'rb');
>> > while(!feof($fp))
>> > {
>> >    $line = fgets($fp);
>> >    $data = explode("\t", $line);
>> >    ...
>> > }
>> > So I try to manipulate the $data[X].  For example $data[0] is supposed
>> > to be numeric so I do $n = (int) $data[0]
>> > One other thing: the second column should contain a string.  If I
>> > check the string visually it is correct, but an
>> > if( $data[1] == 'stringX') is false even if in the file I can see
>> > this (and print those two).
>> > I even did an md5 of both and they are different.
>> > It seems to be an encoding issue.  Is it safe to use explode with utf8
>> > strings?
>> > I even tried this code but no match found (just to replace the explode)
>> > $str = "abc æååã   efg";
>> > $results = array();
>> > preg_match_all("/\t/u", $str, $results);
>> > var_dump($results[0]);
>> > On Thu, Nov 4, 2010 at 6:33 AM, Richard Quadling <rquadling@xxxxxxxxx>
>> > wrote:
>> >>
>> >> On 3 November 2010 21:42, Alexander Holodny
>> >> <alexander.holodny@xxxxxxxxx> wrote:
>> >> > To exclude unexpected behavior in case of wrongly formatted input
>> >> > data, it would be much better to use such a type-casting method:
>> >> > intval(ltrim(trim($inStr), '0'))
>> >> >
>> >> > 2010/11/3, Nicholas Kell <nick@xxxxxxxxxxxxxxxx>:
>> >> >>
>> >> >> On Nov 3, 2010, at 4:22 PM, robert mena wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I have a text file (utf-8 encoded) which contains lines with
>> >> >>> numbers and text separated by \t.  I need to convert the numbers
>> >> >>> that contain 0 (at left) to integers.
>> >> >>>
>> >> >>> For some reason one line that contains 00000002 is cast to 0
>> >> >>> instead of 2.
>> >> >>> Below is the output of the cast (int) $field[0], where I get
>> >> >>> this from exploding each line.
>> >> >>>
>> >> >>> 0 ï00000002
>> >> >>> 4 00000004
>> >> >>
>> >> >>
>> >> >>
>> >> >> My first guess is wondering how you are grabbing the strings from
>> >> >> the file. Seems to me like it would just drop the zeros on the
>> >> >> left by default. Are you including the \t in the string by
>> >> >> accident? If so, that may be hosing it. Otherwise, have you tried
>> >> >> ltrim on it?
>> >> >>
>> >> >> Ex:
>> >> >>
>> >> >> $_castableString = ltrim($_yourString, '0');
>> >> >>
>> >> >> // Now cast
>> >>
>> >> <?php
>> >> // Create test file.
>> >> $s_TabbedFilename = './test.tab';
>> >> file_put_contents($s_TabbedFilename, "0\t00000002" . PHP_EOL .
>> >>     "4\t00000004" . PHP_EOL);
>> >>
>> >> // Open test file.
>> >> $fp_TabbedFile = fopen($s_TabbedFilename, 'rt')
>> >>     or die("Could not open {$s_TabbedFilename}\n");
>> >>
>> >> // Iterate file.
>> >> while(True)
>> >>         {
>> >>         if (False !== ($a_Line = fgetcsv($fp_TabbedFile, 0, "\t")))
>> >>                 {
>> >>                 var_dump($a_Line);
>> >>                 foreach($a_Line as $i_Index => $m_Value)
>> >>                         {
>> >>                         $a_Line[$i_Index] = intval($m_Value);
>> >>                         }
>> >>                 var_dump($a_Line);
>> >>                 }
>> >>         else
>> >>                 {
>> >>                 break;
>> >>                 }
>> >>         }
>> >>
>> >> // Close the file.
>> >> fclose($fp_TabbedFile);
>> >>
>> >> // Delete the file.
>> >> unlink($s_TabbedFilename);
>> >>
>> >>
>> >> outputs ...
>> >>
>> >> array(2) {
>> >>   [0]=>
>> >>   string(1) "0"
>> >>   [1]=>
>> >>   string(8) "00000002"
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   int(0)
>> >>   [1]=>
>> >>   int(2)
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   string(1) "4"
>> >>   [1]=>
>> >>   string(8) "00000004"
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   int(4)
>> >>   [1]=>
>> >>   int(4)
>> >> }
>> >>
>> >> intval() operates as standard on base 10, so no need to worry about
>> >> leading zeros being thought of as base8/octal.
>> >>
>> >> What is your code? Can you reduce it to something as small as the
>> >> above to see if you can repeat the issue?
>>
>> Please don't top post.
>>
>>
>> With regards to utf-8 data, no, PHP is not unicode aware.
>>
>> If a multi-byte character contains a 0x09 byte, then it will be
>> broken.
>>
>> Can you supply the file you are working on?
>>
>> b64encode it and drop it into a pastebin.
>>
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY
>
>

I've not used it, but the mbstring extension has mb_split() - Split
multibyte string using regular expression

Whilst it probably isn't as performant as explode() or fgetcsv(), it
should work. But I'm not a Unicode expert; with a file in hand I could
test this mechanism easily enough.

I'd be interested in knowing what the code I posted outputs when used
in conjunction with your data.

-- 
Richard Quadling
Twitter : EE : Zend
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
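
On the question of the safest way to read and explode a UTF-8 file,
here is a minimal sketch rather than a confirmed fix. It assumes the
stray "ï" printed in front of 00000002 in the quoted output is the first
byte of a UTF-8 byte order mark (bytes EF BB BF), which would explain
both the cast to 0 and the failed string comparison on the first line.
In UTF-8 every byte of a multi-byte character is 0x80 or higher, so a
0x09 byte can only ever be a real tab and explode("\t", ...) should not
split a character. The helper names stripUtf8Bom(), parseTabbedLine()
and readTabbedFile() are made up for illustration, and the type
declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from the thread.

/**
 * Remove a leading UTF-8 byte order mark (bytes EF BB BF), if present.
 */
function stripUtf8Bom(string $line): string
{
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        return substr($line, 3);
    }
    return $line;
}

/**
 * Split one raw line into tab-separated fields.
 * The BOM check is only applied to the first line of the file.
 */
function parseTabbedLine(string $line, bool $isFirstLine): array
{
    if ($isFirstLine) {
        $line = stripUtf8Bom($line);
    }
    // Drop the trailing line ending so the last field compares cleanly.
    $line = rtrim($line, "\r\n");
    return explode("\t", $line);
}

/**
 * Read a whole tab-separated UTF-8 file into an array of rows.
 */
function readTabbedFile(string $path): array
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Could not open {$path}");
    }
    $rows = [];
    $isFirstLine = true;
    // Testing the return value of fgets() avoids the extra empty
    // iteration that a while(!feof($fp)) loop can produce.
    while (($line = fgets($fp)) !== false) {
        $rows[] = parseTabbedLine($line, $isFirstLine);
        $isFirstLine = false;
    }
    fclose($fp);
    return $rows;
}
```

With something like this, readTabbedFile('file.tab') would return one
array per line, and (int) $row[0] on the first row should no longer be
thrown off by invisible leading bytes. If the file turns out not to
start with a BOM, dumping bin2hex($data[0]) is a quick way to see which
bytes are really there.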