Having strange result on processing UTF-8 file


I've spent a number of hours trying all kinds of things I found on the web, but I'm not getting anywhere. It's probably something simple.

I downloaded 64 web pages into a single file using wget2. That part is fine.

$ file allraw.uog
allraw.uog: HTML document, UTF-8 Unicode text, with very long lines
The file is about 13 MB (I have no control over the source file).
I have a simple C++ program that finds lines containing special (non-ASCII) UTF-8 characters. It extracts each such character with a bit of the text that follows and prints it directly to the screen, where the UTF-8 characters display correctly. But if I redirect the output to a file and open that file, many of the UTF-8 characters show up as the wrong extended-ASCII character for the first byte, followed by odd codes. This happens in both gedit and geany.
I then modified the program to write its output directly to a file. If I use cat, the output displays the correct UTF-8, but if I open the file in gedit or geany it again shows a corrupted mix of extended ASCII.
 
$ ./findnoascii2 allraw.uog
I think this is the issue, but I have no idea how to fix it:
$ file allraw.uog.out
allraw.uog.out: Non-ISO extended-ASCII text

The file actually contains the correct UTF-8 data, and looking at it with hexedit confirms this, but both geany and gedit open the file as extended ASCII instead of UTF-8. Changing the encoding to UTF-8 afterward does nothing.
I don't see any options for this. Again, it's probably something simple.
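
As a hedged aside: `file` normally labels text as UTF-8 only when the byte sequences it examines are well-formed, so the "Non-ISO extended-ASCII text" result hints that the output may contain at least one malformed sequence even though most characters are correct. The sketch below is one way to look for such a sequence, assuming the output has been read fully into a std::string. `first_invalid_utf8` is a hypothetical helper name, not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the byte offset of the first ill-formed UTF-8 sequence in data,
// or std::string::npos if the whole buffer is well-formed.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates (ED A0..BF) and code points above U+10FFFF.
std::size_t first_invalid_utf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(data[i]);
        if (b0 < 0x80) { ++i; continue; }      // plain ASCII

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range of 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else return i;                          // not a valid lead byte

        if (i + len > n) return i;              // sequence cut off at the end
        assert(i + len <= n);                   // whole sequence is in range

        const unsigned char b1 = static_cast<unsigned char>(data[i + 1]);
        if (b1 < lo || b1 > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(data[i + k]);
            if (b < 0x80 || b > 0xBF) return i; // must be a continuation byte
        }
        i += len;
    }
    return std::string::npos;
}
```

Running something like this over allraw.uog.out and inspecting the reported offset in hexedit would show whether any bytes there fail to form a complete character.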

Thanks.
Displaying the output with cat is fine.
Columns: line number, position in line, hex code of the first character, then the character and a few more characters.
 1881   110 c2bb   »&nbsp;<s
 1881   196 c2bb    »&nbsp;
 2266   285 c2a0    L. <span
 2266   879 e2809c “Communi
 2266   954 e2809d ” of the
 3090   556 e280ba ›</a></l
 3090   655 c2bb   »</a></li
 3134    46 c8a7   ȧt</span>
 3134    83 c3a5   åhan</spa
 3245   150 c2a9      ©</a>

The same lines from geany:
 1881   110 c2bb   »&nbsp;<s
 1881   196 c2bb    »&nbsp;
 2266   285 c2a0    L. <span
 2266   879 e2809c â??Communi
 2266   954 e2809d â? of the
 3090   556 e280ba â?º</a></l
 3090   655 c2bb   »</a></li
 3134    46 c8a7   ȧt</span>
 3134    83 c3a5   Ã¥han</spa
 3245   150 c2a9      ©</a>
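
The program itself is not shown, so the following is only a sketch of a scanner that produces rows in this shape. Several snippets in the listings are exactly 10 bytes long (for example » plus `&nbsp;<s`), which would be consistent with cutting at a fixed byte count. That is an inference from the listing, not something stated in the post. A cut like that can split a multi-byte character and leave stray bytes in the output, so this version stops at a character boundary instead. The names `seq_len`, `take_whole_chars`, `hex_of` and `scan`, the 0-based byte position, and the 10-byte limit are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>   // std::istringstream, handy for trying scan() on in-memory text
#include <string>

// Length of the UTF-8 sequence starting at s[pos], or 0 if the lead byte is
// invalid, the sequence is truncated, or a continuation byte is missing.
// Structural check only: overlong forms and surrogates are not rejected.
static std::size_t seq_len(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    const unsigned char b = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    if (b < 0x80) len = 1;
    else if (b >= 0xC2 && b <= 0xDF) len = 2;
    else if (b >= 0xE0 && b <= 0xEF) len = 3;
    else if (b >= 0xF0 && b <= 0xF4) len = 4;
    if (len == 0 || pos + len > s.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        if ((static_cast<unsigned char>(s[pos + k]) & 0xC0) != 0x80) return 0;
    }
    return len;
}

// Longest run of whole characters starting at pos that fits in max_bytes.
// Stops early at a malformed byte, so the result never ends mid-character.
static std::string take_whole_chars(const std::string& s, std::size_t pos,
                                    std::size_t max_bytes) {
    assert(pos <= s.size());
    std::size_t end = pos;
    while (end < s.size()) {
        const std::size_t len = seq_len(s, end);
        if (len == 0 || end + len - pos > max_bytes) break;
        end += len;
    }
    return s.substr(pos, end - pos);
}

// Lowercase hex dump of the given bytes, e.g. "c2bb".
static std::string hex_of(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// For every well-formed non-ASCII character, write one row:
// line number, 0-based byte position, hex of the character, snippet.
void scan(std::istream& in, std::ostream& out, std::size_t max_bytes = 10) {
    std::string line;
    for (std::size_t line_no = 1; std::getline(in, line); ++line_no) {
        std::size_t i = 0;
        while (i < line.size()) {
            const unsigned char b = static_cast<unsigned char>(line[i]);
            const std::size_t len = (b < 0x80) ? 0 : seq_len(line, i);
            if (len == 0) { ++i; continue; }  // ASCII or malformed byte: skip
            out << std::right << std::setw(5) << line_no << std::setw(6) << i
                << ' ' << std::left << std::setw(6) << hex_of(line.substr(i, len))
                << ' ' << take_whole_chars(line, i, max_bytes) << '\n';
            i += len;
        }
    }
}
```

A `main` would open the input with std::ifstream and call `scan(file, std::cout)`, or pass a std::ofstream to write the rows straight to a file.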

Thanks...