Re: OT: Analysing UTF-8 file contents.

> On 22 Dec 2021, at 19:50, Michael D. Setzer II <mikes@xxxxxxxx> wrote:
> 
> On 22 Dec 2021 at 18:51, Barry Scott wrote:
> 
> From:    Barry Scott <barry@xxxxxxxxxxxxxxxx>
> Subject:    Re: OT: Analysing UTF-8 file contents.
> Date sent:    Wed, 22 Dec 2021 18:51:46 +0000
> To:    Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
> Copies to:    "Michael D. Setzer II" <mikes@xxxxxxxx>
> Send reply to:    Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
> 
> Cut long original message.
> 
>> I use python to check that files are in a particular encoding.
>> 
>> $ python3
>>>>> x = open('the file', 'rb').read()
>>>>> x.decode('utf-8')
>> 
>> If there is a problem you will be told the offset in the string and the byte that is
>> the problem.
>> 
>> Note that it is a very common coding error on Windows to assume that HTML is being
>> written in UTF-8 when in fact it is cp1252, or a related Windows code page. In fact
>> this is so common that the W3C recommends that if the utf-8 decode fails for an HTML
>> page, the browser automatically falls back to cp1252.
>> 
>> Typically you see that the byte that will not decode is 0xa0, which is a non-breaking
>> space in cp1252 (the curly quotes are 0x91 to 0x94), but not allowed on its own in
>> utf-8.
>> 
>> To check for cp1252 use:
>> 
>>>>> x.decode('cp1252')
>> 
>> See https://docs.python.org/3/howto/unicode.html for details of the API for unicode.
>> Python has done the work of turning the Unicode specification into code and data you
>> can use.
>> 
>> Hope that helps you explore the data.
>> 
>> Barry
>> 
> Thanks. I had just focused on the utf8 code. Since it starts 
> at c280 - there are actually a number of characters 
> between 80-c1 that wouldn't be covered.

You mean that the data you have is therefore not encoded in utf-8?
Could it be that it's cp1252?
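
A minimal sketch of that check: try a strict decode and report where it fails. The
helper name check_encoding and the sample bytes are made up for illustration; for a
real file you would pass in the result of open(path, 'rb').read().

```python
def check_encoding(data: bytes, encoding: str = "utf-8"):
    """Return None if data decodes cleanly, else (offset, offending byte)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that would not decode.
        return err.start, data[err.start]
    return None


# Hypothetical sample: "cafe" with an e-acute, written out as cp1252.
sample = b"caf\xe9"

for name in ("utf-8", "cp1252"):
    problem = check_encoding(sample, name)
    if problem is None:
        print(f"{name}: decodes cleanly")
    else:
        offset, byte = problem
        print(f"{name}: cannot decode byte 0x{byte:02x} at offset {offset}")
```

Note that a clean cp1252 decode only shows the bytes are legal cp1252, not that the
text was really written in it, since almost every byte value is legal there.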

Or, as already suggested, it is bytes of utf-8 and an 8-bit codec smashed together.
That would be a bug in the generation of the web page.
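
If it is the smashed-together case, one way to explore the data is to decode
everything that is valid utf-8 and fall back to an 8-bit codec only for the bytes that
are not. This is a rough sketch, not a definitive repair tool: the function name
decode_mixed is hypothetical, and cp1252 as the fallback is an assumption about the
data.

```python
def decode_mixed(data: bytes, fallback: str = "cp1252") -> str:
    """Decode valid utf-8 runs as utf-8 and any other bytes with the fallback codec."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            good_end = pos + err.start
            bad_end = pos + err.end
            pieces.append(data[pos:good_end].decode("utf-8"))
            # Bytes undefined in the fallback codec become U+FFFD.
            pieces.append(data[good_end:bad_end].decode(fallback, errors="replace"))
            pos = bad_end
    return "".join(pieces)


# Hypothetical sample: a utf-8 e-acute followed by cp1252 curly quotes.
mixed = b"\xc3\xa9 \x93hi\x94"
print(decode_mixed(mixed))
```

Because valid utf-8 sequences are taken first, a file that is purely cp1252 can still
be misread wherever its bytes happen to form a legal utf-8 sequence, so treat the
output as a diagnostic aid.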

Barry
> 
> Thanks for the info..
> 
> 