> On 22 Dec 2021, at 19:50, Michael D. Setzer II <mikes@xxxxxxxx> wrote:
>
> On 22 Dec 2021 at 18:51, Barry Scott wrote:
>
> From: Barry Scott <barry@xxxxxxxxxxxxxxxx>
> Subject: Re: OT: Analysing UTF-8 file contents.
> Date sent: Wed, 22 Dec 2021 18:51:46 +0000
> To: Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
> Copies to: "Michael D. Setzer II" <mikes@xxxxxxxx>
> Send reply to: Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
>
> Cut long original message.
>
>> I use python to check that files are in a particular encoding.
>>
>> $ python3
>> >>> x = open('the file', 'rb').read()
>> >>> x.decode('utf-8')
>>
>> If there is a problem you will be told the offset in the string and the
>> byte that is the problem.
>>
>> Note that it is a very common coding error on Windows to assume that HTML
>> is being written in UTF-8 when in fact it is cp1252, or a related Windows
>> code page. In fact this is so common that the W3C recommends that if the
>> utf-8 decode fails for an HTML page the browser automatically falls back
>> to cp1252.
>>
>> Typically you see that the byte that will not decode is something like
>> 0xa0 (a non-breaking space in cp1252) or one of 0x91 to 0x94 (the curly
>> quotes in cp1252), none of which is allowed as a standalone byte in utf-8.
>>
>> To check for cp1252 use:
>>
>> >>> x.decode('cp1252')
>>
>> See https://docs.python.org/3/howto/unicode.html for details of the API
>> for unicode. Python has done the work of turning the Unicode
>> specification into code and data you can use.
>>
>> Hope that helps you explore the data.
>>
>> Barry
>>
> Thanks. I had just focused on the utf8 code. Since it starts
> at c280 - there are actually a number of characters
> between 80-c1 that wouldn't be covered.

You mean that the data you have is therefore not encoded in utf-8?
It could be that it's cp1252?
Or, as suggested already, bytes of utf-8 and an 8-bit codec smashed together.
That would be a bug in the generation of the web page.

Barry

>
> Thanks for the info..
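
For anyone following along, the interactive check quoted above can be wrapped in a small script. This is only a sketch: the helper names check_encoding and guess_encoding and the sample bytes are made up for illustration and are not part of any library. It tries utf-8 first, because most utf-8 byte sequences also decode under cp1252 (as mojibake), so the order of the candidates matters.

```python
def check_encoding(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else (offset, bad_byte, reason).

    Hypothetical helper: it only wraps bytes.decode() and reads the
    attributes that UnicodeDecodeError already carries.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return err.start, data[err.start], err.reason
    return None


def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes without error, or None.

    utf-8 is tried first on purpose: it is strict, while cp1252 accepts
    nearly every byte value and would hide a genuine utf-8 file.
    """
    for encoding in candidates:
        if check_encoding(data, encoding) is None:
            return encoding
    return None


# Illustrative data (an assumption, not taken from the original file).
text = "caf\u00e9 \u2019quoted\u2019"
as_utf8 = text.encode("utf-8")
as_cp1252 = text.encode("cp1252")

if __name__ == "__main__":
    for label, blob in (("utf-8 bytes", as_utf8), ("cp1252 bytes", as_cp1252)):
        problem = check_encoding(blob, "utf-8")
        if problem is None:
            print(label, "decode cleanly as utf-8")
        else:
            offset, bad_byte, reason = problem
            print(f"{label}: offset {offset}, byte 0x{bad_byte:02x}, {reason}")
        print("  first codec that works:", guess_encoding(blob))
```

To run it against a real file, read it the same way as in the session above, open('the file', 'rb').read(), and pass the bytes to check_encoding. If neither codec decodes cleanly, that points towards the "two encodings smashed together" case.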