On 22 Dec 2021 at 18:51, Barry Scott wrote:

From:           Barry Scott <barry@xxxxxxxxxxxxxxxx>
Subject:        Re: OT: Analysing UTF-8 file contents.
Date sent:      Wed, 22 Dec 2021 18:51:46 +0000
To:             Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
Copies to:      "Michael D. Setzer II" <mikes@xxxxxxxx>
Send reply to:  Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

Cut long original message.

> I use python to check that files are in a particular encoding.
>
> $ python3
> >>> x = open('the file', 'rb').read()
> >>> x.decode('utf-8')
>
> If there is a problem you will be told the offset in the string and the
> byte that is the problem.
>
> Note that it is a very common coding error on Windows to assume that HTML
> is being written in UTF-8 when in fact it is cp1252, or a related Windows
> code page. In fact this is so common that the W3C recommends that if the
> utf-8 decode fails for an HTML page the browser automatically falls back
> to cp1252.
>
> Typically you see that the byte that will not decode is 0xa0, which is a
> no-break space in cp1252 (the curly quotes are 0x91-0x94), but is not
> allowed on its own in utf-8.
>
> To check for cp1252 use:
>
> >>> x.decode('cp1252')
>
> See https://docs.python.org/3/howto/unicode.html for details of the API
> for unicode. Python has done the work of turning the Unicode
> specification into code and data you can use.
>
> Hope that helps you explore the data.
>
> Barry
>

Thanks. I had just focused on the utf8 codes. Since those start at c280,
there are actually a number of characters between 80 and c1 that wouldn't
be covered.

Thanks for the info.
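For anyone wanting to script the decode check quoted above, here is a rough sketch rather than anything definitive. The helper names first_bad_byte and bad_utf8_lead_bytes are made up for illustration; the only Python APIs used are bytes.decode and the start attribute of UnicodeDecodeError.

```python
def first_bad_byte(data: bytes, encoding: str):
    """Return (offset, byte value) of the first undecodable byte, or None.

    Hypothetical helper: it wraps the data.decode(encoding) check and
    pulls the failing offset out of the exception.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the codec rejected.
        return err.start, data[err.start]
    return None


def bad_utf8_lead_bytes():
    """List the byte values from 0x80 up that cannot start a 2-byte sequence.

    Each candidate is followed by a valid continuation byte (0x80), so any
    failure at offset 0 is caused by the lead byte itself.
    """
    bad = []
    for value in range(0x80, 0xE0):
        if first_bad_byte(bytes([value, 0x80]), "utf-8") == (0, value):
            bad.append(value)
    return bad


sample = b"caf\xa0"  # a lone 0xa0 byte: fine in cp1252, invalid in utf-8
for enc in ("utf-8", "cp1252"):
    print(enc, first_bad_byte(sample, enc))

lead = bad_utf8_lead_bytes()
print(f"invalid lead bytes: {lead[0]:#x}-{lead[-1]:#x}")
```

To run it on a real file, read the bytes with open(path, 'rb').read() and pass them in as data. The second helper checks the 80-c1 range mentioned in the reply: none of those values can begin a UTF-8 sequence, which is why valid two-byte codes start at c2 80.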