Re: OT: Analysing UTF-8 file contents.

On 21 Dec 2021, at 04:36, Michael D. Setzer II via users <users@xxxxxxxxxxxxxxxxxxxxxxx> wrote:

Perhaps there is already a utility or program that does this.
I have been working with web pages that include some UTF-8 (non-ASCII) characters. I came up with a program that processes a file and creates report files: one lists every line and position that contains a UTF-8 character, another summarises each character with a count, and a third does the same but adds the Unicode description of the character.

Found a list of all UTF-8 2-byte, 3-byte and 4-byte codes.
It turns out what I found was 122357 characters. Unfortunately, they were on pages that only listed around 1024 per page, so I had to merge it all into one file, which turns out to be 4.4M in size.

Example of process.
  218544 allraw.uog    (combination of 64 web pages)
    2000 allraw.uog.out (contains a total of 2000 UTF-8 characters)
      28 allraw.uog.out-sum (the 2000 characters are 28 unique ones)
      28 allraw.uog.out-sum2 (list with names)
     633 uog.csv       (I extracted 633 lines of contact data)
       7 uog.csv.out   (only 7 UTF-8 characters, on 3 lines)
       3 uog.csv.out-sum (only 3 unique UTF-8 characters)
       3 uog.csv.out-sum2 (list with names)
  122357 utf-8codeslook.csv (4.4M file that has hex codes and descriptions)

Example:
uog.csv.out
   131     27 c3b1       [ñ]
   131     51 c3b1       [ñ]
   276     14 c3a5       [å]
   344     18 c381       [Á]
   344     29 c3b1       [ñ]
   344     48 c381       [Á]
   344     59 c3b1       [ñ]
uog.csv.out-sum
      2 c381       [Á]
      1 c3a5       [å]
      4 c3b1       [ñ]
uog.csv.out-sum2
      2 c381       [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
      1 c3a5       [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
      4 c3b1       [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)

Those are real simple.
The allraw.uog file has 28 distinct characters, including some strange ones.
      5 e2808b     [​] ZERO WIDTH SPACE (U+200B)
      1 e28092     [‒] FIGURE DASH (U+2012)
     44 e28093     [–] EN DASH (U+2013)
      2 e28094     [—] EM DASH (U+2014)
Not clear what the Zero Width Space is for?
The other 3 here all look the same to me.

Guess I just needed to do something. Interesting result.
The program takes the filename as input and creates the other 3 files.
If utf-8codeslook.csv is in the current directory it creates the -sum2 file, otherwise it skips it. It is nice to have the long description for some characters.
The lookup file is 4.4M but compresses to 510K as an .xz file.

Program findnoascii4.cpp
#include <cstdio>
#include <cstring>
#include <cstdlib>

using namespace std;

// Length of the UTF-8 sequence that lead byte c starts, or 0 if c is not a lead byte.
// Continuation bytes are not validated, as in the original version.
static int seqlen(unsigned char c)
{
       if(c>=0xc2 && c<=0xdf) return 2;
       if(c>=0xe0 && c<=0xef) return 3;
       if(c>=0xf0 && c<=0xf4) return 4;
       return 0;
}

int main(int argc,char* argv[])
{
       FILE *fp1,*fp2,*fp3;
       char line[32000],fileout[256],summary[600],filename[300],filename2[300];
       // 4-byte codes are 8 hex digits, so these need more than 8 bytes
       char code[16],codedes[512],utf8[16],utf8xchar[16];
       int count,n,k,found;
       size_t i,len;
       int j=0;
       if(argc<2)
       {
              fprintf(stderr,"Usage: %s filename\n",argv[0]);
              exit(1);
       }
       snprintf(fileout,sizeof fileout,"%s.out",argv[1]);
       fp1=fopen(argv[1],"r");
       fp2=fopen(fileout,"w");
       if(fp1==NULL || fp2==NULL)
       {
              perror(argv[1]);
              exit(1);
       }
       // Test fgets() itself; feof() only becomes true after a read has failed.
       while(fgets(line,sizeof line,fp1)!=NULL)
       {
              j++;
              len=strlen(line);
              // Scan the whole line so characters at the end of a line are not missed.
              for(i=0;i<len;i++)
              {
                    n=seqlen((unsigned char)line[i]);
                    if(n==0 || i+(size_t)n>len) continue;   // ASCII, stray byte or truncated sequence
                    fprintf(fp2,"%6d %6ld ",j,(long)i);
                    for(k=0;k<n;k++) fprintf(fp2,"%2.2x",(unsigned char)line[i+k]);
                    // pad the hex code to 11 columns, then print the character itself
                    fprintf(fp2,"%*s[%.*s]\n",11-2*n,"",n,line+i);
                    i+=n-1;                                  // skip the continuation bytes
              }
       }
       fclose(fp1); fclose(fp2);
       // From byte 15 onward each line holds the hex code and the bracketed character.
       snprintf(summary,sizeof summary,"cut -b 15- <%s | sort | uniq -c >%s-sum",fileout,fileout);
       if(system(summary)!=0) return 1;
       fp1=fopen("utf-8codeslook.csv","r");
       if(fp1==NULL) return 0;                               // no lookup file: skip the -sum2 report
       snprintf(filename,sizeof filename,"%s-sum",fileout);
       snprintf(filename2,sizeof filename2,"%s-sum2",fileout);
       fp2=fopen(filename,"r");
       fp3=fopen(filename2,"w");
       if(fp2==NULL || fp3==NULL) return 1;
       while(fscanf(fp2,"%d %15s %15s",&count,utf8,utf8xchar)==3)
       {
              rewind(fp1);                                   // search the lookup file from the top
              found=0;
              while(fscanf(fp1," %15[^;];%511[^\n]",code,codedes)==2)
              {
                    // exact match, so c381 cannot match inside a longer code
                    if(strcmp(code,utf8)==0) { found=1; break; }
              }
              fprintf(fp3,"%7d %-10s %3s\t%s\n",count,utf8,utf8xchar,found?codedes:"(not in lookup file)");
       }
       fclose(fp1); fclose(fp2); fclose(fp3);
       return 0;
}

Perhaps someone else would find it useful, or perhaps something already exists that does something similar and I wasn't able to find it. Some people have mentioned running across weird files with UTF-8 in them. It seems to work for what I want.
It was fun figuring it out.
Thanks for your time.
I would be happy to make the utf-8codeslook.xz file available, since it was a pain to collect all the data from over 100 pages. I could not find a single page with all the data.
First 5 lines
c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)
Some descriptions are almost 500 characters??

I use python to check that files are in a particular encoding.

$ python3
 >>> x = open('the file', 'rb').read()
 >>> x.decode('utf-8')

If there is a problem you will be told the offset in the string and the byte that is the problem.
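
As a rough sketch of the same check in a script rather than at the prompt: the UnicodeDecodeError exception carries the offset in its `start` attribute. The function name `first_bad_byte` is my own invention for this example, not a Python API.

```python
def first_bad_byte(data, encoding='utf-8'):
    """Return (offset, byte) for the first undecodable byte, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        # e.start is the offset of the first byte that failed to decode
        return e.start, data[e.start]
    return None


if __name__ == '__main__':
    print(first_bad_byte('ma\u00f1ana'.encode('utf-8')))  # valid UTF-8
    print(first_bad_byte(b'caf\xe9 au lait'))             # 0xe9 is e-acute in cp1252, not valid UTF-8 here
```

For a real file you would pass in `open('the file', 'rb').read()` as above.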

Note that it is a very common coding error on Windows to assume that HTML is being written in
UTF-8 when in fact it is cp1252, or a related Windows code page. In fact this is so common that
the W3C recommends that if UTF-8 decoding fails for an HTML page, the browser automatically
falls back to cp1252.

Typically you see that the byte that will not decode is something like 0x92, 0x93 or 0x94 (curly quotes
in cp1252) or 0xa0 (no-break space in cp1252); none of these is valid on its own in UTF-8.

To check for cp1252 use:

 >>> x.decode('cp1252')
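
The two checks can be combined into one helper. This is only a sketch: `guess_encoding` is a name I made up, and it only distinguishes the two encodings discussed here. UTF-8 has to be tried first, because almost any byte sequence, including valid UTF-8, will also decode as cp1252 (just into the wrong characters).

```python
def guess_encoding(data):
    """Return 'utf-8' or 'cp1252', whichever decodes data first without error, else None."""
    for encoding in ('utf-8', 'cp1252'):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        return encoding
    return None


if __name__ == '__main__':
    print(guess_encoding('Pe\u00f1a'.encode('utf-8')))
    print(guess_encoding(b'\x93quoted\x94'))  # cp1252 curly quotes
```

A result of None means the data contains one of the few bytes that cp1252 leaves undefined (0x81, for example), so it is probably some other encoding.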

See https://docs.python.org/3/howto/unicode.html for details of the API for unicode.
Python has done the work for turning the Unicode specification into code and data you
can use.
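
For example, the standard `unicodedata` module can supply the character names, so the 4.4M lookup file may not be needed. Here is a minimal sketch that builds something like your -sum2 report; the function name `summarise` and the sample string are mine, and the layout only approximates yours.

```python
import unicodedata
from collections import Counter


def summarise(text):
    """Count each non-ASCII character in text and look up its Unicode name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    rows = []
    for ch, n in sorted(counts.items()):
        # Control characters have no name, so supply a default instead of raising ValueError
        name = unicodedata.name(ch, '<no name>')
        rows.append((n, ch.encode('utf-8').hex(), ch, f'{name} (U+{ord(ch):04X})'))
    return rows


if __name__ == '__main__':
    sample = 'Do\u00f1a \u00c1ngela \u2013 Pe\u00f1a\u200b'
    for n, code, ch, desc in summarise(sample):
        print(f'{n:7d} {code:<10} [{ch}] {desc}')
```

To run it on a file, decode the bytes first: `summarise(open('the file', 'rb').read().decode('utf-8'))`.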

Hope that helps you explore the data.

Barry
