Perhaps there is a utility or program that already does this.
I have been working with web pages that include some UTF-8
characters. I came up with a program that processes files and
creates report files listing all the lines and positions that
contain UTF-8 characters. Then another file that summarizes
each character with a count. Finally, a third file does the
same and includes the Unicode description of each character.
I found a list of all the UTF-8 2-byte, 3-byte, and 4-byte codes.
It turned out to contain 122357 characters. Unfortunately, they
were on pages that only listed around 1024 per page, so I had to
merge it all into a single file, which turned out to be 4.4M in
size.
Example of the process:
 218544 allraw.uog          (combination of 64 web pages)
   2000 allraw.uog.out      (contains a total of 2000 UTF-8 characters)
     28 allraw.uog.out-sum  (the 2000 characters are 28 unique ones)
     28 allraw.uog.out-sum2 (list with names)
    633 uog.csv             (I extracted 633 lines of contact data)
      7 uog.csv.out         (only 7 lines with UTF-8 characters)
      3 uog.csv.out-sum     (only 3 unique UTF-8 characters)
      3 uog.csv.out-sum2    (list with names)
 122357 utf-8codeslook.csv  (4.4M file with hex codes and descriptions)
Example:
uog.csv.out
131 27 c3b1 [ñ]
131 51 c3b1 [ñ]
276 14 c3a5 [å]
344 18 c381 [Á]
344 29 c3b1 [ñ]
344 48 c381 [Á]
344 59 c3b1 [ñ]
uog.csv.out-sum
2 c381 [Á]
1 c3a5 [å]
4 c3b1 [ñ]
uog.csv.out-sum2
2 c381 [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
1 c3a5 [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
4 c3b1 [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)
Those are really simple. The "all" file has 28 characters,
including some strange ones:
5 e2808b [] ZERO WIDTH SPACE (U+200B)
1 e28092 [‒] FIGURE DASH (U+2012)
44 e28093 [–] EN DASH (U+2013)
2 e28094 [—] EM DASH (U+2014)
It is not clear to me what the ZERO WIDTH SPACE is for, and the
other three dashes here all look the same to me. I guess I just
needed to do something; it was an interesting result.
The program takes the filename as input and creates the other
three files. If utf-8codeslook.csv is in the directory, it
creates the -sum2 file; otherwise it skips it. It is nice to
have the long description for some of them. The lookup file is
4.4M but compresses to 510K as a .xz file.
Program findnoascii4.cpp
/* Compile with e.g.: g++ -o findnoascii4 findnoascii4.cpp */
#include <cstdio>
#include <cstring>
#include <cstdlib>
using namespace std;

int main(int argc, char* argv[])
{
    FILE *fp1, *fp2, *fp3;
    char line[32000], fileout[80], summary[160];
    char code[16], codedes[500], *p1, utf8[16], utf8xchar[16];
    char filename[90], filename2[90];
    int count;
    unsigned char c1, c2, c3, c4;
    size_t i;
    int j = 0;

    if (argc < 2)
    {
        printf("Need a file name\n");
        exit(1);
    }
    if (!(fp1 = fopen(argv[1], "r")))
    {
        printf("Cannot open %s\n", argv[1]);
        exit(1);
    }
    strcpy(fileout, argv[1]);
    strcat(fileout, ".out");
    fp2 = fopen(fileout, "w");
    while (fgets(line, 32000, fp1))
    {
        j++;
        if (strlen(line) < 4) continue;
        for (i = 0; i < (strlen(line) - 3); i++)
        {
            if (line[i] <= 0)       /* high bit set: start of a UTF-8 sequence */
            {
                c1 = (unsigned char)line[i];
                c2 = (unsigned char)line[i+1];
                c3 = (unsigned char)line[i+2];
                c4 = (unsigned char)line[i+3];
                switch (c1)         /* case ranges are a GCC/Clang extension */
                {
                case 194 ... 223:   /* 2-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x [%c%c]\n",
                            j, (long)i, c1, c2, c1, c2);
                    i++;
                    break;
                case 224 ... 239:   /* 3-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x [%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c1, c2, c3);
                    i += 2;
                    break;
                case 240 ... 244:   /* 4-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x%2.2x [%c%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c4, c1, c2, c3, c4);
                    i += 3;
                    break;
                }
            }
        }
    }
    fclose(fp1); fclose(fp2);

    /* Summarize: unique codes with counts. */
    sprintf(summary, "cut -b 15-30 <%s | sort | uniq -c >%s-sum", fileout, fileout);
    system(summary);

    /* Optional: add Unicode names if the lookup file is present. */
    if (!(fp1 = fopen("utf-8codeslook.csv", "r"))) return 0;
    fclose(fp1);
    sprintf(summary, "%s-sum %s-sum2", fileout, fileout);
    sscanf(summary, "%s %s", filename, filename2);
    fp2 = fopen(filename, "r");
    fp3 = fopen(filename2, "w");
    while (fscanf(fp2, "%d %15s %15s", &count, utf8, utf8xchar) == 3)
    {
        /* Linear search of the lookup file for each code. */
        fp1 = fopen("utf-8codeslook.csv", "r");
        while (1)
        {
            if (fscanf(fp1, "%15[^;];%499[^\n] ", code, codedes) != 2) break;
            p1 = strstr(code, utf8);
            if (p1 != NULL) break;
            if (feof(fp1)) break;
        }
        fprintf(fp3, "%7d %-10s %3s\t%s\n", count, code, utf8xchar, codedes);
        fclose(fp1);
    }
    fclose(fp2); fclose(fp3);
    return 0;
}
Perhaps someone else would find it useful, or perhaps something
exists that does something similar that I wasn't able to find.
Some have mentioned running across weird files with UTF-8. It
seems to work for what I want, and it was fun figuring it out.
Thanks for your time.
I would be happy to make the utf-8codeslook.xz file available,
since it was a pain to add all the data from over 100 pages.
Could anyone find a single page with the data?
First 5 lines:
c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)
Some descriptions are almost 500 characters long.
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx