Perhaps there is a utility or program that already does this.
I have been working with web pages that include some UTF-8
characters. I came up with a program that processes files and
creates report files listing all the lines and positions that
contain UTF-8 characters. Then another file that summarizes
each character with a count. Finally, a third file does the
same and includes the Unicode description of each character.
I found a list of all the UTF-8 2-byte, 3-byte, and 4-byte codes.
It turned out to contain 122357 characters. Unfortunately, they
were on pages that only listed around 1024 per page, so I had to
merge it all into a single file, which turned out to be 4.4M in
size.
Example of the process:
 218544 allraw.uog          (combination of 64 web pages)
   2000 allraw.uog.out      (contains a total of 2000 UTF-8 characters)
     28 allraw.uog.out-sum  (the 2000 characters are 28 unique ones)
     28 allraw.uog.out-sum2 (list with names)
    633 uog.csv             (I extracted 633 lines of contact data)
      7 uog.csv.out         (only 7 lines with UTF-8 characters)
      3 uog.csv.out-sum     (only 3 unique UTF-8 characters)
      3 uog.csv.out-sum2    (list with names)
 122357 utf-8codeslook.csv  (4.4M file with hex codes and descriptions)
Example:
uog.csv.out
131 27 c3b1 [ñ]
131 51 c3b1 [ñ]
276 14 c3a5 [å]
344 18 c381 [Á]
344 29 c3b1 [ñ]
344 48 c381 [Á]
344 59 c3b1 [ñ]
uog.csv.out-sum
2 c381 [Á]
1 c3a5 [å]
4 c3b1 [ñ]
uog.csv.out-sum2
2 c381 [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
1 c3a5 [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
4 c3b1 [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)
Those are really simple. The "all" file has 28 characters,
including some strange ones:
5 e2808b [] ZERO WIDTH SPACE (U+200B)
1 e28092 [‒] FIGURE DASH (U+2012)
44 e28093 [–] EN DASH (U+2013)
2 e28094 [—] EM DASH (U+2014)
It is not clear to me what the ZERO WIDTH SPACE is for, and the
other three dashes here all look the same to me. I guess I just
needed to do something; it was an interesting result.
The program takes the filename as input and creates the other
three files. If utf-8codeslook.csv is in the directory, it
creates the -sum2 file; otherwise it skips it. It is nice to
have the long description for some of them. The lookup file is
4.4M but compresses to 510K as a .xz file.
Program findnoascii4.cpp
/* Compile with e.g.: g++ -o findnoascii4 findnoascii4.cpp */
#include <cstdio>
#include <cstring>
#include <cstdlib>
using namespace std;

int main(int argc, char* argv[])
{
    FILE *fp1, *fp2, *fp3;
    char line[32000], fileout[80], summary[160];
    char code[16], codedes[500], *p1, utf8[16], utf8xchar[16];
    char filename[90], filename2[90];
    int count;
    unsigned char c1, c2, c3, c4;
    size_t i;
    int j = 0;

    if (argc < 2)
    {
        printf("Need a file name\n");
        exit(1);
    }
    if (!(fp1 = fopen(argv[1], "r")))
    {
        printf("Cannot open %s\n", argv[1]);
        exit(1);
    }
    strcpy(fileout, argv[1]);
    strcat(fileout, ".out");
    fp2 = fopen(fileout, "w");
    while (fgets(line, 32000, fp1))
    {
        j++;
        if (strlen(line) < 4) continue;
        for (i = 0; i < (strlen(line) - 3); i++)
        {
            if (line[i] <= 0)       /* high bit set: start of a UTF-8 sequence */
            {
                c1 = (unsigned char)line[i];
                c2 = (unsigned char)line[i+1];
                c3 = (unsigned char)line[i+2];
                c4 = (unsigned char)line[i+3];
                switch (c1)         /* case ranges are a GCC/Clang extension */
                {
                case 194 ... 223:   /* 2-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x [%c%c]\n",
                            j, (long)i, c1, c2, c1, c2);
                    i++;
                    break;
                case 224 ... 239:   /* 3-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x [%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c1, c2, c3);
                    i += 2;
                    break;
                case 240 ... 244:   /* 4-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x%2.2x [%c%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c4, c1, c2, c3, c4);
                    i += 3;
                    break;
                }
            }
        }
    }
    fclose(fp1); fclose(fp2);

    /* Summarize: unique codes with counts. */
    sprintf(summary, "cut -b 15-30 <%s | sort | uniq -c >%s-sum", fileout, fileout);
    system(summary);

    /* Optional: add Unicode names if the lookup file is present. */
    if (!(fp1 = fopen("utf-8codeslook.csv", "r"))) return 0;
    fclose(fp1);
    sprintf(summary, "%s-sum %s-sum2", fileout, fileout);
    sscanf(summary, "%s %s", filename, filename2);
    fp2 = fopen(filename, "r");
    fp3 = fopen(filename2, "w");
    while (fscanf(fp2, "%d %15s %15s", &count, utf8, utf8xchar) == 3)
    {
        /* Linear search of the lookup file for each code. */
        fp1 = fopen("utf-8codeslook.csv", "r");
        while (1)
        {
            if (fscanf(fp1, "%15[^;];%499[^\n] ", code, codedes) != 2) break;
            p1 = strstr(code, utf8);
            if (p1 != NULL) break;
            if (feof(fp1)) break;
        }
        fprintf(fp3, "%7d %-10s %3s\t%s\n", count, code, utf8xchar, codedes);
        fclose(fp1);
    }
    fclose(fp2); fclose(fp3);
    return 0;
}
Perhaps someone else would find it useful, or perhaps something
exists that does something similar that I wasn't able to find.
Some have mentioned running across weird files with UTF-8. It
seems to work for what I want, and it was fun figuring it out.
Thanks for your time.
I would be happy to make the utf-8codeslook.xz file available,
since it was a pain to add all the data from over 100 pages.
Could anyone find a single page with the data?
First 5 lines:
c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)
Some descriptions are almost 500 characters long.
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx