On Tue, 21 Dec 2021 at 00:37, Michael D. Setzer II via users <users@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
Perhaps there is a utility or program that does this.
Microsoft's Code editor (available in Fedora) just added highlighting of non-ASCII characters in ASCII file formats.
I have been using Python's set() to detect non-ASCII in program fragments.
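For anyone who wants to try the set() approach, here is a minimal sketch. The function name non_ascii_chars and the sample fragment are made up for illustration; the non-ASCII characters are written as escapes so the snippet itself stays plain ASCII.

```python
def non_ascii_chars(text):
    """Return the set of distinct non-ASCII characters in text."""
    return {ch for ch in set(text) if ord(ch) > 127}


# Sample fragment with curly double quotes and an i with diaeresis.
fragment = "print(\u201chello\u201d)  # na\u00efve"

for ch in sorted(non_ascii_chars(fragment)):
    print(f"U+{ord(ch):04X} {ch}")
```

An empty set means the fragment is clean ASCII.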
A leading reason for users reporting "it doesn't work" in user forums for the
remote-sensing software I use is non-ASCII characters provided by overly
helpful editors. Remote sensing is being used world-wide, often by people trained
in agronomy, fisheries, etc. whose prior computing experience was MS Word and
Excel. If you ask them to post the text of the problem script or configuration
file, they often post a screen capture.
Have been working with web pages that include some utf-8 characters. Came up with a program that processes files and creates report files that list all the lines and positions that include utf-8 characters. Then another file that summarizes each character with a count. Then finally a third that does the same and includes the description of each character.
Found a list of all utf-8 2 byte, 3 byte and 4 byte codes. Turns out what I found was 122357 characters. Unfortunately, they were on pages that only listed around 1024 per page, so had to merge it all into a file that turns out to be 4.4M in size....
Example of process.
 218544 allraw.uog (combination of 64 web pages)
   2000 allraw.uog.out (contains a total of 2000 utf-8 characters)
     28 allraw.uog.out-sum (the 2000 characters are 28 unique ones)
     28 allraw.uog.out-sum2 (list with names)
    633 uog.csv (I extracted 633 lines of contact data)
      7 uog.csv.out (Only 7 lines with utf-8 characters)
      3 uog.csv.out-sum (Only 3 unique utf-8 characters)
      3 uog.csv.out-sum2 (list with names)
 122357 utf-8codeslook.csv (4.4M file that has hex codes and descriptions)

Example:
uog.csv.out
   131     27 c3b1 [ñ]
   131     51 c3b1 [ñ]
   276     14 c3a5 [å]
   344     18 c381 [Á]
   344     29 c3b1 [ñ]
   344     48 c381 [Á]
   344     59 c3b1 [ñ]
uog.csv.out-sum
      2 c381 [Á]
      1 c3a5 [å]
      4 c3b1 [ñ]
uog.csv.out-sum2
      2 c381 [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
      1 c3a5 [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
      4 c3b1 [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)
Those are real simple.
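The same three reports can be approximated in a few lines of Python. This is only a sketch, not a port of the C++ program: the names scan and summarize are invented here, the descriptions come from Python's bundled unicodedata module instead of a lookup file, and the input is assumed to be text that has already been decoded as UTF-8.

```python
import unicodedata
from collections import Counter


def scan(lines):
    """Yield (line_no, byte_offset, hex_bytes, char) for each non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        offset = 0  # byte offset within the UTF-8 encoded line
        for ch in line:
            encoded = ch.encode("utf-8")
            if len(encoded) > 1:
                yield line_no, offset, encoded.hex(), ch
            offset += len(encoded)


def summarize(hits):
    """Return (count, hex_bytes, char, name) per distinct character, sorted by hex."""
    counts = Counter((hex_bytes, ch) for _, _, hex_bytes, ch in hits)
    return [
        (n, hex_bytes, ch, unicodedata.name(ch, "<unnamed>"))
        for (hex_bytes, ch), n in sorted(counts.items())
    ]


sample = ["Se\u00f1or Pe\u00f1a\n", "\u00c1ngel\n"]
hits = list(scan(sample))
for line_no, offset, hex_bytes, ch in hits:
    print(f"{line_no:6d} {offset:6d} {hex_bytes} [{ch}]")
for n, hex_bytes, ch, name in summarize(hits):
    print(f"{n:7d} {hex_bytes} [{ch}] {name} (U+{ord(ch):04X})")
```

To run it on a real file, something like scan(open(path, encoding="utf-8")) should work, as long as the file really is valid UTF-8.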
In programming languages you often see author or place names in comments, so a general
utility needs a way to exclude comments.
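One crude way to exclude comments is to drop everything after a comment marker before checking each line. The sketch below assumes "#" line comments and a made-up function name; it does not understand quoted strings, so a "#" inside a string literal would cut the line short. A real utility would need per-language rules.

```python
def non_ascii_outside_comments(lines, marker="#"):
    """Return (line_no, column, char) for non-ASCII characters before the comment marker."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        code = line.split(marker, 1)[0]  # keep only the part before the marker
        for col, ch in enumerate(code):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


script = [
    "# author: Pe\u00f1a",              # name in a comment: ignored
    "x = \u201chi\u201d  # caf\u00e9",  # curly quotes in code: reported
]
for line_no, col, ch in non_ascii_outside_comments(script):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```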
The all file has 28 characters that include some strange ones.
      5 e2808b [] ZERO WIDTH SPACE (U+200B)
      1 e28092 [‒] FIGURE DASH (U+2012)
     44 e28093 [–] EN DASH (U+2013)
      2 e28094 [—] EM DASH (U+2014)
Not clear what the Zero Width Space is for? The other 3 here all look the same to me??
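The dashes look alike in most fonts but are distinct code points, and the zero width space is an invisible format character (it marks a place where a line may be broken). A small sketch using Python's unicodedata module makes the difference visible; the helper name describe is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return (code point, UTF-8 hex, general category, name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(),
        unicodedata.category(ch),
        unicodedata.name(ch),
    )


for ch in "\u200b\u2012\u2013\u2014":
    print(*describe(ch))
```

Category Cf means "format" (invisible), and Pd means "dash punctuation".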
In my experience, editors and PDF workflows often replace ASCII single and double
quote pairs with opening and closing versions. Many of the programs
I use come with PDF documentation and tutorials with code fragments that produce
errors when pasted into user code due to this issue.
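A pasted fragment can be repaired by mapping the typographic quotes back to their ASCII equivalents. This is a sketch covering only the four most common quote characters; the names ASCII_QUOTES and straighten_quotes are made up here.

```python
# Map opening/closing typographic quotes to plain ASCII quotes.
ASCII_QUOTES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def straighten_quotes(text):
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(ASCII_QUOTES)


pasted = "print(\u201chello\u201d, \u2018world\u2019)"
print(straighten_quotes(pasted))
```

Note that this also rewrites quotes that legitimately appear in comments or string data, so it is best applied to short pasted fragments, not whole files.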
Guess I just needed to do something. Interesting result. Program takes the filename as input and creates the other 3 files. If utf-8codeslook.csv is in the directory it creates the sum2 file, otherwise it skips it. Nice to have the long description on some? File is 4.4M but compresses to 510K as an .xz file.
Program findnoascii4.cpp

// findnoascii4.cpp: list the multi-byte UTF-8 sequences found in a text file.
#include <cstdio>
#include <cstring>
#include <cstdlib>

int main(int argc, char* argv[])
{
    FILE *fp1, *fp2, *fp3;
    char line[32000], fileout[1024], summary[2200];
    // A 4-byte sequence prints as 8 hex digits, so 8-byte buffers would overflow.
    char code[16], codedes[600], utf8[16], utf8xchar[16];
    char filename[1040], filename2[1040];
    int count, j = 0;

    if (argc < 2) {
        printf("Need File name??\n");
        exit(1);
    }
    fp1 = fopen(argv[1], "r");
    snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
    fp2 = fopen(fileout, "w");
    if (!fp1 || !fp2) {
        perror("fopen");
        exit(1);
    }

    // Pass 1: one report line per sequence: line number, byte offset, hex, [char].
    while (fgets(line, sizeof line, fp1)) {
        size_t len = strlen(line);
        j++;
        for (size_t i = 0; i < len; i++) {
            unsigned char c1 = (unsigned char)line[i];
            size_t n = 0;  // sequence length implied by the lead byte
            if (c1 >= 194 && c1 <= 223) n = 2;
            else if (c1 >= 224 && c1 <= 239) n = 3;
            else if (c1 >= 240 && c1 <= 244) n = 4;
            if (n == 0 || i + n > len) continue;  // ASCII, stray, or truncated byte
            fprintf(fp2, "%6d %6ld ", j, (long)i);
            for (size_t k = 0; k < n; k++)
                fprintf(fp2, "%2.2x", (unsigned char)line[i + k]);
            fprintf(fp2, " [%.*s]\n", (int)n, line + i);
            i += n - 1;  // skip the continuation bytes
        }
    }
    fclose(fp1);
    fclose(fp2);

    // Pass 2: count each distinct sequence (columns 15-30 hold the hex and [char]).
    snprintf(summary, sizeof summary,
             "cut -b 15-30 <%s | sort | uniq -c >%s-sum", fileout, fileout);
    system(summary);

    // Pass 3: add descriptions, but only if the lookup table is present.
    fp1 = fopen("utf-8codeslook.csv", "r");
    if (!fp1) return 0;
    snprintf(filename, sizeof filename, "%s-sum", fileout);
    snprintf(filename2, sizeof filename2, "%s-sum2", fileout);
    fp2 = fopen(filename, "r");
    fp3 = fopen(filename2, "w");
    if (!fp2 || !fp3) {
        perror("fopen");
        exit(1);
    }
    while (fscanf(fp2, "%d %15s %15s", &count, utf8, utf8xchar) == 3) {
        bool found = false;
        rewind(fp1);  // rescan the table from the top for every entry
        while (fscanf(fp1, " %15[^;];%599[^\n]", code, codedes) == 2) {
            if (strcmp(code, utf8) == 0) {
                found = true;
                break;
            }
        }
        fprintf(fp3, "%7d %-10s %3s\t%s\n", count, utf8, utf8xchar,
                found ? codedes : "(not in table)");
    }
    fclose(fp1);
    fclose(fp2);
    fclose(fp3);
    return 0;
}
Perhaps someone else would find it useful, or perhaps something exists that does something similar that I wasn't able to find. Some mentioned they have run across weird files with utf-8. Seems to work for what I want. Was fun figuring it out. Thanks for your time.
Would be happy to make the utf-8codeslook.xz file available since it was a pain to add all the data from over 100 pages. Couldn't find a single page with the data??
First 5 lines
c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)
Some descriptions are almost 500 characters??
Yes, and the tables are constantly being updated, so it might be better to have a
tool to generate an up-to-date table locally. It might be more useful to flag certain
classes of characters. Emojis are showing up in code comments and documentation,
but we don't necessarily need to know which Emoji.
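As a rough sketch of both ideas: Python's unicodedata module carries the Unicode tables of whatever Python version is installed, so a lookup file in the same "hex;NAME (U+XXXX)" layout can be regenerated locally, and the general category gives a coarse class. The function names and class labels below are made up for illustration. Control characters such as U+0080 have no formal name and are skipped here, so the output will not match the web-page list exactly.

```python
import sys
import unicodedata


def lookup_lines(first=0x80, last=sys.maxunicode):
    """Yield 'utf8hex;NAME (U+XXXX)' for every named character in the range."""
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        name = unicodedata.name(ch, None)
        if name is None:
            continue  # unassigned, control, or private-use code point
        yield f"{ch.encode('utf-8').hex()};{name} (U+{cp:04X})"


def char_class(ch):
    """Return a coarse, illustrative class label based on the general category."""
    cat = unicodedata.category(ch)
    if cat == "Cf":
        return "invisible format character"
    if cat == "Pd":
        return "dash"
    if cat in ("Pi", "Pf"):
        return "typographic quote"
    if cat == "So":
        return "symbol (includes most emoji)"
    if cat.startswith("L"):
        return "letter"
    return cat


print(next(lookup_lines(0xC1, 0xC1)))
for ch in "\u00f1\u2013\u201c\u200b\U0001F600":
    print(f"U+{ord(ch):04X} {char_class(ch)}")
```

Writing the whole table is then a matter of looping over lookup_lines() and writing each line to a file.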
--
George N. White III