Re: Having strange result on processing UTF-8 file

On 19 Dec 2021 at 7:54, Ed Greshko wrote:

From:             Ed Greshko <ed.greshko@xxxxxxxxxxx>
Date sent:      Sun, 19 Dec 2021 07:54:31 +0800
Subject:          Re: Having strange result on processing UTF-8 file
To:                  users@xxxxxxxxxxxxxxxxxxxxxxx
Send reply to:           Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.....
>
> The single file you get is an html formatted file, yes?  For the results that you want, and how you want to
> use it, do you really want html?  If not, why don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?
>
Yes. I figured out a workaround, but I'm not exactly sure what it is that changes the file from UTF-8 to some strange type.
system("wget2 --max-threads=70 --secure-protocol=PFS -q --base=\"https://www.uog.edu/directory/\" -i testlistuog");
testlist.uog has lines
?page=01
?page=02
---
?page=64

But that could change if they add or remove entries; there are currently 633 records. Some lines in the file are over 25000 characters. The total download is about 13M.
The lines I actually need for the data come to just 256K, so the download has lots of junk (stuff I don't need for what I'm doing).

Originally I had it find where the UTF-8 characters were on each line and print out the hex for the 2- or 3-byte strings. It would then print from that point in the line using %10.10s, since I didn't need to see the whole line. That seems to be what causes the problem, but I'm not sure why.

I modified the program to print out just the 2- or 3-byte UTF-8 character, and the output file stays the same type as the original file. Then I tried just using %s, and it also stays a UTF-8 file. But as I mentioned, some lines are over 25000 characters, and some lines have multiple UTF-8 characters, so perhaps the %10.10s was cutting off in the middle of some UTF-8 sequence?

Contents of the main function. Not pretty, but it works.

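/* needs <stdio.h>, <stdlib.h> and <string.h> */
FILE *fp1, *fp2;
char line[32000], fileout[4096];
unsigned char c1, c2, c3;
size_t i, len;
int j = 0;

if (argc < 2)
{
    fprintf(stderr, "Need file name\n");
    exit(1);
}
/* snprintf bounds the output name; the old fileout[20] could overflow */
snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
fp1 = fopen(argv[1], "r");
fp2 = fopen(fileout, "wb");
if (fp1 == NULL || fp2 == NULL)
{
    perror("fopen");
    exit(1);
}
/* Test fgets itself instead of feof, so a last line without a newline is kept */
while (fgets(line, sizeof line, fp1) != NULL)
{
    line[strcspn(line, "\n")] = 0;   /* strip the newline only if there is one */
    j++;
    len = strlen(line);
    for (i = 0; i + 1 < len; i++)
    {
        c1 = (unsigned char)line[i];
        if (c1 < 0x80) continue;     /* plain ASCII byte */
        c2 = (unsigned char)line[i+1];
        c3 = (i + 2 < len) ? (unsigned char)line[i+2] : 0;
        if ((c1 & 0xE0) == 0xC0)     /* lead byte of a 2-byte sequence (0xc0-0xdf) */
        {
            fprintf(fp2, "%5d %5ld %2.2x%2.2x     [%s]\n", j, (long)i, c1, c2, &line[i]);
            i++;                     /* skip the continuation byte */
        }
        else                         /* anything else is treated as 3 bytes, as before */
        {
            fprintf(fp2, "%5d %5ld %2.2x%2.2x%2.2x   [%s]\n", j, (long)i, c1, c2, c3, &line[i]);
            i += 2;                  /* skip both continuation bytes */
        }
    }
}
fclose(fp1); fclose(fp2);
return 0;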


Thanks again. I will try to figure out what causes it to stop being UTF-8. Like I said, the pages have lots of weird lines. But I get the data I need and build a MariaDB table with the 633 records that can be sorted via PHP.
There are actually only 3 lines I use that have UTF-8 characters, while the main file has 2000 lines with UTF-8 codes. I guess at least one of those lines caused the issue?

  131    27 c3b1     [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doreentc@xxxxxxxxxxxxxx]
  131    51 c3b1     [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doreentc@xxxxxxxxxxxxxx]
  276    14 c3a5     [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisidro@xxxxxxxxxxxxxx]
  344    18 c381     [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
  344    29 c3b1     [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
  344    48 c381     [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
  344    59 c3b1     [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]

I tried a number of things with iconv, but still ended up with the problem format.

Again, thanks for the time.

> --
> Did 황준호 die?
> _______________________________________________
> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure

  