On 19 Dec 2021 at 7:54, Ed Greshko wrote:
From: Ed Greshko <ed.greshko@xxxxxxxxxxx>
Date sent: Sun, 19 Dec 2021 07:54:31 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: users@xxxxxxxxxxxxxxxxxxxxxxx
Send reply to: Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.....
>
> The single file you get is an html formatted file, yes? For the results that you want, and how you want to
> use it, do you really want html? If not, why don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?
>
Yes. I figured out a workaround, but I'm not exactly sure what
changes the file from UTF-8 to some strange type.
system("wget2 --max-threads=70 --secure-protocol=PFS -q --base=\"https://www.uog.edu/directory/\" -i testlistuog");
testlist.uog has lines
?page=01
?page=02
---
?page=64
But that could change if they add or remove pages;
currently there are 633 records. Some lines in the file are
over 25000 characters. The total download is about 13M.
The lines I actually need for the data come to just 256K, so
it has lots of junk (stuff I don't need for what I'm doing).
Originally I had it find where the UTF-8 characters were
on each line and print out the hex for the 2- or 3-byte
sequence. Then it would print from that point in the line
using %10.10s, since I didn't need to see the whole line.
That seems to cause the problem, but I'm not sure why.
I modified the program to print just the 2- or 3-byte UTF-8
character, and the output file stays UTF-8 like the original.
Then I tried plain %s, and it also stays a UTF-8 file. But as
I mentioned, some lines are over 25000 characters, and some
lines have multiple UTF-8 characters, so perhaps the
%10.10s was cutting off in the middle of a UTF-8 sequence?
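The precision in a printf %s conversion counts bytes, not characters, so a limit such as %10.10s can stop after the first byte of a 2- or 3-byte sequence and leave a stray lead byte in the output. That would be consistent with the guess above. A minimal sketch of one way to avoid it, assuming the input is valid UTF-8 (the name utf8_safe_prefix is made up for illustration, it is not from the original program):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the largest length <= max_bytes at which s can be cut without
   splitting a multi-byte UTF-8 sequence. Assumes s is valid UTF-8.
   (Hypothetical helper, not part of the original program.) */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;              /* the whole string fits */
    size_t cut = max_bytes;
    /* Continuation bytes look like 10xxxxxx. Step back until the byte
       at the cut position starts a new character. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

The result can be passed as a precision, for example fprintf(fp2, "[%.*s]", (int)utf8_safe_prefix(&line[i], 10), &line[i]), which prints at most 10 bytes but never half a character.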
Contents of the main function. Not pretty, but works.
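    /* Needs <stdio.h>, <stdlib.h> and <string.h>. */
    FILE *fp1, *fp2;
    char line[32000], fileout[4096];
    unsigned char c1, c2, c3;
    size_t i, len;
    int j = 0;

    if (argc < 2)
    {
        printf("Need File name??\n");
        exit(1);
    }
    fp1 = fopen(argv[1], "r");
    /* snprintf cannot overflow fileout, unlike strcpy/strcat into 20 bytes. */
    snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
    fp2 = fopen(fileout, "wb");
    if (fp1 == NULL || fp2 == NULL)
    {
        perror("fopen");
        exit(1);
    }
    /* Test fgets() itself; feof() only becomes true after a read has failed. */
    while (fgets(line, sizeof line, fp1) != NULL)
    {
        len = strlen(line);
        if (len > 0 && line[len - 1] == '\n')
            line[--len] = 0;            /* strip the newline only if present */
        j++;
        if (len < 3) continue;
        for (i = 0; i < len - 2; i++)
        {
            c1 = (unsigned char)line[i];
            if (c1 >= 0x80)             /* non-ASCII byte, whether char is signed or not */
            {
                c2 = (unsigned char)line[i + 1];
                c3 = (unsigned char)line[i + 2];
                if (c1 != 194 && c1 != 195 && c1 != 196 && c1 != 200)
                {
                    /* treated as a 3-byte sequence */
                    fprintf(fp2, "%5d %5ld %2.2x%2.2x%2.2x [%s]\n", j, (long)i,
                            c1, c2, c3, &line[i]);
                    i++;
                }
                else
                    /* treated as a 2-byte sequence */
                    fprintf(fp2, "%5d %5ld %2.2x%2.2x [%s]\n", j, (long)i,
                            c1, c2, &line[i]);
                i++;
            }
        }
    }
    fclose(fp1); fclose(fp2);
    return 0;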
Thanks again. I will try to figure out what causes it to stop
being UTF-8. Like I said, the pages have lots of weird
lines. But I get the data I need and build a MariaDB table
with the 633 records that can be sorted via PHP.
There are actually only 3 lines I use that have UTF-8
characters, while the main file has 2000 lines with UTF-8
code. I guess at least one of those lines caused the issue?
131 27 c3b1 [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doreentc@xxxxxxxxxxxxxx]
131 51 c3b1 [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doreentc@xxxxxxxxxxxxxx]
276 14 c3a5 [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisidro@xxxxxxxxxxxxxx]
344 18 c381 [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
344 29 c3b1 [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
344 48 c381 [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
344 59 c3b1 [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madridc@xxxxxxxxxxxxxx]
I tried a number of things with iconv, but still ended up
with the problem format.
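One way to narrow down which line stops being UTF-8 would be to run each output line through a small structural check and report the first offending byte. The following is only a sketch under that assumption: utf8_first_bad is a made-up name, and the check is simplified (it looks at lead bytes and continuation counts, not at overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first byte that breaks UTF-8 structure in
   buf[0..len), or len if every sequence is well formed.
   Simplified: checks lead bytes and continuation counts only.
   (Hypothetical helper, not part of the original program.) */
size_t utf8_first_bad(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;                       /* continuation bytes expected */
        if (b < 0x80) need = 0;
        else if (b >= 0xC2 && b <= 0xDF) need = 1;
        else if (b >= 0xE0 && b <= 0xEF) need = 2;
        else if (b >= 0xF0 && b <= 0xF4) need = 3;
        else return i;                     /* stray continuation or bad lead */
        if (need > len - i - 1)
            return i;                      /* sequence cut off at end of buffer */
        for (size_t k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;                  /* lead not followed by 10xxxxxx */
        i += need + 1;
    }
    return len;
}
```

Calling it on each line of the .out file and printing the line number whenever the result is less than the line length should point at the record where a sequence was cut short.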
Again, thanks for the time.
> --
> Did 황준호 die?
> _______________________________________________
> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure