Re: Having strange result on processing UTF-8 file

On 19 Dec 2021 at 9:14, Ed Greshko wrote:

From:	Ed Greshko <ed.greshko@xxxxxxxxxxx>
Date sent:	Sun, 19 Dec 2021 09:14:37 +0800
Subject:	Re: Having strange result on processing UTF-8 file
To:	"Michael D. Setzer II" <mikes@xxxxxxxx>,
	Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
Send reply to:	Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

> On 19/12/2021 08:31, Michael D. Setzer II wrote:
> 
>     But could change if they add more or remove some 
>     currently 633 records. Some lines in the file are over 
>     25000 characters?? Total download is about 13M.
>     The actual lines I need for the data are just 256K, so it 
>     has lots of junk (stuff I don't need for what I'm doing).
> 
> That 13M file. Does it contain html? If so, would it be easier 
> to work with if it was converted to plain text?

Yes, they are all HTML pages, but some of the UTF-8 
characters don't map to a plain-text character, and they 
are in the name field. I did figure out the issue: %10.10s 
and %20.20s both caused the problem. I used the head 
command to pull varying numbers of lines until I found 
where the file went Non-ISO extended-ASCII. Only a few 
lines caused the issue, and in each one the last byte of 
the substring was above 127, presumably because the 
precision counts bytes and the cut landed inside a 
multi-byte character.
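
To illustrate why a byte-counted precision can produce this, here is a minimal sketch. The names cut_splits_utf8 and format_prefix are made up for this example and are not from my program; it only assumes that the precision on %s counts bytes, not characters.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if cutting s after n bytes would split a UTF-8 sequence,
   i.e. the first byte cut off is a continuation byte (10xxxxxx). */
int cut_splits_utf8(const char *s, size_t n)
{
    if (n >= strlen(s)) return 0;   /* nothing is cut off */
    return ((unsigned char)s[n] & 0xC0) == 0x80;
}

/* Copies at most n bytes of s into out, the same way a "%.Ns"
   conversion does. Returns what snprintf returns. */
int format_prefix(char *out, size_t outsz, const char *s, int n)
{
    return snprintf(out, outsz, "%.*s", n, s);
}
```

With "Madrid Álvarez", an 8-byte cut ends on 0xc3, the first half of the two-byte "Á", so the output is no longer valid UTF-8.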

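So I added these commands to copy 30 bytes starting at 
that point, then work back from the end and change the 
last byte to null for as long as it is >127:
strncpy(linex,&line[i],30);   /* at most 30 bytes; linex must hold 31 */
linex[30]=0;
/* the cast matters where char is signed; linex[0] guards an empty string */
while(linex[0] && (unsigned char)linex[strlen(linex)-1]>127) linex[strlen(linex)-1]=0;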

Then I used %s and just printed linex. 
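
The trimming loop above also throws away any complete multi-byte character that happens to sit at the end of the 30 bytes. A possible alternative, sketched here under the assumption that the input is valid UTF-8, is to back up only to the start of the character that was cut. The function name utf8_truncate is hypothetical and not part of my program.

```c
#include <string.h>

/* Shorten s in place to at most max_bytes bytes without splitting a
   UTF-8 sequence. Assumes s holds valid UTF-8. */
void utf8_truncate(char *s, size_t max_bytes)
{
    size_t n = max_bytes;
    if (strlen(s) <= n) return;            /* already short enough */
    /* s[n] is the first byte to be cut off; if it is a continuation
       byte (10xxxxxx), step back to the lead byte of that character. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    s[n] = '\0';
}
```

Calling utf8_truncate(linex, 30) after copying the rest of the line would keep a trailing "ñ" when it fits completely and drop it only when the cut falls in the middle of it.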
218544 lines in allraw.uog
    1898 lines in allraw.uog.out (lines with utf-8)

The uog.csv has 633 lines, but only these 3 lines (131, 
276 and 344) have UTF-8:
  131    27 c3b1     [ña, Ph.D.;Crisostomo-Muña;Do]
  131    51 c3b1     [ña;Doreen;Professor of Accoun]
  276    14 c3a5     [åni" Isidro;Isidro;Jaevani;Ju]
  344    18 c381     [Álvarez-Piñer, Ph.D.;Madrid ]
  344    29 c3b1     [ñer, Ph.D.;Madrid Álvarez-Pi]
  344    48 c381     [Álvarez-Piñer;Carlos;Directo]
  344    59 c3b1     [ñer;Carlos;Director / Associa]
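
A listing like the one above could be produced roughly as follows. This is a sketch, not the program I actually ran: the function names are invented, the columns are assumed to be line number, 0-based byte offset, the hex bytes of the character, and the next 30 bytes of the line.

```c
#include <stdio.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by byte b, or 0 if
   b is a continuation byte or an out-of-range byte (0xF8 and above). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Index of the first byte >= 0x80 at or after position from, or -1. */
long find_non_ascii(const char *s, size_t from)
{
    size_t len = strlen(s);
    for (size_t i = from; i < len; i++)
        if ((unsigned char)s[i] >= 0x80) return (long)i;
    return -1;
}

/* Print one report row per multi-byte character found in line. */
void report_line(FILE *out, long lineno, const char *line)
{
    long pos = find_non_ascii(line, 0);
    while (pos >= 0) {
        int n = utf8_seq_len((unsigned char)line[pos]);
        if (n < 2) n = 1;              /* stray byte: report it alone */
        fprintf(out, "%5ld %5ld ", lineno, pos);
        for (int k = 0; k < n && line[pos + k]; k++)
            fprintf(out, "%02x", (unsigned char)line[pos + k]);
        /* The 30-byte context is cut by bytes, so it can end mid-character. */
        fprintf(out, "     [%.30s]\n", line + pos);
        pos = find_non_ascii(line, (size_t)pos + n);
    }
}
```

Reading the file with fgets and calling report_line(stdout, lineno, line) for each line would give one row per non-ASCII character.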

Whole web page has a lot of other utf-8 characters.

Thanks again.


> --
> Did 황준호 die?



