On 19 Dec 2021 at 9:14, Ed Greshko wrote:

From:           Ed Greshko <ed.greshko@xxxxxxxxxxx>
Date sent:      Sun, 19 Dec 2021 09:14:37 +0800
Subject:        Re: Having strange result on processing UTF-8 file
To:             "Michael D. Setzer II" <mikes@xxxxxxxx>,
                Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>
Send reply to:  Community support for Fedora users <users@xxxxxxxxxxxxxxxxxxxxxxx>

> On 19/12/2021 08:31, Michael D. Setzer II wrote:
> > But could change if they add more or remove some
> > currently 633 records. Some lines in the file are over
> > 25000 characters?? Total download is about 13M.
> > The actual lines I need for the data are just 256K, so it
> > has lots of junk (stuff I don't need for what I'm doing).
>
> That 13M file. Does it contain html? If so, would it be easier
> to work with if it was converted to plain text?

Yes, they are all HTML pages, but some of the UTF-8 characters
don't map to a plain-text character, and they are in the name field.

Did figure out the issue. %10.10s and %20.20s both would cause the
problem. So I used the head command to pull various numbers of lines
until I found where the file went Non-ISO extended ASCII. Only a few
lines caused the issue, and in each case the last character in the
substring was a character above 127.

So I added these commands to copy 30 characters from that point, then
work back from the end and change the last character to null for as
long as it is >127:

    strcpy(linex,&line[i]);
    linex[30]=0;
    while(linex[strlen(linex)-1]>127) linex[strlen(linex)-1]=0;

Then used %s and just printed linex.
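For anyone hitting the same thing: the precision in %10.10s counts bytes, not characters, so it can cut a two-byte character such as ñ (c3 b1) in half. Below is a minimal sketch of a variation on the trimming above. It is not the code from the actual program: utf8_copy_trunc is a hypothetical helper name, and it assumes the input is valid UTF-8. Instead of stripping every trailing byte above 127, it looks at the first byte that would be cut off and only backs up when the cut lands inside a character, so a complete multi-byte character at the end is kept. The casts to unsigned char matter because a plain char may be signed, in which case a test like >127 is never true.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original program).
 * Copies at most max_bytes bytes of src into dst and NUL-terminates it,
 * never leaving a partial UTF-8 sequence at the end.
 * dst must have room for max_bytes + 1 bytes.
 * Assumes src is valid UTF-8. Returns the number of bytes copied. */
static size_t utf8_copy_trunc(char *dst, const char *src, size_t max_bytes)
{
    assert(dst != NULL && src != NULL);

    size_t len = strlen(src);
    if (len <= max_bytes) {
        /* Whole string fits, nothing is cut. */
        memcpy(dst, src, len + 1);
        return len;
    }

    len = max_bytes;
    /* src[len] is the first byte that gets cut off. If it is a
     * continuation byte (bit pattern 10xxxxxx), the cut falls inside a
     * character, so back up until src[len] is that character's lead
     * byte, which is then excluded as well. */
    while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
        len--;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With something like this, the call would be utf8_copy_trunc(linex, &line[i], 30) followed by printing linex with a plain %s, the same as before.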

218544 lines in allraw.uog
1898 lines in allraw.uog.out (lines with utf-8)

The uog.csv has 633 lines, but only these 3 have UTF-8:

131 27 c3b1 [ña, Ph.D.;Crisostomo-Muña;Do]
131 51 c3b1 [ña;Doreen;Professor of Accoun]
276 14 c3a5 [åni" Isidro;Isidro;Jaevani;Ju]
344 18 c381 [Álvarez-Piñer, Ph.D.;Madrid ]
344 29 c3b1 [ñer, Ph.D.;Madrid Álvarez-Pi]
344 48 c381 [Álvarez-Piñer;Carlos;Directo]
344 59 c3b1 [ñer;Carlos;Director / Associa]

The whole web page has a lot of other UTF-8 characters.

Thanks again.

> --
> Did 황준호 die?