Update
I just got this task done.
I did not use Pandas, mainly because I wanted to work in SQL with a real database instead of a CSV file.
I used sqliteBrowser to import the data from the CSV file. It worked well, except that it got the data type wrong on some of the 150+ columns (fields), and I had to go in and edit them manually. It did this because the data in the rows it looked at was inconsistent. It didn't make a mistake; the data it was given was garbage.
I was able to edit the fields easily enough, but it was slow because the user interface is a bit clunky. No biggie, just took longer than it might have. Still worked well though.
At that point I had a big table that needed to be converted into actual database records so that I could manipulate them properly with SQL. My table was 110K x 150 (rows x columns). I needed to convert that into ~16.5M individual records.
I tried doing this with SQL statements, and an SQL wizard could probably pull it off, but I found it frustrating. It wasn't that I couldn't convert the individual cells into records; it was that the records needed some data checking and manipulation before being inserted into a new table.
I turned to my trusty friend, Python, and found the sqlite3 module, which is a Python driver for accessing SQLite. From there things went really well.
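In case the details are useful to anyone else, the heart of it is a loop roughly like the one below. The table and column names here are made up for illustration (wide_data for the imported table, measurements for the new record table), and I'm assuming the first column of the wide table is a row ID:

import sqlite3

con = sqlite3.connect("mydata.db")  # made-up file name
con.execute("""CREATE TABLE IF NOT EXISTS measurements
               (row_id INTEGER, field_name TEXT, value REAL)""")

cur = con.execute("SELECT * FROM wide_data")
col_names = [d[0] for d in cur.description]  # column names of the wide table

for row in cur:
    row_id = row[0]  # assuming the first column is an ID
    for name, value in zip(col_names[1:], row[1:]):
        # (the real loop does some checking and cleanup of value here)
        con.execute("INSERT INTO measurements VALUES (?, ?, ?)",
                    (row_id, name, value))

con.commit()
con.close()

One thing that matters with ~16.5M inserts: sqlite3 keeps them all in a single transaction until commit() is called, which is far faster than committing row by row.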
I encountered a small problem when I hit empty cells that were NULL in the source table. I needed to convert them to 0.0. I found out that they read as None in Python and so my fix was:
output = row[column]
if output is None:  # this is for a NULL field in the db
    output = 0.0
My hangup was that I tried comparing them (using ==) to "", '', "None", NULL, Null, etc. Finally I realized that Python reads them as the None object, switched to the "is None" test, and then it worked.
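For anyone who trips over the same thing: SQLite's NULL comes back to Python as the None object, not as an empty string or the string "None", so only a test against None itself matches. A quick throwaway session shows it:

import sqlite3

con = sqlite3.connect(":memory:")
val = con.execute("SELECT NULL").fetchone()[0]
print(val is None)    # True
print(val == "")      # False
print(val == "None")  # False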
We are using this data to make some large decisions. It is great to have such fantastic tools available on a moment's notice, for free, to be able to do this work. The supplier of the data was pushing to do the manipulation for us for a hefty hourly fee. I'm glad we didn't have to resort to that. Now that we have good tools to do this work, we can mine this data any way we want, using SQL and/or Python.
Linux rocks! Fedora rocks!
Thanks for the quick reply late in the day on a Friday. Having finished this up, I don't have to worry about it for the rest of the weekend.