Rich:

On Wed, May 22, 2019 at 6:07 PM Rich Shepard <rshepard@xxxxxxxxxxxxxxx> wrote:
> On Wed, 22 May 2019, Francisco Olarte wrote:
> > You are not reading what we write to you. Note YOU AND ONLY YOU are the
> > one speaking of PK. We are speaking of "unique identifier" ( that would
> > be, IIRC, "candidate keys"; you can pick any as your PK, or even
> > introduce a new synthetic one with a sequence, or a femtosecond-exact
> > timestamp, or whatever ).

> The organizations table has org_id (an integer) as PK.
> The people table has person_id (an integer) as PK and org_id as the
> reference to organization.org_id.
> Does this help?

It does. But if, as I assume, you are referring to the FINAL tables where
you have the data inserted, then to suggest a good method we would also
need to know HOW the org_id / person_id are generated ( i.e. sequences, by
hand, externally assigned ).

Also, when I speak of a "unique identifier" I am not speaking of the one
in your FINAL tables; I assume you would have at least the *_id field as
PKEY there, so nothing else is needed. I mean the one in your SOURCE data
set ( it can be anything, like the row number in the original excel ). If
you read the examples I sent you, I had "unique identifiers" for
organizations in both of them: in the first one it was the organization
name (clavelitos/bits cuadrados), in the other one it was "the previously
non-indented line". Both can be processed with a little awk/perl + sql, or
with a lot of sql ( i.e., the indented one can be loaded into a big temp
table with a lot of text fields, or a single one, and then split with sql
functions, but perl/awk is much easier ).

> > When you are fluent in SQL you do not try to play with files, you
> > import every column of your data into temporary tables, clean them up,
> > and join ( if needed ) them until you have a select that gives you what
> > you want and then insert this. Normally you insert several SELECTs into
> > temporary tables ( especially when you only have thousands of records )
> > so you can do the clean up in steps.

> Most of my time is spent writing using LaTeX/LyX. Depending on the
> project's needs I'll also use SQL, R, GRASS, and other tools. I'm a
> generalist, like your PCP, not a specialist. But, I also rely on emacs,
> grep, sed, and awk for data munging and am more fluent with these tools
> than I am with SQL or Python.

MMM, apart from angel dust I do not know what PCP could stand for. But
anyway, for simple one-liners and data filtering, sed/awk are more than
enough ( I substitute them with perl because it can do the same in a
slightly more verbose way, and I had already learned it when confronted
with awk/sed ) ( I do not find Python particularly palatable for one-shot
text processing, YMMV ).

If you are not too fluent in SQL, you normally do the pre-cleaning step in
emacs ( marking lines by their type, pasting long ones; there is no
substitute for eyeballing for that ), then use a couple of scripts to
separate the data types ( orgs and persons ) into distinct files and clean
extraneous quotes, commas, etc. ( sed, pure shell, awk, any language can
do this ). At this stage you have something like a "clean CSV dump"
( although I would recommend the old text format of postgres dumps:
newline terminated, tab delimited, backslash escaped; it is much, much
easier to work with ).
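For instance, loading that kind of dump into a scratch table is nearly a
one-liner. A minimal sketch, with invented file and column names, so adapt
to your data:

    -- Everything as text; the cleanup happens later, in SQL.
    CREATE TEMP TABLE raw_load (
        org_name    text,
        org_addr    text,
        person_name text,
        person_mail text
    );

    -- COPY's default text format is exactly that old dump format:
    -- tab delimited, newline terminated, backslash escaped.
    \copy raw_load FROM 'clean_dump.txt'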
Then you import this into a couple of temporary tables and use sql to do
what it is good at: adding a sequential column and populating it with
unique ids, populating foreign key columns by joining on a collection of
fields, and slicing the useful columns out to their final destination.

And, if you hit a snag, you ask. But ask with a good description of your
problem. You have extra knowledge which you are trickling down to us, and
that leads to huge frustration. We want to help, but you are not letting
us, probably not on purpose, by posting an incomplete description of your
problem and refusing to answer what we think are simple and basic
questions ( I assume this is due to "impedance mismatch", not bad faith;
otherwise I would just have blacklisted you several messages ago ). From
what I know of your problem so far ( a thousand lines, in csv, two tables,
with some way to match one with the other ), my money says it could have
been solved in much less time than this message is taking, given the right
info in the first place.

> For me, the quickest and simplest approach is to add the PKs to each
> table, and the org_id into the people table, when I separate the cleaned
> text file into the columns for each table.

That I do a lot, for small files. Being very fluent in perl, which excels
at text file crunching, I normally switch to one-liners as soon as I hit a
couple of screens of size.

The problem is you stated a problem, we tried to help you, but it seemed
like you did not want to be helped. We still know next to nothing about
your problem. I.e., is there more than one person per organization? ( I
assume so, as you are talking of an FK from persons to org. ) So, given
two persons belonging to the same organization, how do you know that when
looking at the excel/csv? The answer to this question is crucial. It may
be something like "the organization data is duplicated in every row", in
which case the solution to your problem is trivial. Or "the first person
has organization data in its row, and the ones below it do not", which is
trivially mapped to your previous case using a window query ( see the
sketch after my signature ). Or "organizations with one person have all
the data in one row, organizations with more have one row with only the
organization and one more row per person below, without organization
data", which can be done too with a little more effort, especially if you
can do a couple of grep/awk passes + some emacs editing + some simple sql.

Anyway, my final recommendation after all these mails: do it in emacs.

Regards
Francisco Olarte.
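P.S. In case it helps, here is a minimal sketch of that window-query trick
for the "first row carries the organization, the rows below it do not"
case. All table and column names are invented; adapt them to your data:

    -- Staging table, loaded in file order; org_name is null on the
    -- continuation rows that only carry person data.
    CREATE TEMP TABLE staging (
        line_no     int,    -- original row number, to keep file order
        org_name    text,
        person_name text
    );

    -- count() skips nulls, so a running count of org_name tags every
    -- row with the number of organizations seen so far; max() within
    -- each of those groups then fills the name down onto the persons.
    SELECT s.person_name,
           max(s.org_name) OVER (PARTITION BY s.grp) AS org_name
      FROM ( SELECT *,
                    count(org_name) OVER (ORDER BY line_no) AS grp
               FROM staging ) s
     WHERE s.person_name IS NOT NULL;

The same query also handles the layout with org-only header rows: the
WHERE simply drops those rows after they have been used for the
fill-down. Once the organization is on every person row, you are back in
the trivial duplicated-in-every-row case.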