Rich:

On Wed, May 22, 2019 at 6:07 PM Rich Shepard <rshepard@xxxxxxxxxxxxxxx> wrote:
> On Wed, 22 May 2019, Francisco Olarte wrote:
> > You are not reading what we write to you. Note YOU AND ONLY YOU are the
> > one speaking of PK. We are speaking of "unique identifier" ( that would
> > be, IIRC, "candidate keys"; you can pick any as your PK, or even
> > introduce a new synthetic one with a sequence, or a femtosecond-exact
> > timestamp, or whatever ).

> The organizations table has org_id (an integer) as PK.
> The people table has person_id (an integer) as PK and org_id as the
> reference to organization.org_id.
> Does this help?

It does. But if, as I assume, you are referring to the FINAL tables where
you have the data inserted, then to suggest a good method we would also
need to know HOW the org_id / person_id are generated ( i.e. sequences, by
hand, externally assigned ).

Also, when I speak of a "unique identifier" I am not speaking of the one
in your FINAL tables; I assume you would have at least the *_id field as
PKEY there, so nothing else is needed. I mean the one in your SOURCE data
set ( it can be anything, like the row number in the original excel ). If
you read the examples I sent you, I had "unique identifiers" for
organizations in both of them: in the first one it was the organization
name (clavelitos/bits cuadrados), in the other one it was "the previously
non-indented line". Both can be processed with a little awk/perl + sql, or
with a lot of sql ( i.e., the indented one can be loaded into a big temp
table with a lot of text fields, or a single one, and then split with sql
functions, but perl/awk is much easier ).

> > When you are fluent in SQL you do not try to play with files, you
> > import every column of your data into temporary tables, clean them up,
> > and join ( if needed ) them until you have a select that gives you what
> > you want and then insert this. Normally you insert several SELECTs into
> > temporary tables ( especially when you only have thousands of records )
> > so you can do the clean up in steps.

> Most of my time is spent writing using LaTeX/LyX. Depending on the
> project's needs I'll also use SQL, R, GRASS, and other tools. I'm a
> generalist, like your PCP, not a specialist. But, I also rely on emacs,
> grep, sed, and awk for data munging and am more fluent with these tools
> than I am with SQL or Python.

MMM, apart from angel dust I do not know what PCP could stand for. But
anyway, for simple one-liners and data filtering, sed/awk are more than
enough ( I substitute them with perl because it can do the same in a
slightly more verbose way, and I had already learned it when confronted
with awk/sed ) ( I do not find Python particularly palatable for one-shot
text processing, YMMV ).

If you are not too fluent in SQL, you normally do the pre-cleaning step in
emacs ( marking lines by their type, pasting long ones; there is no
substitute for eyeballing for that ), then use a couple of scripts to
separate the data types ( orgs and persons ) into distinct files and clean
extraneous quotes, commas, etc. ( sed, pure shell, awk, any language can
do this ). At this stage you have something like a "clean CSV dump"
( although I would recommend the old text format of postgres dumps:
newline terminated, tab delimited, backslash escaped; it is much, much
easier to work with ).
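For instance, loading that kind of dump into a scratch table is nearly a
one-liner. A minimal sketch, with invented file and column names, so adapt
to your data:

    -- Everything as text; the cleanup happens later, in SQL.
    CREATE TEMP TABLE raw_load (
        org_name    text,
        org_addr    text,
        person_name text,
        person_mail text
    );

    -- COPY's default text format is exactly that old dump format:
    -- tab delimited, newline terminated, backslash escaped.
    \copy raw_load FROM 'clean_dump.txt'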
Then you import this into a couple of temporary tables and use sql to do
what it is good at: adding a sequential column and populating it with
unique ids, populating foreign key columns by joining on a collection of
fields, and slicing the useful columns out to their final destination.

And, if you hit a snag, you ask. But ask with a good description of your
problem. You have extra knowledge which you are trickling down to us, and
that leads to huge frustration. We want to help, but you are not letting
us, probably not on purpose, by posting an incomplete description of your
problem and refusing to answer what we think are simple and basic
questions ( I assume this is due to "impedance mismatch", not bad faith;
otherwise I would just have blacklisted you several messages ago ). From
what I know of your problem so far ( a thousand lines, in csv, two tables,
with some way to match one with the other ), my money says it could have
been solved in much less time than this message is taking, given the right
info in the first place.

> For me, the quickest and simplest approach is to add the PKs to each
> table, and the org_id into the people table, when I separate the cleaned
> text file into the columns for each table.

That I do a lot, for small files. Being very fluent in perl, which excels
at text file crunching, I normally switch to one-liners as soon as I hit a
couple of screens of size.

The problem is you stated a problem, we tried to help you, but it seemed
like you did not want to be helped. We still know next to nothing about
your problem. I.e., is there more than one person per organization? ( I
assume so, as you are talking of an FK from persons to org. ) So, given
two persons belonging to the same organization, how do you know that when
looking at the excel/csv? The answer to this question is crucial. It may
be something like "the organization data is duplicated in every row", in
which case the solution to your problem is trivial. Or "the first person
has organization data in its row, and the ones below it do not", which is
trivially mapped to your previous case using a window query ( see the
sketch after my signature ). Or "organizations with one person have all
the data in one row, organizations with more have one row with only the
organization and one more row per person below, without organization
data", which can be done too with a little more effort, especially if you
can do a couple of grep/awk passes + some emacs editing + some simple sql.

Anyway, my final recommendation after all these mails: do it in emacs.

Regards
Francisco Olarte.
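P.S. In case it helps, here is a minimal sketch of that window-query trick
for the "first row carries the organization, the rows below it do not"
case. All table and column names are invented; adapt them to your data:

    -- Staging table, loaded in file order; org_name is null on the
    -- continuation rows that only carry person data.
    CREATE TEMP TABLE staging (
        line_no     int,    -- original row number, to keep file order
        org_name    text,
        person_name text
    );

    -- count() skips nulls, so a running count of org_name tags every
    -- row with the number of organizations seen so far; max() within
    -- each of those groups then fills the name down onto the persons.
    SELECT s.person_name,
           max(s.org_name) OVER (PARTITION BY s.grp) AS org_name
      FROM ( SELECT *,
                    count(org_name) OVER (ORDER BY line_no) AS grp
               FROM staging ) s
     WHERE s.person_name IS NOT NULL;

The same query also handles the layout with org-only header rows: the
WHERE simply drops those rows after they have been used for the
fill-down. Once the organization is on every person row, you are back in
the trivial duplicated-in-every-row case.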