Re: Creating composite keys from csv

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Mon, 09 Mar 2015 06:36:43 -0700

On 03/08/2015 08:49 PM, Eli Murray wrote:
Hi all,

I'm a student journalist working on a project for our student paper
which lists salaries and positions for every staff member at the
university. We received the data from an FOI request but the university
is refusing to give us primary keys for the data.

The issue we've run into is that if there are two staff members with the
same name (and there are) our current web app adds their salaries
together and considers them one person. Now, luckily, we can create a
composite key if we combine their name column with their salary column.
Unfortunately, the format of the data we have makes it more difficult
than that (of course!) because some employees can hold multiple paying
positions.

Here's some example data:

Name, Position, Salary,Total Salary, ...
Jane Doe, Dean, 100.000, 148.000, ...
John Locke, Custodian, 30.000, 30.000, ...
Jane Doe, Academic Adviser, 48.000, 148.000, ...
Jane Doe, Trainer, 46.000, 46.000, ...

Basically, what we'd like to do is create a serial primary key but
instead of having it increment every row, it needs to check the name and
total salary columns and only increment if that person doesn't already
exist. If they do exist, it should just assign the previously created
number to the column.

Well the above is not going to work, because the id would not be unique 
across rows and therefore could not be a primary key. If I am following 
what you want is a staff id that identifies a particular staff member 
across rows and is derived from the (Name, Total Salary) combination, is 
that correct? If so you could use a serial column to generate a 
surrogate primary key for each row without worrying about the names and 
total salary. Then it becomes an issue of generating the staff id for 
unique staff members. For that I would see John McKowns answer.

 However, our team is small and between us we have
very little experience working with databases and we haven't found a way
to accomplish this goal yet. In fact, we may be trying to solve this in
the wrong way entirely.

So, to put it succinctly, how would you approach this problem? What are
our options? Do we need to write a script to clean the data into
separate csv tables before we import it to postgres, or is this
something we can do in postgres? We'd really appreciate any help you all
may be able to offer.

Best!
Eli Murray

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general