On 06/14/2014 08:52 AM, Damir Dezeljin wrote:
Hello.
This is more of a theoretical, or better said, conceptual question; still,
I hope to get some feedback from you folks. Additionally, this is going
to be a very long post :) Off-topic: I asked a similar question on the
MySQL forum, as I'm still undecided between PostgreSQL and MySQL <<
I'm tempted by PostGIS.
I am designing a database for storing various biological and ecological
data. Although there is no clear dividing line, it is possible to group
the data into two groups, namely measured (physical) and
quantitative (mostly biological) data; I uploaded both a data sample and
an initial draft of a DB model to this link
<https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.
From the mentioned sample, the following differences between the two are
evident:
*Biological / quantitative data*
* The data are actually numbers of occurrences of a specific type of
  item, namely animal and plant species. The counting is done by
  following a predefined method, e.g. number of samples per 100 m^2.
* A sampling consists of counting multiple species on a single day, at a
  predefined location, following a predefined method. Please note
  the counting may repeat multiple times for a single species using the
  same or a different method.
* A typical number of different species counted per sampling is
something between 15 and 100.
* Data are mostly quantitative, which means they consist mostly of
  integer numbers; however, this does not apply to all cases.
*Measured / physical data*
* These data comprise, e.g., a set of measured physical quantities
  such as temperature, salinity, DI, etc. (usually up to 15 or 20
  quantities). These measurements are performed on samples of water
  taken from different depths at a predefined location on a predefined
  date and time. Although the samples of water from different depths at
  a single location are taken a couple of minutes apart from one
  another, it would help to track them as a single profile, which
  basically consists of the data of analyzed samples from a single
  location at a specific time.
I would agree with consolidating these as a single sample run.
* Most data are decimal numbers of a certain precision - e.g. if the
  instrument provides information accurate to the first decimal place,
  it has to be stored with precision up to the first decimal place.
  In contrast, the salinity from the mentioned example available at the
  link above is measured accurately to the third decimal place, so it
  makes sense to store it, and make it possible to retrieve the number,
  accurate to the third decimal place.
I was also considering storing depth as a NUMERIC to avoid
inexactness when dealing with REAL or DOUBLE -> from MySQL I have a
concern that two FLOATs (REAL in PostgreSQL) both being 3.4 can't be
compared in a query like value1 = value2 -> e.g. "... WHERE depth = 3.4;".
Am I missing something, or is there a better solution for addressing
such cases?
If you care about precision use NUMERIC, period. As to the scale (# of
decimal places), that is a little more complicated. The easy solution
would be to use the maximum scale you would need across all the data
values. The issue then becomes the following (using your NUMERIC values):
test=> create table numeric_test (num_fld numeric(9,4));
CREATE TABLE
test=> INSERT INTO numeric_test VALUES (15.6);
INSERT 0 1
test=> SELECT * from numeric_test ;
 num_fld
---------
 15.6000
(1 row)
This is ambiguous unless you know what the capabilities of the sampling
method are. So either you need to constrain the scale when you set up
the fields for each sample type, or, as you show (I think), provide extra
information to make that determination later.
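For instance, a minimal sketch of that second approach, assuming a
per-type lookup table (all names here are hypothetical, and sample_data
is the layout I suggest further down):

-- Hypothetical lookup table recording the scale each method can resolve.
CREATE TABLE quantity_type (
    type_code   varchar PRIMARY KEY,  -- e.g. 'DO', 'salinity'
    value_scale integer NOT NULL      -- meaningful decimal places
);

INSERT INTO quantity_type VALUES ('DO', 1), ('salinity', 3);

-- Store everything at the maximum scale, trim to the meaningful scale
-- on output:
SELECT s.sample_type,
       round(s.sample_value, t.value_scale) AS display_value
FROM sample_data s
JOIN quantity_type t ON t.type_code = s.sample_type;

That keeps a single NUMERIC column while still letting you reconstruct
the precision the instrument actually delivered.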
*General notes*
* Physical quantities may be outside the detection range of the
  measuring instrument; in such a case, this needs to be recorded. I
  still do not have a clear idea how to do it. NULLs do not seem to
  be a good choice for marking such data.
As an aside, is doing a dilution series not possible?
This is sort of tricky. On the one hand you really don't know what the
actual value is; on the other you know it is at or above (leaving out
approaching 0 for now) some number, so it is useful information. You
could do what you show, i.e. include an is_out_of_range field. Or you
could include the detection range information in the same table that
records the scale of the sampling methods.
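Along the lines of the second idea, a sketch extending the hypothetical
quantity_type table from above:

-- Hypothetical detection limits, kept next to the per-type scale.
ALTER TABLE quantity_type
    ADD COLUMN detect_min numeric,
    ADD COLUMN detect_max numeric;

-- Illustrative limits only, not real instrument specs.
UPDATE quantity_type
SET detect_min = 0.1, detect_max = 20.0
WHERE type_code = 'DO';

-- An out-of-range reading can then be derived by comparison rather
-- than stored as a separate boolean:
SELECT s.*,
       (s.sample_value <= t.detect_min OR s.sample_value >= t.detect_max)
           AS out_of_range
FROM sample_data s
JOIN quantity_type t ON t.type_code = s.sample_type;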
* Different quantities are measured with different precision - e.g.
  counted quantities don't have decimal places; some instruments
  report data with 1 decimal digit of precision, others with 2, etc.
See above.
* The only quantity that is always present in all recorded data is the
  depth at which the sample was taken.
* I use a RESTful interface as a middle layer between the DB and the GUI.
*Finally, here is my dilemma*
I am somewhat undecided about the best way to implement the database
and, consequently, what kind of queries to use. The database model I am
currently working on can be found at the above link. Looking at the
diagram, it becomes evident I am deciding whether to store every
measurement / determinant / depth triple as a separate record. The
biggest dilemma I have is that a query for a simple sample of pressure,
temperature, salinity and oxygen would imply multiple joins. As far as I
know, this will badly affect performance; as well, it will complicate
coding the RESTful interface.
There is a lot going on here and you will end up with joins whichever
way you do this. The usual way of dealing with this is to use VIEWs,
where the data lies in individual tables and you use a VIEW to
consolidate the data for reporting/query purposes. To an external
interface it looks like a single table. From the quick look I have done
so far, I would tend to keep each sample as an individual record along
the lines of:
id  sampling_id  sample_depth  sample_type  sample_value  sample_timestamp
1   1            10            DO           8.6           2014-06-14 10:46
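As DDL that might be a rough sketch like this (names and types are my
guesses, adjust to your model):

CREATE TABLE sample_data (
    id               serial PRIMARY KEY,
    sampling_id      integer NOT NULL,   -- the sampling event/profile
    sample_depth     numeric NOT NULL,
    sample_type      varchar NOT NULL,   -- e.g. 'DO', 'salinity'
    sample_value     numeric,
    sample_timestamp timestamp NOT NULL
);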
Then it is a matter of slicing and dicing as you need:
SELECT * FROM sample_data WHERE sampling_id = 1 ORDER BY sample_depth,
sample_type;
SELECT * FROM sample_data WHERE sampling_id = 1 ORDER BY sample_type,
sample_depth;
SELECT * FROM sample_data WHERE sample_type = 'DO' ORDER BY
sample_depth, sample_timestamp;
etc.
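And when the interface wants a classic one-row-per-depth profile, a
VIEW can pivot the rows back together. A sketch, assuming the
hypothetical sample_data layout above ('temp', 'salinity', 'DO' are
assumed type codes):

CREATE VIEW sample_profile AS
SELECT sampling_id,
       sample_depth,
       max(CASE WHEN sample_type = 'temp'     THEN sample_value END) AS temperature,
       max(CASE WHEN sample_type = 'salinity' THEN sample_value END) AS salinity,
       max(CASE WHEN sample_type = 'DO'       THEN sample_value END) AS dissolved_oxygen
FROM sample_data
GROUP BY sampling_id, sample_depth;

-- The RESTful layer then sees one table-like object:
-- SELECT * FROM sample_profile WHERE sampling_id = 1 ORDER BY sample_depth;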
The other option I considered, and have not yet discarded, is adapting
tables to specific needs. In that case, storing data from a CTD
(Conductivity / Temperature / Depth) probe would result in a table row
containing: depth, conductivity, salinity, temperature. Such an
approach rather makes sense; however, in that case I'll end up with
tons of tables that sometime in the future may need to be extended with
additional columns.
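For concreteness, a sketch of what that per-probe alternative might
look like (hypothetical names):

CREATE TABLE ctd_profile (
    id           serial PRIMARY KEY,
    sampling_id  integer NOT NULL,
    depth        numeric NOT NULL,
    conductivity numeric,
    salinity     numeric,
    temperature  numeric
);

One column per quantity reads naturally, but every new instrument or
quantity means another table or an ALTER TABLE.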
I would appreciate any advice and hints.
Thanks and best regards,
Damir
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx