Re: Recursive Arrays 101

Rob Sargent <robjsargent@xxxxxxxxx> · Mon, 26 Oct 2015 14:59:48 -0600



    On 10/26/2015 02:51 PM, David Blomstrom
      wrote:

    
      I'm focusing primarily on vertebrates at the
        moment, which have a total of (I think) about 60,000-70,000 rows
        for all taxons (species, families, etc.). My goal is to create a
        customized database that does a really good job of handling
        vertebrates first, manually adding a few key invertebrates and
        plants as needed.
        

        I couldn't possibly repeat the process with invertebrates
          or plants, which are simply overwhelming. So, if I ever figure
          out the Catalogue of Life's database, then I'm simply going to
          modify its tables so they work with my system. My vertebrates
          database will override their vertebrate rows (except for any
          extra information they have to offer).
        

        As for "hand-entry," I do almost all my work in
          spreadsheets. I spent a day or two copying scientific names
          from the Catalogue of Life into my spreadsheet. Common names
          and slugs (common names in a URL format) is a project that
          will probably take years. I might type a scientific name or
          common name into Google and see where it leads me. If a
          certain scientific name is associated with the common name
          "yellow birch," then its slug becomes yellow-birch. If two or
          more species are called yellow birch, then I enter
          yellow-birch in a different table ("Floaters"), which leads to
          a disambiguation page.
        

        For organisms with two or more popular common names - well,
          I haven't really figured that out yet. I'll probably have to
          make an extra table for additional names. Catalogue of Life
          has common names in its database, but they all have upper case
          first letters - like American Beaver. That works fine for a
          page title but in regular text I need to make beaver lowercase
          without changing American. So I'm just starting from square
          one and recreating all the common names from scratch.
      
    
    Multiple names can be handled in at least two ways.  A child table
    of species which has species id and alternate name per record - then
    you can get all other-names back by species id.  Of course going
    from altername-name back to species may get you multiple species. 
    Or, welcome to postgres' arrays-as-column: you can have one column,
    maybe called aliases which is an array of string. 

    
        It gets still more complicated when you get into
          "specialist names." ;) But the system I've set up so far seems
          to be working pretty nicely.
      
      
        On Mon, Oct 26, 2015 at 1:41 PM, Rob
          Sargent <robjsargent@xxxxxxxxx>
          wrote:

          On 10/26/2015 02:29 PM, David Blomstrom wrote:

              
                Sorry for the late response. I don't have Internet
                access at home, so I only post from the library or a
                WiFi cafe.

                
                Anyway, where do I begin?

                
                Regarding my "usage patterns," I use spreadsheets
                (Apple's Numbers program) to organize data. I then save
                it as a CSV file and import it into a database table. It
                would be very hard to break with that tradition, because
                I don't know of any other way to organize my data.

                
                On the other hand, I have a column (Rank) that
                identifies different taxonomic levels (kingdom, class,
                etc.). So I can easily sort a table into specific
                taxonomic levels and save one level at a time for a
                database table.

                
                There is one problem, though. I can easily put all the
                vertebrate orders and even families into a table. But
                genera might be harder, and species probably won't work;
                there are simply too many. My spreadsheet program is
                almost overwhelmed by fish species alone. The only
                solution would be if I could import Mammals.csv, then
                import Birds.csv, Reptiles.csv, etc. But that might be
                kind of tedious, especially if I have to make multiple
                updates.

                
            Yes I suspect you spreadsheet will be limited in rows, but
            of course you can send all the spreadsheets to a single
            table in the database. If that's what you want.  You don't
            have to, but you see mention of tables millions of records
            routinely.  On the other hand, if performance becomes an
            issue with the single table approach you might want to look
            at "partitioning".  But I would be surprised if you had to
            go there.

            
            What is your data source?  How much hand-entry are you
            doing? There are tools which (seriously) upgrade the basic
            'COPY into <table>' command.
            
              
                  As for "attributes," I'll post my table's schema, with
                  a description, next.

                  
        -- 

        David Blomstrom

          Writer & Web Designer (Mac, M$ & Linux)

          www.geobop.org