Back about May 10, I presented a difficulty I was having trying to
add in more (distinct) records from one
MySQL table into another that permanently holds the most important
data that I have. Despite help from
a couple of you in the forum and numerous efforts by me since,
success on my efforts have been denied
<big sigh!>.
First, some review and introduction:
This is for a knowledge base on all woods around the world. The more
woods can be reported,
the more successful is the project. I have named the project 'TAXA"
(..short for taxonomy). Many of the
other features I have wanted to add have been coming on quite well
but this task is still evading any success.
If you care to directly view what this is about, take a look at
http://www.prowebcanada.com/taxa/.
In fact, you can escape the frames menu and go right to where species
only are displayed:
http://www.prowebcanada.com/taxa/alphaspecies.php.
There are 4 major entry points to all the data but you have to be
concerned with the 'species' level.
The data there is stored in a table named 'species' (ie. -
species.sql). All tables in the project are in database 'taxa'. The
more (botanical) names of woods that I can find and add in, the more
successful is the whole project. So far I am reporting just over
6,200 different woods (as I say, all in species.sql).
The Thorny Problem
Presently I use all kinds of sources to find new woods. The largest
list of woods I have is in a sql file
called 'tervuren', with 11,337 wood names. That is WAAYY larger than
what is in species.sql so there
is a huge interest in copying over these records to species.sql It
would be a huge boost that even with
duplicates avoided, that would almost double the number of woods I
could report!
That is easy said but I have almost being tearing my hair out (...
and I am almost bald anyway ;-) )
trying to accomplish this. To prepare for this:
- I protected the integrity of my original files
(species.sql and tervuren.sql) by copying them
as working files species_master.sql and
tervuren_target.sql. It is with these that I will
(somehow) copy all wood species names that are NOT in the
species_master table that
are, however, in the tervuren_target file. (Later after
species_master is expanded, I can
rename it back again to 'species').
- The key column for comparison in both files is called
'species_name' in both files.
- I have already treated the tervuren_target table so that
there are no duplicate records
within itself with respect to column species_name.
- There are a couple columns I do not have to carry over to
the species_master table from
the tervuren_target table. For the columns that do have
to, I made sure
they are named exactly the same in both files and that
the collation is the same in both.
- The field names common to both the source for my data (ie.
tervuren_target) and for
the destination for the new data (species_master) are:
genus_name, species_name, authorities_species, source
- The species_master table has about 18 total fields but
except for the ones above, none of the
other fields I have not listed will receive any new data
whatsoever. They are (so to speak) just
along for the ... ride. All new data will go into the 4
fields listed above exclusively.
- In any attempt to avoid new duplicates forming as data
comes over into species_master,
at NO TIME WHATSOEVER can we allow any records in
species_master to be deleted!
The reason is because meaningful data in the other fields
in species_master would be
lost and data integrity would be violated.
Just adding ALL records from tervuren_ target was easy but it sure
did not give me the result i need. Both
the tervuren_target table and the species_master table have wood
names in them that are common to each.
As copying records over is happening, great care has to be taken to
avoid NEW duplicates forming via
the combining of records from both tables. In fact, just merging all
files from tervuren_target over into
species_master created over 3,000 NEW DUPLICATES! The result was an
unusable mess!
===> .... and that is where all the problem is that has stopped me
from being successful!
I am hoping that an SQL statement would do all the work fine .... but
if that will not work, I am also open to
adding any PHP lines for more versatility if needed.
I have made a few column name changes since approaching the group
back around May 10 but other than that,
I tried out a script tonight suggested to me back then:
SELECT DISTINCT tervuren_target.genus_name,
tervuren_target.species_name, tervuren_target.authorities_species,
tervuren_target.source
FROM tervuren_target
LEFT JOIN species_master
ON tervuren_target.species_name = species_master.species_name
LIMIT 0,12000
I had to add in the LIMIT statement or for some reason the server
would time out every time. The script ran ok
but when I looked at the number of records it created (11,377) that
was EXACTLY how many records
there are in the original Tervuren file! In other words, there was no
success in avoiding new duplicates at all
and the data did NOT actually merge into species_master <big sigh!>.
I also predicted that the DISTINCT
statement would only work on the tervuren_target table (which was
already cleaned of duplicates within itself)
and I believe I was right. it seems to be extraneous and not the
needed answer.
.... I had never thought it should be that hard just to add in new
data to a table of central important without
introducing new duplicate records --- but it is.
Therefore I come in humble request to all of you, hoping someone can
figure out what actually will work.
The solution is harder than may first seem. My email address if you
need it is billmudry at rogers.com
With thanks in advance,
Fingers crossed, toes crossed, eyeballs crossed .......
Bill Mudry
Mississauga, Ontario Canada.