I thank Tyrael and Neil for their prior efforts a few days ago.
Unfortunately their
suggestions did not work. I have searched the web and tried many
different combinations
but I am surprised that it is that hard to add from one table of
about 11,300 records into
a master file I use with about 6,550 relevant records while keeping
duplicates in the larger
file to what is already in the smaller one stopped.
This has made me think that first, perhaps I have to get back to
understanding some of the
basics of how various SQL commands and statements work and also how they do NOT
work.
Please help me confirm as true what I will say below or else correct
my understanding.
ISSUE 1
DISTINCT - This is commonly used to get rid of duplicates in ONE
specific file. All records
being found to be duplicate is always in the ONE table and
duplicates are never
defined as duplicate to another record in another table. I have never
seen any example or
reference in which it can be used to prevent new files coming into a
table to be a
duplicate against all records already in a table. Is this not true?
Correct me if I am wrong.
This massive addition of records to the master table is the scenario
I am trying to do
while not introducing new duplicates common to both. What is in the
table accepting new
records must not have any records erased in an effort to not have
duplicates. It just has too
much additional data worth preserving.
The file to be appended to (called species_master) has the exact 4
same column names
as in the first table with the 11,300 records (which I named
tervuren_target) but with many
more fields that would tag along for the ride during a merge of the
two files.
The tervuren_target file has already been cleaned of itself of
duplicate records. Therefore,
if my comment on using DISTINCT is true ..... and cannot be used to
prevent new
duplicates from forming when records are transferred from one table
to another, then in the
append or merge I am trying for ..... using 'DISTINCT' to stop new
duplicates would be a
waste of time! True or False?
ISSUE 2
When using a JOIN, we usually equate a foreign key in one table to a
primary key of the
table being joined to. However, in my need, I am not talking about a
relational connection
between the two tables. I had to first merge the genus and epithet
columns together in the
tervuren_target table to create one-column species names in column
species_name.
Once that was done, all the columns I want to copy over into the
master species file
were deliberately made the same as in the species_master file. There
is no hierarchy
between them --- only more of the same kind of data coming in from a
HUGE source.
SO ...... on a Join (or any other efforts to bring over the data
WITHOUT duplicates)
as said above, it is normally written written as table1 = table2. In
this case, the
comparison would be .....
tervuren_target.species_name = species_master.species_name.
BUT wait! That would allow all records with a species_name equal to one in the
master file to come flying in, right?? Wow, that is the very OPPOSITE
OF WHAT I WANT!
Unless there is something I do not know yet ---> that seems like it
would let nothing BUT
duplicates in!! ..... Totally and absolutely not acceptable if that
would happen.
Therefore, with joins, is it only the equal operator that can be
used? I am leaning to think
that the following is not allowed even if it spells out the condition
that I really need:
tervuren_target.species_name <> species_master.species_name.
So please do correct me if my understanding is false or inadequate.
For now, assuming
I am correct, then just how does one specify how to add further
records into a table
from another table with parallel fields while stopping records coming
over that would
create duplicate data (in this case duplicates in the species_name field?
Because of the close similarity of structures of the two tables and
because of data
merging into the same key column to create duplicates (guaranteed
---- over 3,000 of
them!) I am now wondering if using a JOIN can be used, should be used
or is relevant
at all. So I still need to find how to merge over all those tervuren
records without
creating NEW duplicates in the process. First, though, there is one more issue.
ISSUE 3
The script that was suggested to me is:
INSERT INTO species_master (genera, name, species_name, authorities,
species_source)
SELECT DISTINCT genera, name, species_name, authorities, species_source FROM
'tervuren_target'
LEFT JOIN species_master
ON
tervuren_target.species_name = species_master.species_name
WHERE species_master.species_name IS NULL
My next issue is about the last line:
WHERE species_master.species_name IS NULL
I believe I was told this was to help prevent duplicates ---
but I cannot see that. Here is my understanding
why. Perhaps I do not know enough details about how that
WHERE statement is parsed and made to work.
Before the tables are merged, all records in
species_master.species_name have species names in
them. That column just does not have ANY NULL entries! Therefore ----
am I not correct to say that there is
absolutely no conditions to satisfy the WHERE statement? Nothing can happen!
One other thing just came to mind. In all my efforts in this, as I
ran different script attempts, I would frequently
realize I had to state fields as "table.column" only to find that the
system would say it does not know that column.
Strange thought I. Why would it do that? Did I still do something
wrong or incomplete?
One last word before I am told I have been off topic. My entire
project is done in PHP with storage in MySQL.
I was hoping that doing this above merge of data in PHP might be
overkill if SQL can handle it. I am still
quite open to a PHP script solution if that is better or if SQL just
cannot cut it.
So ...... <mild sigh> ...... what really will do the job?
Always thankful for help,
Bill Mudry
Mississauga, Ontario