> Back about May 10, I presented a difficulty I was having trying to > add in more (distinct) records from one > MySQL table into another that permanently holds the most important > data that I have. Despite help from > a couple of you in the forum and numerous efforts by me since, > success on my efforts have been denied > <big sigh!>. > > First, some review and introduction: > This is for a knowledge base on all woods around the world. The more > woods can be reported, > the more successful is the project. I have named the project 'TAXA" > (..short for taxonomy). Many of the > other features I have wanted to add have been coming on quite well > but this task is still evading any success. > If you care to directly view what this is about, take a look at > http://www.prowebcanada.com/taxa/. > In fact, you can escape the frames menu and go right to where species > only are displayed: > http://www.prowebcanada.com/taxa/alphaspecies.php. > There are 4 major entry points to all the data but you have to be > concerned with the 'species' level. > > The data there is stored in a table named 'species' (ie. - > species.sql). All tables in the project are in database 'taxa'. The > more (botanical) names of woods that I can find and add in, the more > successful is the whole project. So far I am reporting just over > 6,200 different woods (as I say, all in species.sql). > > The Thorny Problem > Presently I use all kinds of sources to find new woods. The largest > list of woods I have is in a sql file > called 'tervuren', with 11,337 wood names. That is WAAYY larger than > what is in species.sql so there > is a huge interest in copying over these records to species.sql It > would be a huge boost that even with > duplicates avoided, that would almost double the number of woods I > could report! > > That is easy said but I have almost being tearing my hair out (... > and I am almost bald anyway ;-) ) > trying to accomplish this. To prepare for this: > > - I protected the integrity of my original files > (species.sql and tervuren.sql) by copying them > as working files species_master.sql and > tervuren_target.sql. It is with these that I will > (somehow) copy all wood species names that are NOT in the > species_master table that > are, however, in the tervuren_target file. (Later after > species_master is expanded, I can > rename it back again to 'species'). > > - The key column for comparison in both files is called > 'species_name' in both files. > > - I have already treated the tervuren_target table so that > there are no duplicate records > within itself with respect to column species_name. > > - There are a couple columns I do not have to carry over to > the species_master table from > the tervuren_target table. For the columns that do have > to, I made sure > they are named exactly the same in both files and that > the collation is the same in both. > > - The field names common to both the source for my data (ie. > tervuren_target) and for > the destination for the new data (species_master) are: > > genus_name, species_name, authorities_species, source > > - The species_master table has about 18 total fields but > except for the ones above, none of the > other fields I have not listed will receive any new data > whatsoever. They are (so to speak) just > along for the ... ride. All new data will go into the 4 > fields listed above exclusively. > > - In any attempt to avoid new duplicates forming as data > comes over into species_master, > at NO TIME WHATSOEVER can we allow any records in > species_master to be deleted! > The reason is because meaningful data in the other fields > in species_master would be > lost and data integrity would be violated. > > Just adding ALL records from tervuren_ target was easy but it sure > did not give me the result i need. Both > the tervuren_target table and the species_master table have wood > names in them that are common to each. > As copying records over is happening, great care has to be taken to > avoid NEW duplicates forming via > the combining of records from both tables. In fact, just merging all > files from tervuren_target over into > species_master created over 3,000 NEW DUPLICATES! The result was an > unusable mess! > > ===> .... and that is where all the problem is that has stopped me > from being successful! > > I am hoping that an SQL statement would do all the work fine .... but > if that will not work, I am also open to > adding any PHP lines for more versatility if needed. > > I have made a few column name changes since approaching the group > back around May 10 but other than that, > I tried out a script tonight suggested to me back then: > > SELECT DISTINCT tervuren_target.genus_name, > tervuren_target.species_name, tervuren_target.authorities_species, > tervuren_target.source > FROM tervuren_target > LEFT JOIN species_master > ON tervuren_target.species_name = species_master.species_name > LIMIT 0,12000 > > I had to add in the LIMIT statement or for some reason the server > would time out every time. The script ran ok > but when I looked at the number of records it created (11,377) that > was EXACTLY how many records > there are in the original Tervuren file! In other words, there was no > success in avoiding new duplicates at all > and the data did NOT actually merge into species_master <big sigh!>. > I also predicted that the DISTINCT > statement would only work on the tervuren_target table (which was > already cleaned of duplicates within itself) > and I believe I was right. it seems to be extraneous and not the > needed answer. > > .... I had never thought it should be that hard just to add in new > data to a table of central important without > introducing new duplicate records --- but it is. > > Therefore I come in humble request to all of you, hoping someone can > figure out what actually will work. > The solution is harder than may first seem. My email address if you > need it is billmudry at rogers.com > > With thanks in advance, > > Fingers crossed, toes crossed, eyeballs crossed ....... > > Bill Mudry > Mississauga, Ontario Canada. This would be much better taken to the DB list, where the entire audience is knowledgable about databases, so answers are more likely ;-) First, what database engine are you using? All engines have specific idiosyncrasies that need to be accounted for, and knowing the engine will allow more specific answers to your problems. -- Niel Archer -- PHP Windows Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php