> -----Original Message-----
> From: Jon Westcot [mailto:jon@xxxxxxxxxxx]
> Sent: Saturday, November 24, 2007 4:32 AM
> To: PHP General
> Subject: Performance question for table updating
>
> Hi all:
>
>     For those who've been following the saga, I'm working on an
> application that needs to load a data file of approximately 29,000 to
> 35,000 records (and not short ones, either) into several tables. I'm
> using MySQL as the database.
>
>     I've noticed a really horrible performance difference between
> INSERTing rows into the table and UPDATEing rows with new data when
> they already exist in the table. For example, when I first start with
> an empty table, the application inserts around 29,600 records in
> something less than 6 minutes. But when I use the second file,
> updating that same table takes over 90 minutes.
>
>     Here's my question: I had assumed -- probably wrongly -- that it
> would be far more expedient to update only the rows where data had
> actually changed and, moreover, to update only the changed fields in
> those rows. This involves a large number of if statements, i.e.,
>
>     if ($old_row["field_a"] !== $new_row["field_66"]) {
>         $update_query .= "field_a = '" .
>             mysql_real_escape_string($new_row["field_66"]) . "',";
>     }
>
>     Eventually, I wind up with a query similar to:
>
>     UPDATE table_01 SET field_a = 'New value here',
>         updated = CURDATE() WHERE primary_key = 12345
>
>     I thought that, to keep the table updating to a minimum, this
> approach made the most sense. However, seeing the two hugely different
> run times has made me question whether it would be faster to simply
> update every field in the table and eliminate all of these test
> conditions.
>
>     And before someone comments that indexes on the table can cause
> performance hits: I DROP nearly all of the indexes at the start of the
> processing, keeping only those needed for the original INSERT or the
> subsequent UPDATE, and then add all of the extra "steroid" indexes
> (you know -- the performance-enhancing ones <g>) after all of the
> INSERTs and UPDATEs have finished.
>
>     So, long story short (oops -- too late!), what's the consensus
> among the learned assembly here? Is it faster to just UPDATE the
> record if it already exists, even though maybe only one or two out of
> 75 or more fields changed, versus testing each of those 75 fields to
> figure out which ones actually changed and then updating only those?
>
>     I look forward to reading all of your thoughts.
>
>     Sincerely,
>
>         Jon

I don't know about consensus over here, because I'm kind of a newgie
(stands for "new geek", as opposed to newbie, which stands for "new
ball breaker" :D :D ). I don't know your previous messages, but I can
tell you one story...

Some time ago I got involved in a project that required geo-distance
calculation (you know, the distance between two points given their
latitude and longitude). Basically I had to take a set of points and
calculate the distance from each of those points to a given
(reference) one. The math was something like "the square root of the
sum of a constant times the squared sine of..." -- well, I can't
remember it, but the point is that it was a complicated formula, which
I thought would allow for some optimizations in PHP. Accustomed to
regular (compiled) programming languages, I developed a set of
routines to optimize the task, then went ahead and queried the
database for the (say, 1,000-record) set of points.
I applied the math to the points and the reference point and got the
result... in about 5 minutes, to my unpleasant surprise. So I grabbed
the MySQL manual, built a "non-optimized" version of the formula to
put directly into the SQL query, and let MySQL calculate the "shortest
distance" (which was my goal in the end) itself. I thought, "OK, I'll
go make a cup of coffee while MySQL finishes the calculation." To my
surprise, the query returned the expected result in less than 2
seconds.

My logic had (wrongly) been the following: PHP is a programming
language, SQL is a data access language; I'll get the data using MySQL
and do the math using PHP. But I forgot that PHP is an interpreted
language, and that a number is more than a number to PHP -- it's a
ZVAL_<whatever> structure behind the scenes. I forgot about the memory
and the time required to build those values when you pull data out of
a database server. I forgot about parsing time, and about the "support
logic and safety checks" in the language that swamp any attempt to
build TDCPL (Too Damn Complex Programming Logic) in PHP. So now, when
I have to apply some logic to retrieved data, I first check how much
of it I can push into the query itself, so that little or no
programming logic is left in PHP after retrieving (or before storing)
the data. There's a rough sketch of what I mean at the end of this
message.

All that said, I'd give the MySQL REPLACE statement a shot (I wouldn't
even branch the code between INSERT and UPDATE depending on whether
the record already exists; if you have a primary key, all you need is
REPLACE). But PLEASE LOOK AT THE GOTCHAS (like SET col_name =
col_name + 1). There's a sketch of that at the end, too.

Furthermore, if those data files were to be uploaded by me (I mean me,
the coder, not the end user), I'd build (or use) a program to convert
them to SQL statements on my desktop PC, where I can use faster
programming languages and can afford five minutes of heavy processing
(instead of hammering the server for five minutes and slowing down
every other service on it). The last sketch below shows that idea.

In the end it depends on your requirements, on where you get the data
from, and on whether and how you want to automate the task. (I didn't
get your previous messages -- I only subscribed recently -- so if you
can send me a link to them, great!)

Rob
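
P.S. Here's roughly what I mean by pushing the math into the query.
This is a sketch from memory, not the code from that project: the
connection details, the points table with its lat/lng columns, and the
reference coordinates are all invented, and the formula here is the
haversine formula, which may or may not be the one that project
actually used. The point is that MySQL does the trig for every row, so
PHP never loops over the result set doing math:

<?php
// Sketch only: hypothetical connection details, table and column names.
$link = mysql_connect('localhost', 'user', 'secret');
mysql_select_db('geo', $link);

$refLat = 40.7128;     // made-up reference point
$refLng = -74.0060;

// Haversine distance in kilometres, computed entirely by MySQL.
$sql = sprintf(
    "SELECT id,
            2 * 6371 * ASIN(SQRT(
                POW(SIN(RADIANS(lat - %f) / 2), 2) +
                COS(RADIANS(%f)) * COS(RADIANS(lat)) *
                POW(SIN(RADIANS(lng - %f) / 2), 2)
            )) AS distance_km
       FROM points
   ORDER BY distance_km
      LIMIT 1",
    $refLat, $refLat, $refLng
);

$row = mysql_fetch_assoc(mysql_query($sql, $link));
echo "Nearest point: {$row['id']}, about {$row['distance_km']} km away\n";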
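
And here's the kind of thing I mean by REPLACE, using the table and
column names from your own example (table_01, primary_key, field_a,
updated); the connection details and the sample record are made up:

<?php
// Sketch only: hypothetical connection details and sample record.
// REPLACE needs the PRIMARY KEY (or a UNIQUE index) to notice that a row
// already exists; it then deletes the old row and inserts the new one,
// so list every column whose value you want to keep.
mysql_connect('localhost', 'user', 'secret');
mysql_select_db('mydb');

$new_row = array('primary_key' => 12345, 'field_66' => 'New value here');

$sql = sprintf(
    "REPLACE INTO table_01 (primary_key, field_a, updated)
     VALUES (%d, '%s', CURDATE())",
    (int) $new_row['primary_key'],
    mysql_real_escape_string($new_row['field_66'])
);
mysql_query($sql) or die(mysql_error());

The gotcha I was thinking of: because REPLACE really is a DELETE
followed by an INSERT, you can't refer to the row's old values the way
SET col_name = col_name + 1 does -- that idiom belongs to
INSERT ... ON DUPLICATE KEY UPDATE -- and any column you leave out of
the REPLACE reverts to its default instead of keeping what was there.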
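
Finally, the "convert the file on my own PC" part, again just a
sketch: the file name, the CSV layout, and the column order are all
assumptions about data I haven't seen:

<?php
// Sketch only: turn the data file into a load.sql full of REPLACE
// statements, to be fed to the server in one go later.
$in  = fopen('datafile.csv', 'r');
$out = fopen('load.sql', 'w');

while (($fields = fgetcsv($in)) !== false) {
    $fields = array_map('addslashes', $fields);   // offline, so no server round-trip for escaping
    fwrite($out, sprintf(
        "REPLACE INTO table_01 (primary_key, field_a, updated) VALUES (%d, '%s', CURDATE());\n",
        (int) $fields[0],
        $fields[1]
    ));
}

fclose($in);
fclose($out);

Then, on the server, something like: mysql your_database < load.sql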