On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote:
> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the data provider.
> This requires a pretty complex import program, because the data must be
> matched semantically, not literally. Even with all of the expression
> matching and fuzzy logic in the code, our performance statistics show
> that the program spends over 75% of its time in SQL queries looking for
> matching and/or duplicate data.

My experience with that type of load process is that doing this
row-by-row is a very expensive approach, and your results bear that out.

It is often better to write each step as an SQL statement that operates
on a set of rows at one time. The lookup operations then become merge
joins rather than individual SQL SELECTs via an index, which makes the
lookup phase more efficient through bulk optimisations and completely
avoids the per-row program/server call traffic. Data can move from step
to step by using INSERT ... SELECT into temporary tables, as Jim has
already suggested. (Rough sketches of both ideas follow at the end of
this mail.)

The SQL set approach is different to the idea of simply moving the code
server-side by dropping it into a function. That helps with the network
traffic, but has other issues as well. You don't need to use esoteric
in-memory thingies if you use the more efficient join types already
available when you do set-based operations (i.e. join all rows at once
in one big SQL statement).

You can also improve performance by ordering your checks so that the
ones most likely to fail happen first (also sketched below).

Trying to achieve a high level of data quality in one large project is
not often possible. Focus on the most critical areas of checking and get
that working first with acceptable performance, then layer on additional
checks while tuning. The complexity of the load programs you have also
means they are susceptible to introducing data quality problems rather
than removing them, so an incremental approach will also aid debugging
of the load suite. Dynamic SQL programs are particularly susceptible to
this kind of bug because you can't eyeball the code.
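For illustration, the set-based match step might look roughly like this.
All table and column names here are invented, since I don't know your
schema; I'm assuming one staging table per provider and a target table
you are matching against:

  -- One pass over the whole staging table: the planner can pick a
  -- merge or hash join instead of millions of individual index probes.
  -- staging_provider1, contacts and the match keys are all hypothetical.
  CREATE TEMPORARY TABLE matched AS
  SELECT s.staging_id,
         c.contact_id
  FROM   staging_provider1 s
  JOIN   contacts c
    ON   lower(trim(s.surname)) = lower(trim(c.surname))
   AND   s.postcode = c.postcode;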
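Feeding the next step from that result, rather than pulling rows back to
the client program, might then look like this (again, names made up):

  -- Rows with no match continue to the insert/dedup stage; rows that
  -- matched would go to an update stage. No data leaves the server.
  CREATE TEMPORARY TABLE unmatched AS
  SELECT s.*
  FROM   staging_provider1 s
  LEFT JOIN matched m ON m.staging_id = s.staging_id
  WHERE  m.staging_id IS NULL;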
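On ordering the checks: in a set-based pipeline that means running the
cheap, highly selective validations as their own early pass, so the
expensive fuzzy-match steps only ever see rows that survived them. A
hypothetical example:

  -- A cheap format check as a separate stage; only survivors reach the
  -- costly similarity matching. The column and regex are invented.
  CREATE TEMPORARY TABLE plausible AS
  SELECT *
  FROM   unmatched
  WHERE  postcode ~ '^[0-9]{5}$';

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com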