Re: looking for a faster way to do that

hamann.w@xxxxxxxxxxx · 26 Sep 2011 16:18:05 -0000

Alban Hertroys <haramrae@xxxxxxxxx> wrote:

>> > Hi,
>> >=20
>> > the strings are not really known before.
>> > Let me explain the scenario; there is one table about products, and =
>> code is the
>> > manufacturer's (or resellers') product id.
>> > So, if ABC were a maker of laptops, ABC123 and ABC456 might be two =
>> different machines,
>> > and ABC123G might have G3 mobile installed, or ABC123X might be the =
>> same thing
>> > with extra memory. Obviously these device variants all look the same.
>> > Now reseller sends us a collection of product images, so there would =
>> be ABC123.jpg
>> > and ABC456.jpg
>> > The database task at hand is matching products to images (and then =
>> inserting the image
>> > name into a column of the products table).
>> 
>> 
>> I guessed right then. The origin of your problem is that you have =
>> similar items in your database, but the database doesn't know they are =
>> similar. I'd fix that first, it makes the problem a whole lot easier to =
>> handle.
>> 
>> For example, if an image comes in named ABC123G.jpg, you look up the =
>> product and manufacturer and update its image. Then you query for =
>> products of the same manufacturer that are similar to ABC123G (result: =
>> ABC123 and ABC123X) and update their images as well (if appropriate; =
>> perhaps they have a recent enough image of their own?).
>> 
>> 
>> As another whacky alternative to your regular expressions; I think it =
>> would be possible to abuse the text-search functionality in Postgres to =
>> match product id's. Those id's are basically a language per manufacturer =
>> describing product details.
>> 
>> If you can split the product id's up into lexemes that describe the base =
>> product id and it's options, then you can use full-text search to match =
>> up expressions similar to the lexemes derived from the image name.
>> 
>> For example:
>>  productid | lexemes
>> -----------+------------------
>>  ABC123    | {'ABC' '123'}
>>  ABC123G   | {'ABC' '123' 'G'}
>>  ABC123X   | {'ABC' '123' 'X'}
>>  ABC456    | {'ABC' '456'}
>> 
>> I'm not really sure if that's possible, or how much work it would be per =
>> manufacturer - I haven't used FTS much.
>> 
>> I'd first see if I couldn't add that similarity information to the =
>> products table, though ;)
>> 
>> Alban Hertroys
>> 
>> --
>> The scale of a problem often equals the size of an ego.
>> 
>> 

Hi,

the actual process tends to be
- add products to database
- then receive images and try to match them to products.

So I know about the naming scheme only when I see a list of files, and would want to have
a cmdline option for my matching script that distinguishes formats like
a) exact
b) alpha suffix following numeric main body
c) period or slash between main and related
To make things even more complex, I might receive images from a reseller that offers
a few manufacturers using different conventions.

I wonder whether this would translate well into building a temporary index, if I detect b or c
patterns.

When I asked first, I also started working on a different approach. This did work on the
current case (I got similar datasets before, but with considerably fewer items), so I will
probably try that and some ideas I got from this discussion, and see how far I get.
This approach is a separate perl script that builds a tree structure of all the image names
and then tries to create a regex for a crude filter. In the example it would have
determined that all images in the lot match an ABC prefix. Then it selects all matching
codes (so it can work with the entire database) and runs them through the same tree
structure.

Regards
Wolfgang Hamann

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general