> On Nov 20, 2011, at 21:33, Phoenix Kiula <phoenix.kiula@xxxxxxxxx> wrote: > > My big table now has about 70 million rows, with the following columns: > > alias | character varying(35) > url | text > modify_date | timestamp without time zone > ip | bigint > > > For each IP address (user of my application) I want to have a unique > URL. So I used to have a UNIQUE constraint on IP, URL. But the index > based on this became huge, as some URLs are gigantic. so I introduced > an md5 of the URL: > > url_md5 | varchar(32) I don't understand this. So you want to allow exactly one URL for an IP address? Or do you want to allow only one row with the same (IP, URL) values? Because that's exactly what a UNIQUE index on (IP, URL) does. If you want to allow just a single URL for an IP, you should create an index on IP only. > I now have two scenarios: > > 1. To have an index (unique?) on "(ip, url_md5)" > > 2. To not have an index on just the "ip". This way a query that tries > to match "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will > still look only at the ip bit of the index, then refine it with the > url_md5. Have you actually performed any benchmarks with this? Just create the indexes, run several (say 100) queries and that should give you an idea which of the options is better. We can speculate on the pros/cons of those options, but both are viable under certain conditions. For example if there's just a very small number of URLs for an IP, then #2 is probably going to be better. But if there's enormous number of URLs per IP, then a multi-column index is probably going to perform better. But those are only guesses - try both options, run a few queries (not on the overloaded system you're struggling with - that'd skew the results) and show us EXPLAIN ANALYZE of the queries. > Questions: > > 1. Instead of md5, is there any integer hashing algorithm that will > allow me to have a bigint column and save a lot hopefully in both > storage space and speed? (Some very useful points mentioned here: > http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer > ) You can obviously take an MD5 hash, which is just a 16-byte value and use the first 4 bytes as an integer. The problem with this is that it probably significantly increases the collision. I.e. it will indicate two URLs to be the same, although the URLs are different. > 2. If I do go with the above scenario #1 of a joint index, is there > any way I can save space and maintain speed? Partitioning etc are out > of the question. What is the index size, anyway? You've mentioned it's huge, but what is the actual size? > With a growing web database, I am sure many people face this > situation. Are nosql type databases the only sane solution to such > massive volumes and throughput expectations (e.g., CouchDb's MemBase)? Well, partitioning is usually a good approach to problems like these. Have you thought about moving the URLs into a separate table? I.e. instead of this table alias | character varying(35) url | text url_md5 | character varying(32) modify_date | timestamp without time zone ip | bigint you'd have two tables - one for keeping the URLs: id | integer url | text url_md5 | character varying(32) and the current one, referencing the URLs alias | character varying(35) url_id | integer modify_date | timestamp without time zone ip | bigint That'd allow you to create a UNIQUE index on (ip, url_id), which should be much smaller than the current one. But handling the inserts would be significantly more complex (you'd have to check existence of the URL, insert it etc.). Tomas -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general