Re: Table Design question for gurus (without going to "NoSQL")...

Gavin Flower <GavinFlower@xxxxxxxxxxxxxxxxx> · Mon, 21 Nov 2011 12:26:38 +1300

On 21/11/11 02:33, Phoenix Kiula wrote:
Hi.

Want to start another thread, loosely related to the performance
problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:

  alias           | character varying(35)
  url             | text
  modify_date     | timestamp without time zone
  ip              | bigint

For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on IP, URL. But the index
based on this became huge, as some URLs are gigantic. so I introduced
an md5 of the URL:

  url_md5             | varchar(32)

I now have two scenarios:

1. To have an index (unique?) on "(ip, url_md5)"

2. To not have an index on just the "ip". This way a query that tries
to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
still look only at the ip bit of the index, then refine it with the
url_md5.

The good thing about #2 is the size of index remains very small with
only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of "...WHERE ip = 999 AND
url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
IP address has put in a lot of URLs, then this becomes a bit slow. As
is now happening, where I have users who have over 1 million URLs
each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and save a lot hopefully in both
storage space and speed?  (Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)

2. If I do go with the above scenario #1 of a joint index, is there
any way I can save space and maintain speed? Partitioning etc are out
of the question.

With a growing web database, I am sure many people face this
situation. Are nosql type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Many thanks for any ideas or pointers!

How about having 2 indexes: one on each of ip & url_md5? Pg will combine 
the indexes as required, or will just use one if that is best.

Why don't you have a time zone on your timestamp???

Regards,
Gavin

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general