Re: bytea encode performance issues

Alban Hertroys <dalroi@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Sun, 3 Aug 2008 15:31:48 +0200

On Aug 3, 2008, at 2:36 PM, Sim Zacks wrote:

The LIKE operator is likely the problem, but it is a critical part  
of an
email application. Searches are done by, "Show me all emails  
containing
the following word."

I've tried using TSearch2's full text index. It made the query 50%
faster, taking 5 minutes. This is still not even close to the less  
then
a minute in MySQL.

That LIKE operator is probably your problem. An unbounded LIKE  
like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

There are apparently some possibilities with the new GIN indexes (or
maybe even using GIST), but I haven't had an opportunity to try those
yet. There were some messages about just that on this list recently.

I don't think a functional index (or anything other then a FTI) would
accomplish anything, being that I am doing unbounded Likes.

That's why I suggested to use a text field instead of bytea.
IIRC, You can have an index on word triplets and use tsearch. I don't  
have intimate knowledge on how that works though, hopefully other  
people will chime in here.

Without the need to convert each row before comparing it, and with an  
appropriate index, that should significantly speed up your queries.

If you create an index, make sure you create a _functional_ index  
over
ENCODE(messageblk, 'escape').

Email is binary when it contains attachments. I actually planned on
using an ASCII encoding, but the dbmail people specifically said  
not to.
I don't know if they were speaking from experience, or because ASCII
sounds bad.

It shouldn't be; those attachments are MIME or UU encoded, are they not?
Don't confuse ASCII and SQLASCII. The latter accepts characters from  
any encoding, which is probably what you want.

Since when is e-mail binary data? I don't quite see why you'd use a
bytea field instead of text. If your problem is character  
encoding, then
just don't store that ("encode" the DB using SQLASCII).

As I mentioned, this is the system that came with dbmail. It runs on
both PostGresql and MySQL, so they may have done some compatibility
things. There are 4 statuses possible, 0,1,2,3 if you use the database
through the software then a 99 could never appear there.

The software isn't the only client that might connect to the database.
It is usually bad practice to put data constraint logic in the client  
instead of in the database. Especially since in client code there are  
usually multiple sections of code that have to deal with those  
constraints, which tends to result in small differences in their  
handling.

Next to that, if those statuses would have a proper foreign key  
constraint, it would be very easy to add labels to each status in a  
way they would make a bit more sense than 0, 1, 2, 3.
I expect the label would be a sufficient foreign key by itself  
though, no need for those silly numbers.

Well, there's probably not much you can do about that, being just a  
user of dbMail.

I notice some other oddities in that query/design. Why is is_header a
smallint instead of a boolean? I'm assuming this is for compatibility
with an other database?

Why use status IN (0, 1) instead of more descriptive keys? Is it even
constrained this way, or could arbitrary numbers end up as status  
(say
99) and if so, what happens to those messages?

You have a very odd way of quoting. You don't top post as such, but  
you top post in context... Haven't seen that one before. Usually  
people reply _below_ a section, not above it ;) It made my reply a  
bit harder to write.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4895b34b243488085013917!