Re: tsearch2 dictionary that indexes substrings?

Tilmann Singer <tils-pgsql@xxxxxxxx> · Mon, 23 Apr 2007 19:16:59 +0200

* Oleg Bartunov <oleg@xxxxxxxxxx> [20070420 11:32]:
> >If I understand it correctly such a dictionary would require to write
> >a custom C component - is that correct? Or could I get away with
> >writing a plpgsql function that does the above and hooking that
> >somehow into the tsearch2 config?
> 
> You need to write C-function, see example in 
> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Thanks.

My colleague, who speaks more C than I do, came up with the code below,
which works fine for us. Will the memory allocated for each lexeme be
freed by the caller?

Til

/* 
 * Dictionary for partials (prefixes) of a word, i.e. foo => {f,fo,foo}
 *
 * Based on the tsearch2/gendict/config.sh generator
 *
 * Author: Sean Treadway
 *
 * This code is released under the terms of the PostgreSQL License.
 */
#include "postgres.h"

#include "dict.h"
#include "common.h"

#include "subinclude.h"
#include "ts_locale.h"

#define is_utf8_continuation(c) ((unsigned char)(c) >= 0x80 && (unsigned char)(c) <= 0xBF)

PG_FUNCTION_INFO_V1(dlexize_partial);
Datum dlexize_partial(PG_FUNCTION_ARGS);
Datum
dlexize_partial(PG_FUNCTION_ARGS) {
  char*  in = (char*)PG_GETARG_POINTER(1);

  char*  utxt = pnstrdup(in, PG_GETARG_INT32(2)); /* palloc */
  char*  txt = lowerstr(utxt);                    /* palloc */
  int    txt_len = strlen(txt);

  int    results = 0;
  int    i = 0;

  /* may overallocate, that's ok; palloc0 zeroes nvariant and flags */
  TSLexeme   *res = palloc0(sizeof(TSLexeme)*(txt_len+1));

  for (i = 1; i <= txt_len; i++) {
    /* skip UTF-8 continuation bytes so no prefix ends mid-character;
       txt[txt_len] is the NUL terminator, so the full word is kept */
    if (!is_utf8_continuation(txt[i])) {
      res[results++].lexeme = pnstrdup(txt, i);
    }
  }

  res[results].lexeme=NULL;

  pfree(utxt);
  pfree(txt);

  /* Receiver must free res memory and res[].lexeme */
  PG_RETURN_POINTER(res);
}
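
To check the prefix logic outside the server, the loop above can be
sketched as a standalone function. This is only an illustration under
stated assumptions: utf8_prefixes and free_prefixes are hypothetical
helper names, not part of tsearch2, and plain malloc/calloc stand in for
palloc, so the caller frees the result itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* True for UTF-8 continuation bytes (bit pattern 10xxxxxx, 0x80..0xBF). */
#define is_utf8_continuation(c) (((unsigned char)(c) & 0xC0) == 0x80)

/*
 * Hypothetical helper (not a tsearch2 function): returns a NULL-terminated
 * array holding one prefix of txt per complete UTF-8 character.
 * Both the array and every string in it are heap-allocated.
 */
static char **
utf8_prefixes(const char *txt)
{
    size_t len, i, n = 0;
    char **res;

    assert(txt != NULL);
    len = strlen(txt);

    /* may overallocate; calloc leaves unused slots NULL */
    res = calloc(len + 1, sizeof(char *));
    if (res == NULL)
        return NULL;

    for (i = 1; i <= len; i++) {
        /* txt[len] is the NUL terminator, so the full word is always kept */
        if (!is_utf8_continuation(txt[i])) {
            char *p = malloc(i + 1);

            if (p == NULL)
                break;          /* return what we have so far */
            memcpy(p, txt, i);
            p[i] = '\0';
            res[n++] = p;
        }
    }
    return res;
}

/* Releases an array returned by utf8_prefixes. */
static void
free_prefixes(char **res)
{
    size_t i;

    if (res == NULL)
        return;
    for (i = 0; res[i] != NULL; i++)
        free(res[i]);
    free(res);
}
```

For the UTF-8 input "für" (bytes 66 c3 bc 72) this yields three
prefixes, "f", "fü" and "für"; the two-byte prefix ending in c3 is
skipped because the byte after it is a continuation byte.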