Michael Enke recently asked in pgsql-bugs about VARDATA and C strings (BUG #2574: C function: arg TEXT data corrupt). Since that's not a bug, I've moved this follow-up to pgsql-general.

On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
>     str = DatumGetCString(DirectFunctionCall1(textout, datumval));

Yikes! I've been accessing VARDATA text data like Michael for years (code below). I account for length and don't expect null-termination, but I don't use anything like Tom's suggestion above. (I always try to do what Tom says because that usually hurts less.)

I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a 7.x manual, and I bet Michael did similarly. I've never heard of DatumGetCString, DirectFunctionCall1, or textout. Are these and other treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do something other than null terminate a string? All of the strings are from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in case that's relevant.

3) Is there any reason to believe that the code below is problematic?
Thanks,
Reece


#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
{
  text* t0;                     /* in */
  text* t1;                     /* out */
  char* tmp;
  int32 tmpl;

  if ( PG_ARGISNULL(0) )
    { PG_RETURN_NULL(); }

  t0 = PG_GETARG_TEXT_P(0);

  tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
  tmpl = (int32) strlen(tmp);

  /* copy temp sequence into new pg variable */
  t1 = (text*) palloc( tmpl + VARHDRSZ );
  if (!t1)
    { elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
  memcpy(VARDATA(t1),tmp,tmpl);
  VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

  pfree(tmp);

  PG_RETURN_TEXT_P(t1);
}


/* clean_sequence -- strip non-IUPAC symbols
   The intent is to strip non-sequence data which might result from
   copy-pasting a fasta file or some such.

   in: char*, length
   out: char*, |out|<=length, NULL-TERMINATED
   out is palloc'd memory; caller must free

   allow chars from IUPAC std 20 + selenocysteine (U) + ambiguity (BZX)
   + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
                   || ((c)=='-') \
                   || ((c)=='*') )

char* clean_sequence(const char* in, int32 n)
{
  char* out;
  char* oi;
  int32 i;

  out = palloc( n + 1 );        /* w/null */
  if (!out)
    { elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

  for( i=0, oi=out; i<=n-1; i++ )
    {
      char c = toupper(in[i]);
      if ( isseq(c) )
        { *oi++ = c; }
    }
  *oi = '\0';
  return(out);
}

-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
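
For anyone who wants to exercise the filtering logic without a backend, here is a minimal standalone sketch. It is not the code above: it assumes malloc/free in place of palloc/pfree, and the names is_seq_char, clean_sequence_standalone and demo_unterminated_input are hypothetical, made up for this illustration. The input is treated as a length-counted buffer, like VARDATA, so it never needs a trailing null.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Same character set as the isseq macro in the message:
   A-Z except J and O, plus gap '-' and stop '*'. */
int is_seq_char(int c)
{
    return (c >= 'A' && c <= 'Z' && c != 'J' && c != 'O')
           || c == '-' || c == '*';
}

/* Hypothetical standalone variant of clean_sequence.
   Assumption: malloc replaces palloc, so the caller releases the
   result with free(). Returns NULL if the allocation fails.
   Exactly n bytes of `in` are read; no terminator is expected. */
char *clean_sequence_standalone(const char *in, size_t n)
{
    char *out;
    char *oi;
    size_t i;

    assert(in != NULL || n == 0);

    out = malloc(n + 1);            /* room for the terminator */
    if (out == NULL)
        return NULL;

    oi = out;
    for (i = 0; i < n; i++) {
        /* Cast first: toupper expects a value representable as
           unsigned char (or EOF). */
        int c = toupper((unsigned char) in[i]);
        if (is_seq_char(c))
            *oi++ = (char) c;
    }
    *oi = '\0';                     /* output IS null-terminated */
    return out;
}

/* Returns 1 if a 9-byte buffer with no terminator cleans to "AC-GT*". */
int demo_unterminated_input(void)
{
    const char raw[9] = { 'a', 'c', '-', 'g', 't', '*', '1', ' ', 'j' };
    char *cleaned = clean_sequence_standalone(raw, sizeof raw);
    int ok = (cleaned != NULL) && (strcmp(cleaned, "AC-GT*") == 0);

    free(cleaned);
    return ok;
}
```

Calling demo_unterminated_input() from a small main() is enough to check the behavior. Only the output is a C string; the input side relies on the explicit length alone, which matches the "account for length, don't expect null-termination" approach described above.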