We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: 1) new postgresql.conf item "mbstr_check" added. 2) if mbstr_check = 0 then invalid characters are not accepted (same as current PostgreSQL behavior). This is the default. 3) if mbstr_check = 1 then invalid characters are accepted with WARNING 4) if mbstr_check = 2 then invalid characters are accepted without any warnings 5) We have checked PostgreSQL source code if accepting invalid characters makes some troubles. We have found that we need to fix a place and the fix is included in the patches. Madison, If you are interested in the patches, I could send it to you. Hackers, Do you think the functionality something like above is worth to add to PostgreSQL? -- Tatsuo Ishii > Hi all, > > I've been chasing down a bug and from what I have learned it may be > because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles > invalid unicode. I've been given some ideas on how to try to catch > invalid unicode but it seems expensive so I am hoping there is a > postgresql way to deal with this problem. > > I've run into a problem where a bulk postgres "COPY..." statement is > dieing because one of the lines contains a file name with an invalid > unicode character. In nautilus this file has '(invalid encoding)' and > the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column > file_name: "Femme Fatal\uffff.url"'. > > To actually look at the file from the shell (bash) shows what appears > to be a whitespace but when I copy/paste the file name I get the > '\uffff' you see above. > > I could, with the help of the TLUG people, use regex to match for an > invalid character and skip the file but that is not ideal. The reason is > that this is for my backup program and invalid unicode or not, the > contents of the file may still be important and I would prefer to have > it in the database so that it is later copied. I can copy and move the > file in the shell so the file isn't apparently in an of itself corrupt. > > So then, is there a way I can tell postresql to accept the invalid > unicode name? Here is a copy of my schema: > > tle-bu=> \d file_info_2 > Table "public.file_info_2" > Column | Type | Modifiers > ----------------------+----------------------+----------------------------------------- > file_group_name | text | > file_group_uid | bigint | not null > file_mod_time | bigint | not null > file_name | text | not null > file_parent_dir | text | not null > file_perm | text | not null > file_size | bigint | not null > file_type | character varying(2) | not null default > 'f'::character varying > file_user_name | text | > file_user_uid | bigint | not null > file_backup | boolean | not null default true > file_display | boolean | not null default false > file_restore_display | boolean | not null default false > file_restore | boolean | not null default false > Indexes: > "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name) > > 'file_name' and 'file_parent_dir' are the columns that could have > entries with the invalid unicode characters. Maybe I could/should use > something other than 'text'? These columns could contain anything that a > file or directory name could be. > > Thanks! > > Madison > > -- > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > Madison Kelly (Digimer) > TLE-BU, The Linux Experience; Back Up > http://tle-bu.thelinuxexperience.com > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster