On 15/03/12 21:12, Jeff Davis wrote:
On Fri, 2012-03-16 at 01:57 +0530, Alexander.Bagerman@xxxxxxxxxxxxx wrote:
We are having a hard time identifying MS Office, OpenOffice, and PDF parsers to index stored documents and make them available for text searching.
The first step is to find a library that can parse such documents, or convert them to a format that can be parsed.
I've used docx2txt, pdf2txt and friends to produce text files that I then index during the import process. An external script runs the whole process. All I cared about was extracting raw text, though; this approach does nothing to identify headings etc.
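As a rough illustration of that kind of external script, here is a minimal Python sketch. It is not the script I actually use. The command lines in CONVERTERS are assumptions: each tool is assumed to take the input file as its last argument and write plain text to stdout, which may not match your installed versions. The `documents` table and its columns are hypothetical as well.

```python
import subprocess
from pathlib import Path

# Assumed command lines: each converter is taken to accept the input file as
# its last argument and to write plain text to stdout. Tool names and options
# differ between packages, so check what is installed on your system.
CONVERTERS = {
    ".docx": ["docx2txt"],
    ".pdf": ["pdf2txt"],
}


def build_command(path, converters=CONVERTERS):
    """Return the argv list for the converter matching the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in converters:
        raise ValueError(f"no converter configured for {suffix!r}")
    return list(converters[suffix]) + [str(path)]


def extract_text(path, converters=CONVERTERS):
    """Run the external converter and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_command(path, converters),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the converter fails
    )
    return result.stdout


# Hypothetical table layout. The three parameters are the file path and the
# extracted text (passed twice: once stored raw, once for to_tsvector).
INSERT_SQL = (
    "INSERT INTO documents (path, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)
```

The import step would then call extract_text() for each file and run INSERT_SQL through whatever database driver you use, passing the path and the extracted text as parameters. Only raw text comes out, so headings and other structure are lost, as noted above.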
-- Richard Huxton Archonet Ltd