Re: searching non plain text files

Tim-Hinnerk Heuer <th.heuer@xxxxxxxxx> · Sat, 15 Dec 2018 23:14:35 +1300

Way back when, I used pdftotext and various other document-to-text converters to extract the text into a separate text file, which I then indexed with Sphinx search. Nowadays you can also index quite well with PostgreSQL, or maybe even MySQL.
Good luck!
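
For illustration, the convert-then-index approach might look roughly like the sketch below. It assumes the pdftotext command-line tool (from Poppler or Xpdf) is on the PATH; the function names and the output directory layout are made up for this example, and the indexing step itself (Sphinx, PostgreSQL, and so on) is left out.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list: pdftotext [options] PDF-file text-file."""
    # -layout asks pdftotext to keep the original physical layout where it can.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_pdf_text(pdf_path, out_dir):
    """Write the text of pdf_path to out_dir/<name>.txt and return that path.

    Returns None when pdftotext is not installed or the conversion fails.
    (Hypothetical helper, not part of any library.)
    """
    if shutil.which("pdftotext") is None:
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (Path(pdf_path).stem + ".txt")
    result = subprocess.run(pdftotext_command(pdf_path, txt_path),
                            capture_output=True)
    return txt_path if result.returncode == 0 else None
```

The resulting .txt files can then be handed to whatever indexer you prefer. Other formats would need their own converter in place of pdftotext.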

On Sat, 15 Dec 2018 17:20 Jeffry Killen <jekillen@xxxxxxxxxxx> wrote:
Hello;

Can anyone point me to instruction/advice about opening and reading files that are not plain text: word processing docs, pdf, ps, image files, even compiled code.

I am writing a search function to search file systems and don't know a lot about the formatting of non plain text files.

The immediate concern is line breaks in word processing docs, pdf and ps files.

Then detecting compiled code files so I can leave them alone. This type of file might not have a suffix to consider.

Then the various image files that might be encountered.

Even suffixes aren't a guarantee of the content.

Thanks

Jeff K.
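
On telling compiled code and image files apart without trusting the suffix: one common approach is to look at the first bytes of the file instead. The sketch below is only a rough illustration under some assumptions. The signature table is deliberately short and far from complete, the labels and function names are invented for this example, and the NUL-byte fallback will misclassify some text encodings such as UTF-16.

```python
# Leading-byte signatures for a few common formats (not exhaustive).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\x7fELF", "elf-binary"),            # compiled code on Linux and many Unixes
    (b"MZ", "windows-executable"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip-container"),      # also .docx, .odt, .jar
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),  # legacy .doc, .xls
]


def sniff_bytes(head):
    """Classify a file from its leading bytes (hypothetical helper)."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    # Heuristic fallback: plain text rarely contains NUL bytes.
    if b"\x00" in head:
        return "binary"
    return "text"


def sniff_file(path, probe_size=8192):
    """Read the first probe_size bytes of path and classify them."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(probe_size))
```

A search function could call sniff_file() on each path and skip anything reported as "elf-binary", "windows-executable" or "binary", while routing "pdf" and "postscript" to a text converter first.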