OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers

Les Mikesell <lesmikesell@xxxxxxxxx> · Fri, 28 Aug 2009 12:20:38 -0500

Does anyone have experience with Linux tools that extract text from
common non-text file formats for searching?  I'm trying to use the
KinoSearch add-on for TWiki, which is fine as far as the search goes, but
it takes forever to generate the index.  It uses xpdf to extract strings
from PDFs, antiword for .doc, and, since it is Perl, the
Spreadsheet::ParseExcel module for .xls.  Some documents parse/index
quickly, some extremely slowly, and in the .xls case some seem to hang
forever.  I suspect the real issue arises when the parsers (correctly or
incorrectly) detect a wide character set and the indexer gets confused
trying to re-encode it.  What is the best approach to debugging
something that might be in the Perl character set handlers?
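
One way to narrow this down might be to run each extractor by itself, outside the indexer, with a time limit, and note which files are slow, which hang, and how much of the extracted text is non-ASCII. The sketch below is only an illustration in Python, not part of the KinoSearch add-on: the helper names (timed_extract, non_ascii_ratio, extractor_for), the 30-second limit, and the exact command lines ("pdftotext FILE -" and "antiword FILE", both writing to stdout) are assumptions to adjust for the local setup. The .xls case is left out because Spreadsheet::ParseExcel is a Perl module rather than a command.

```python
import subprocess
import sys
import time


def timed_extract(cmd, timeout_s=30.0):
    """Run one extractor command; return (status, seconds, text).

    status is "ok", "error" (non-zero exit), "timeout", or "missing".
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return "timeout", time.monotonic() - start, ""
    except FileNotFoundError:
        return "missing", time.monotonic() - start, ""
    elapsed = time.monotonic() - start
    status = "ok" if proc.returncode == 0 else "error"
    # Decode leniently so a bad byte sequence cannot abort the survey.
    text = proc.stdout.decode("utf-8", errors="replace")
    return status, elapsed, text


def non_ascii_ratio(text):
    """Fraction of characters above code point 127 (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)


def extractor_for(path):
    """Map a file name to an extractor command (assumed invocations)."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        return ["pdftotext", path, "-"]  # assumed: "-" sends text to stdout
    if lower.endswith(".doc"):
        return ["antiword", path]        # assumed: text goes to stdout
    return None


if __name__ == "__main__":
    # Usage: python survey.py file1.pdf file2.doc ...
    for path in sys.argv[1:]:
        cmd = extractor_for(path)
        if cmd is None:
            print(f"{path}: skipped (no extractor configured)")
            continue
        status, secs, text = timed_extract(cmd)
        print(f"{path}: {status} in {secs:.2f}s, "
              f"non-ASCII ratio {non_ascii_ratio(text):.3f}")
```

If the slow or hanging files turn out to be the ones with a high non-ASCII ratio, that would point toward the encoding theory; if the extractors themselves time out, the problem is more likely in the parsers than in the indexer.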

-- 
   Les Mikesell
    lesmikesell@xxxxxxxxx
