Re: OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers




Rajagopal Swaminathan wrote:
> Greetings,
> 
> On Fri, Aug 28, 2009 at 10:50 PM, Les Mikesell<lesmikesell@xxxxxxxxx> wrote:
>> Does anyone have experience with linux tools to parse the text from
>> common non-text file formats for searching?  I'm trying to use the
>> kinosearch add-on for twiki which is fine as far as the search goes, but
>> it takes forever to generate the index.
> 
> I am not sure this answers your query to the point.
> 
> But I have seen Lucene .net SDK (With extensions to scour .doc, .odt,
> .pdf etc.) to very good effect and pretty decent performance.
> 

Wouldn't that have to run under Windows? I think the 'catdoc' package
from the EPEL repo may be usable: it provides 'catdoc' for Word and
'xls2csv' for Excel.  Apache POI might work too, but launching a JVM
for every file would probably be slow.  I'm not sure anything handles
Visio, though.
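
For what it's worth, here is a rough sketch of how the command-line
extractors could be wired into an indexer, picking a tool by file
extension and capturing its stdout. It is untested against kinosearch
and makes a few assumptions that are not from this thread: that
'catppt' (also shipped in the catdoc package) is wanted for .ppt, that
'pdftotext' from poppler-utils is installed for .pdf, and that the
function names extractor_command and extract_text are just placeholders
for illustration.

```python
import os
import subprocess

# Extension -> extractor command. catdoc, xls2csv and catppt come from the
# catdoc package; pdftotext (poppler-utils) is an assumption added here.
EXTRACTORS = {
    ".doc": ["catdoc"],
    ".xls": ["xls2csv"],
    ".ppt": ["catppt"],
    ".pdf": ["pdftotext"],
}


def extractor_command(path):
    """Return the argv list for extracting text from path, or None."""
    ext = os.path.splitext(path)[1].lower()
    prefix = EXTRACTORS.get(ext)
    if prefix is None:
        return None  # unsupported format, e.g. Visio
    cmd = prefix + [path]
    if ext == ".pdf":
        cmd.append("-")  # "-" tells pdftotext to write to stdout
    return cmd


def extract_text(path):
    """Run the matching extractor and return its output, or None on failure."""
    cmd = extractor_command(path)
    if cmd is None:
        return None
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except OSError:
        return None  # extractor is not installed or not executable
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Each call still forks one small native process per file, which should
be much cheaper than starting a JVM each time, but I have not measured
it.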

-- 
   Les Mikesell
    lesmikesell@xxxxxxxxx
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
