Subject: Re: [htdig] pdf indexing question
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Jul 25 2000 - 09:19:34 PDT

According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files. The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
>
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
>
> I'm not really sure how to go about fixing this problem. Here's what I have
> in my configuration file:
>
> external_parsers: application/msword->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/postscript->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl
>
> I was trying to use the parse_doc.pl script instead of the conv_doc.pl
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.

Looks like you're dealing with a few separate problems here.

Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use parse_doc.pl. As
long as you're running 3.1.4 or later, you should use conv_doc.pl or
doc2html.pl, rather than parse_doc.pl -- they just work better.
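
Also, errors about PDF files that couldn't be repaired would come from
acroread as well. These are caused by max_doc_size not being set high
enough for your largest PDF documents: htdig truncates anything larger
than that limit, and a truncated PDF looks damaged to the converter.
See FAQ 5.1 & 5.2.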
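
As a rough illustration of the max_doc_size check, here is a small standalone Python sketch. It is not part of ht://Dig, and the function names are hypothetical. It assumes the usual "name: value" attribute syntax, does not follow included files, and compares the limit against local copies of the PDFs.

```python
import os
import re


def read_max_doc_size(config_text):
    """Return the last max_doc_size value found in the config text, or None.

    Assumption: attributes look like "name: value" and lines starting
    with '#' are comments. Included files are not followed.
    """
    value = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        match = re.match(r"max_doc_size\s*:\s*(\d+)\s*$", stripped)
        if match:
            value = int(match.group(1))
    return value


def oversized_files(paths, max_doc_size):
    """Return the paths whose size on disk exceeds max_doc_size bytes."""
    return [p for p in paths if os.path.getsize(p) > max_doc_size]
```

Any file this reports would be cut off at the limit during the dig, so max_doc_size needs to be at least as large as the biggest PDF you want indexed.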

Finally, you should run /usr/local/htdig/bin/conv_doc.pl, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to see
what output you get, if any. It may be that the PDF contains only image
data, and no indexable text, or it may be that conv_doc.pl isn't
configured with the right path to the pdftotext executable.
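
The manual check with pdftotext could be scripted roughly as follows. This is only a sketch: it assumes pdftotext is on your PATH (or that you pass its full path), the helper names are hypothetical, and the 20-character threshold for "indexable" is an arbitrary choice, not anything htdig uses.

```python
import subprocess


def has_indexable_text(text, min_chars=20):
    """True if text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption for this sketch.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdf_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return whatever text it extracts.

    The trailing "-" asks pdftotext to write to standard output.
    A FileNotFoundError here means the pdftotext path is wrong, which
    is one of the two failure modes worth ruling out.
    """
    result = subprocess.run(
        [pdftotext, pdf_path, "-"], capture_output=True, check=False
    )
    return result.stdout.decode("latin-1", errors="replace")
```

For example, has_indexable_text(pdf_text("faq.pdf")) coming back False on a local copy of the file would suggest an image-only PDF, while an exception from pdf_text points at the path to the executable.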

I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file. A backslash is required at
the very end of all but the last line in a multi-line definition.
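
The backslash rule can be checked mechanically. The sketch below is a standalone illustration with a hypothetical function name; it takes the lines of one multi-line definition and treats a backslash followed by trailing spaces as missing, since the backslash has to be the very last character on the line.

```python
def missing_continuations(lines):
    """Return 1-based positions of lines that lack a trailing backslash.

    `lines` holds the lines of ONE multi-line attribute definition.
    Every line except the last must end with a backslash; only the
    newline is stripped, so trailing spaces after the backslash count
    as an error.
    """
    problems = []
    for position, line in enumerate(lines[:-1], start=1):
        if not line.rstrip("\n").endswith("\\"):
            problems.append(position)
    return problems
```

Fed the three external_parsers lines quoted above, this should return an empty list; if the mail program's line split were really present in the file, the short first line would be reported.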

If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running conv_doc.pl
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the no excerpt
error above.

--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930