Re: [htdig] pdf indexing question

Subject: Re: [htdig] pdf indexing question
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Jul 25 2000 - 09:19:34 PDT

According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files. The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
> I'm not really sure how to go about fixing this problem. Here's what I have
> in my configuration file:
> external_parsers: application/msword->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/postscript->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl
> I was trying to use the parse_doc.pl script instead of the conv_doc.pl
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.

Looks like you're dealing with a few separate problems here.

Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use parse_doc.pl.
As long as you're running 3.1.4 or later, you should use conv_doc.pl or
doc2html.pl, rather than parse_doc.pl -- they just work better.

Also, errors about PDF files that couldn't be repaired would come from
acroread as well. These are caused by max_doc_size not being set high
enough for your largest PDF documents: htdig truncates anything larger
than that limit, and a truncated PDF can't be read. See FAQ 5.1 & 5.2.
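
As a rough sketch, assuming your largest PDF is somewhat under 5 MB (that
figure is only an illustration, so check the real file sizes first), the
entry in your configuration file might look like this:

```
# Hypothetical limit in bytes; set it above the size of your largest PDF.
max_doc_size: 5000000
```

The PDFs have to be fetched again by htdig before the new limit has any
effect.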

Finally, you should run /usr/local/htdig/bin/conv_doc.pl, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to
see what output you get, if any. It may be that the PDF contains only
image data, and no indexable text, or it may be that conv_doc.pl isn't
configured with the right path to the pdftotext executable.

I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file. A backslash is required
at the very end of all but the last line in a multi-line definition.

If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running conv_doc.pl
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the "no excerpt"
error above.

Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

