Re: [htdig] pdf indexing question


Subject: Re: [htdig] pdf indexing question
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Jul 25 2000 - 09:19:34 PDT


According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files. The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
>
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
>
> I'm not really sure how to go about fixing this problem. Here's what I have
> in my configuration file:
>
> external_parsers: application/msword->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/postscript->text/html /usr/local/htdig/bin/conv_doc.pl \
> application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl
>
> I was trying to use the parse_doc.pl script instead of the conv_doc.pl
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.

Looks like you're dealing with a few separate problems here.

Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use parse_doc.pl.
As long as you're running 3.1.4 or later, you should use conv_doc.pl or
doc2html.pl, rather than parse_doc.pl -- they just work better.

Also, errors about PDF files that couldn't be repaired would come from
acroread as well. These are caused by max_doc_size not being set high
enough for your largest PDF documents, so htdig truncates them before
the converter ever sees them. See FAQ 5.1 & 5.2.
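
To choose a max_doc_size value it helps to know how big the largest PDF really is. Here is a minimal sketch, assuming the PDFs are also reachable on the local filesystem and end in lowercase .pdf. The function name and the directory in the usage comment are made up for illustration.

```shell
# Print the size in bytes of the largest *.pdf file under a directory.
# Prints 0 if no PDF is found.
largest_pdf_bytes() {
    find "$1" -type f -name '*.pdf' -exec wc -c {} + |
        awk '$2 != "total" && $1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Usage (hypothetical document root):
#   largest_pdf_bytes /var/www/html
```

max_doc_size is given in bytes, so set it comfortably above the number printed, for example "max_doc_size: 5000000" if the largest file is around 3 MB.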

Finally, you should run /usr/local/htdig/bin/conv_doc.pl, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to
see what output you get, if any. It may be that the PDF contains only
image data, and no indexable text, or it may be that conv_doc.pl isn't
configured with the right path to the pdftotext executable.
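
The manual check described above could look roughly like this. It assumes pdftotext is on your PATH and that you have saved a local copy of the PDF as faq.pdf. The helper name count_words is made up, and the argument order shown for conv_doc.pl (file, content type, URL) is an assumption based on what htdig passes to an external converter, so check the comments at the top of your copy of the script.

```shell
# Count whitespace-separated words arriving on standard input.
# A result of 0 means no text was extracted at all.
count_words() {
    wc -w | tr -d '[:space:]'
}

# Usage, with a local copy of the PDF saved as faq.pdf:
#   pdftotext faq.pdf - | count_words
#
# conv_doc.pl emits HTML, so inspect its output by eye instead of counting:
#   /usr/local/htdig/bin/conv_doc.pl faq.pdf application/pdf \
#       http://svr-newlix/products/technical/faq.pdf | head -40
```

If pdftotext yields zero words, the PDF most likely holds only images. If pdftotext yields text but conv_doc.pl does not, the path to pdftotext inside conv_doc.pl is the first thing to check.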

I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file. A backslash is required
at the very end of all but the last line in a multi-line definition.
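
One way to confirm that the definition is not split in the file itself is to print the attribute with its continuation lines joined. This is a rough sketch that only imitates the backslash rule described above. It is not htdig's own parser, and the function name show_attr is made up.

```shell
# Print one attribute from an htdig-style config file, joining lines
# that end in a backslash. The backslash must be the very last character.
show_attr() {
    # $1 = attribute name, $2 = config file
    awk -v name="$1" '
        !cont && index($0, name ":") != 1 { next }   # skip until the attribute starts
        {
            line = $0
            cont = sub(/\\$/, "", line)               # 1 if the line ended in a backslash
            printf "%s", line
            if (!cont) { print ""; exit }             # definition is complete
        }
    ' "$2"
}

# Usage (the config path depends on your installation):
#   show_attr external_parsers /path/to/htdig.conf
```

If the output stops after the first converter, a backslash is missing or is followed by stray whitespace on that line.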

If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running conv_doc.pl
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the no excerpt
error above.
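
For the robots.txt item in that list, a quick and deliberately simplified check is to see which Disallow prefixes match the document's path. The sketch below ignores User-agent groups entirely, so it may report rules that do not apply to htdig. It also expects a space after "Disallow:". The function name is made up.

```shell
# List Disallow prefixes from a robots.txt (read on stdin) that match a URL path.
# Simplification: User-agent groups are ignored, so every Disallow line is checked.
matching_disallows() {
    # $1 = URL path, e.g. /products/technical/faq.pdf
    awk -v path="$1" '
        { sub(/\r$/, "") }   # tolerate CRLF line endings
        tolower($1) == "disallow:" && $2 != "" && index(path, $2) == 1 { print $2 }
    '
}

# Usage:
#   curl -s http://svr-newlix/robots.txt | matching_disallows /products/technical/faq.pdf
```

No output means none of the listed prefixes covers the PDF.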

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



