Searchable PDF creator

2/18/2024

I'm in the process of setting up a kick-*** server (and am frustratingly close to the finish line!). One of the things I'd like it to be able to do is scan in documents and automatically OCR them, so that at the end I'm left with a searchable PDF of whatever it was I scanned in. I've already managed to automate the scanning process, so the machine will already scan to PDF and archive the result.

I can OCR the text to almost 100% accuracy (tesseract is the winner here by far) and get the layout pretty close, but I'm getting stuck when it comes to defining the layout in a PDF. I've come across several options that supposedly allow me to do what I want (tesseract to hOCR, which gives a very nearly perfect HTML layout; ocropus via a layout plugin; etc.), but none are very accurate when it comes to doing the layout. hocr2pdf, the frontrunner, successfully puts the text in the right place but garbles it, even with the "-s" (sloppy) switch. This means the PDF isn't properly searchable.

If you want to see what I mean, the various tests I've run so far can be seen below. All OCR was done on the image. Sample text output (copy/pasted from the PDF, i.e. the same text you are searching through in the PDF document) can be found here:

(hocr2pdf -i -o HTTP-ocr.pdf 95% accuracy, and it'd be nice to get the layout too.

The best by far and away is gscan2pdf: it's almost 100% accurate, it puts the text in the right place, and searching through the resulting PDF actually returns results, albeit with hilariously odd text sizes in the same sentence. BUT I can't use it over the command line. I've trawled through the code (Perl, not my strong suit) and it appears there is a subroutine called boxes which works out the size and position of the text (x,y x,y text). The subroutine uses an HTML parser (HTML::TokeParser), so I guess it's not rocket science, but, please, does anyone know of any way to get this exact functionality from the command line?

Alternatively, does anyone know of an hOCR > PDF creator of equivalent quality? If not, is anyone game to help me write a CLI equivalent?

I'll be writing up HOWTOs (scripts and everything) when I'm done to give the love back to the community, so please don't be afraid to pitch in or complain to the program creators, etc.

It's time to wrap up the list of the best PDF creators with Adobe Acrobat Online. It is a one-stop solution for all your PDF needs and lets you carry out all PDF functions in one place. To merge your documents with Adobe Acrobat, open the Merge PDF tool and select the PDFs you wish to join.
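For anyone wanting to try the CLI route, here is a rough sketch of what a "boxes"-style step might look like outside gscan2pdf. It is not gscan2pdf's actual code and it is written in Python rather than Perl. It assumes the hOCR marks each word with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1` in pixels measured from the top-left of the image; depending on the OCR engine and version the class may differ, so it is a parameter. The names `HocrBoxParser` and `to_pdf_placement`, and the sample markup, are made up for illustration. The second function converts a pixel box to PDF points (1/72 inch, origin at the bottom-left of the page) and uses the box height as a crude font size.

```python
from html.parser import HTMLParser
import re

# hOCR stores the box in the title attribute, e.g. "bbox 100 200 400 260; x_wconf 95"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every element of one hOCR class."""

    def __init__(self, target_class="ocrx_word"):  # class name is an assumption
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.boxes = []
        self._bbox = None   # bbox of the element currently being read, if any
        self._tag = None    # tag name of that element
        self._depth = 0     # nesting depth of same-named tags inside it
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Already inside a target element: only track same-named nesting.
            if tag == self._tag:
                self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if self.target_class in classes and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._tag = tag
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.boxes.append((*self._bbox, text))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def to_pdf_placement(box, page_height_px, dpi):
    """Map a top-left-origin pixel box to bottom-left-origin PDF points."""
    x0, y0, x1, y1, text = box
    scale = 72.0 / dpi                      # pixels -> points
    x_pt = x0 * scale
    y_pt = (page_height_px - y1) * scale    # bottom edge of the box, flipped
    font_size_pt = (y1 - y0) * scale        # crude: box height as font size
    return x_pt, y_pt, font_size_pt, text


# Hypothetical two-word hOCR fragment, as if from a 300 dpi, 3300 px tall scan.
SAMPLE = """<span class='ocr_line' title='bbox 100 200 900 260'>
<span class='ocrx_word' title='bbox 100 200 400 260; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 420 205 900 260; x_wconf 91'><strong>world</strong></span>
</span>"""

parser = HocrBoxParser()
parser.feed(SAMPLE)
parser.close()
for box in parser.boxes:
    print(to_pdf_placement(box, page_height_px=3300, dpi=300))
```

Taking the font size from each word's own box height means two words on the same line can end up with different sizes, which may be one source of uneven text sizes within a sentence. Using the height of the enclosing line box instead would likely give more uniform results, though I haven't checked how gscan2pdf handles this. Writing the text into the PDF at these positions would still need a PDF library, which is left out here.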
