add pdf-to-text.md

2022-08-31 20:00:29 +02:00
parent c21857da46
commit d2746b505d
1 changed files with 20 additions and 0 deletions
--- a/data/pdf-to-txt.md
+++ b/data/pdf-to-txt.md
@@ -0,0 +1,20 @@
 ---
 title: "pdf to txt"
 tags: [ "Documentation", "data", "pdf", "ocr" ]
 ---
 How to convert scanned PDF book pages to text with OCR (results are poor and will need a lot of manual correction).
 ## Dependencies
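 Search your package manager for 'tesseract english' (or whichever language the book is in).
 Arch: tesseract, tesseract-data-eng and poppler (poppler provides pdftoppm; Debian and Ubuntu ship it as poppler-utils)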
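Before running the script, it can help to confirm that both tools are on the PATH. The sketch below is only an illustration: `missing_tools` is a hypothetical helper name, not part of either package, and it simply prints every command name it cannot find.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract or poppler):
# print each command name given as an argument that is not available.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds when the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (prints nothing when both tools are installed):
# missing_tools pdftoppm tesseract
```

If tesseract is installed, `tesseract --list-langs` shows which language data files it can see, which is a quick way to check that `eng` (or your language) is available.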
 ## Script
 > pdftoppm -png *file*.pdf test
 > for x in test-\*.png; do
 > tesseract -l eng "$x" - >> *out*.txt
 > done
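The same steps can be wrapped in a small function so the page images land in a temporary directory instead of the current one. This is a minimal sketch, not a tested tool: the function name `pdf_to_txt`, its argument order and the `page` image prefix are assumptions made for illustration. It relies on pdftoppm zero-padding page numbers so that the glob returns pages in order.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF into a single text file.
# Hypothetical usage: pdf_to_txt book.pdf book.txt [lang]
pdf_to_txt() {
    pdf=$1
    out=$2
    lang=${3:-eng}   # tesseract language code, defaults to English

    # Keep the intermediate PNGs out of the working directory.
    workdir=$(mktemp -d) || return 1

    # Render each PDF page to $workdir/page-<n>.png
    if ! pdftoppm -png "$pdf" "$workdir/page"; then
        rm -rf "$workdir"
        return 1
    fi

    # Start from an empty output file, then append each page's text.
    : > "$out"
    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue   # glob matched nothing
        # "-" as the output base makes tesseract write to stdout
        tesseract -l "$lang" "$img" - >> "$out"
    done

    rm -rf "$workdir"
}
```

Unlike the loop above, this truncates the output file first, so running it twice does not append a second copy of the text.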