From d2746b505d2d8fd18dc22cf15c74082a758a5348 Mon Sep 17 00:00:00 2001
From: Malin Freeborn
Date: Wed, 31 Aug 2022 20:00:29 +0200
Subject: [PATCH] add pdf-to-text.md

---
 data/pdf-to-txt.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 data/pdf-to-txt.md

diff --git a/data/pdf-to-txt.md b/data/pdf-to-txt.md
new file mode 100644
index 0000000..40ae28c
--- /dev/null
+++ b/data/pdf-to-txt.md
@@ -0,0 +1,20 @@
+---
+title: "pdf to txt"
+tags: [ "Documentation", "data", "pdf", "ocr" ]
+---
+How to translate pdf book images to text (results are very poor, and will need lots of corrections).
+
+## Dependencies
+
+Search for 'tesseract english' (or whatever language).
+
+Arch: tesseract-data-eng and poppler-utils
+
+## Script
+
+> pdftoppm -png *file*.pdf test
+
+> for x in \*png; do
+> tesseract -l eng "$x" - >> *out*.txt
+> done
+
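
The two steps in the Script section (render pages with `pdftoppm`, then OCR each image with `tesseract`) could be wrapped in one function, roughly as sketched below. This is a sketch, not part of the patch: the function name `pdf_to_txt`, the `page` prefix, and the scratch directory are illustrative choices. It assumes `pdftoppm` and `tesseract` are on the `PATH` (on Arch, `pdftoppm` is provided by the `poppler` package; `poppler-utils` is the Debian/Ubuntu name) and that the language data for the chosen language is installed.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF into one text file.
# pdf_to_txt and the "page" prefix are hypothetical names, not from the patch.

pdf_to_txt() {
    pdf=$1           # input PDF
    out=$2           # output text file
    lang=${3:-eng}   # tesseract language code, defaults to English

    # Use a scratch directory so unrelated PNGs in the current
    # directory are not swept up by the glob below.
    tmp=$(mktemp -d) || return 1

    # Render every page to a PNG named page-<number>.png.
    pdftoppm -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    # Start with an empty output file, because the loop appends.
    : > "$out"

    # Assumption: pdftoppm zero-pads the page numbers, so the glob
    # expands in page order.
    for img in "$tmp"/page*.png; do
        [ -e "$img" ] || continue               # no pages were rendered
        tesseract -l "$lang" "$img" - >> "$out" # "-" writes text to stdout
    done

    rm -rf "$tmp"
}
```

With placeholder file names, `pdf_to_txt book.pdf book.txt` writes the English OCR text to `book.txt`, and a third argument such as `deu` selects another installed language. Truncating the output first avoids a rerun appending to an older result, which the bare `>>` loop would do.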