add pdf-to-text.md

Malin Freeborn 2022-08-31 20:00:29 +02:00
parent c21857da46
commit d2746b505d
Signed by: andonome
GPG Key ID: 52295D2377F4D70F

data/pdf-to-txt.md Normal file

@ -0,0 +1,20 @@
---
title: "pdf to txt"
tags: [ "Documentation", "data", "pdf", "ocr" ]
---
How to convert a PDF of scanned book pages (images) to text with OCR. Results are very poor and will need lots of corrections.
## Dependencies
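Install tesseract with the data for your language (search your package manager for 'tesseract english', or whatever language you need), plus the poppler tools, which provide `pdftoppm`.
Arch: tesseract, tesseract-data-eng and poppler (Debian-based systems package the tools as poppler-utils)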
## Script
> pdftoppm -png *file*.pdf test
> for x in test-\*.png; do
> tesseract -l eng "$x" - >> *out*.txt
> done
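
The steps above can also be wrapped in a small function. This is only a sketch: the name `pdf_to_txt`, its two arguments, and the `page` image prefix are made up for illustration, and it assumes `pdftoppm` and `tesseract` (with English data) are installed. It writes the page images to a temporary directory so the loop does not pick up unrelated PNG files.

```shell
#!/bin/sh
# Sketch: OCR every page of a PDF into one text file.
# pdf_to_txt and its argument names are hypothetical, not part of any tool.
pdf_to_txt() {
    pdf="$1"    # input PDF
    out="$2"    # output text file
    # Temporary directory keeps the page images away from other PNGs.
    tmp="$(mktemp -d)" || return 1
    # One PNG per page, named page-<number>.png inside the temp directory.
    pdftoppm -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }
    : > "$out"    # start from an empty output file
    for img in "$tmp"/page-*.png; do
        [ -e "$img" ] || continue    # nothing matched the pattern
        # "-" makes tesseract print the recognised text to stdout.
        tesseract -l eng "$img" - >> "$out"
    done
    rm -rf "$tmp"
}
```

Usage would look like `pdf_to_txt book.pdf book.txt`. Unlike the loop above, this truncates the output file first, so running it twice does not duplicate the text.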