add pdf-to-text.md
parent c21857da46
commit d2746b505d
20
data/pdf-to-txt.md
Normal file
@@ -0,0 +1,20 @@
---
title: "pdf to txt"
tags: [ "Documentation", "data", "pdf", "ocr" ]
---
How to convert a PDF of scanned book pages to text with OCR (results are usually poor and will need a lot of manual correction).

## Dependencies

Search your package manager for 'tesseract english' (or whatever language you need).

Arch: tesseract, tesseract-data-eng and poppler (the pdftoppm tool is packaged as poppler-utils on Debian-based systems)
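Before running anything, it can help to confirm that both tools are actually on the PATH. The following is a minimal sketch, not part of the original note: `missing_tools` is a hypothetical helper name, and the pacman line in the comment assumes the Arch package names listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# pdftoppm comes from poppler; tesseract is the OCR engine.
# On Arch, for example: sudo pacman -S tesseract tesseract-data-eng poppler
missing=$(missing_tools pdftoppm tesseract)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing >&2
else
    # List the installed language data; "eng" should appear here.
    tesseract --list-langs
fi
```

If `eng` (or your language) is not in the list, the language data package is still missing.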
## Script

Render each page to a PNG whose name starts with test-, then run OCR on each image and append the text to one file:

> pdftoppm -png *file*.pdf test

> for x in test-\*.png; do
> tesseract -l eng "$x" - >> *out*.txt
> done
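The two steps above can be wrapped into one function. This is only a sketch under some assumptions: `pdf_to_txt` is a made-up name, the 300 DPI setting (`-r 300`) is my choice rather than something from the note, and the page images go into a temporary directory so that unrelated PNG files are not picked up.

```shell
#!/bin/sh
# Hypothetical wrapper: pdf_to_txt INPUT.pdf OUTPUT.txt [LANG]
pdf_to_txt() {
    pdf=$1
    out=$2
    lang=${3:-eng}              # tesseract language code, defaults to English
    tmp=$(mktemp -d) || return 1

    # Render every page to $tmp/page-N.png.
    # 300 DPI is an assumed setting; pdftoppm defaults to 150.
    pdftoppm -r 300 -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    : > "$out"                  # start with an empty output file
    for img in "$tmp"/page*.png; do
        [ -e "$img" ] || continue       # glob matched nothing
        # "-" as the output base makes tesseract write to stdout.
        tesseract "$img" - -l "$lang" >> "$out"
    done
    rm -rf "$tmp"
}

# Run only when called with arguments, e.g.: sh pdf_to_txt.sh book.pdf book.txt
if [ "$#" -ge 2 ]; then
    pdf_to_txt "$@"
fi
```

A higher resolution tends to help OCR on small print but makes the run slower, so the value is worth adjusting per book.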