2022-08-31 18:00:29 +00:00
---
title: "pdf to txt"
tags: [ "Documentation", "data", "pdf", "ocr" ]
---
How to convert the page images of a scanned PDF book to text with OCR. The results are very poor and will need lots of corrections.
## Dependencies
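Search your package manager for 'tesseract english' (or whichever language the book is in) to find the Tesseract language data. `pdftoppm` is one of the Poppler command-line tools.

Arch: `tesseract-data-eng` and `poppler` (the Arch package that ships `pdftoppm`; Debian and Ubuntu call it `poppler-utils`)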
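
Before running the script, it can help to confirm that both tools are on the `PATH` and that the language data is installed. The following is a minimal sketch, not part of either tool: `check_deps` is a hypothetical helper name, while `tesseract --list-langs` is Tesseract's own option for listing installed languages.

```bash
# check_deps is a hypothetical helper: it reports every command that is not on PATH
# and returns non-zero if at least one is missing.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_deps pdftoppm tesseract; then
    # Lists the installed language data; "eng" should appear for English.
    tesseract --list-langs
fi
```

If the language you need is not listed, install the matching language data package first.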
## Script
2023-06-17 19:28:20 +00:00
```bash
# Render each page to a PNG named test-<page>.png; replace file.pdf with your PDF.
pdftoppm -png file.pdf test
```
```bash
# OCR each page image in order and append the recognized text to out.txt.
for x in test-*.png; do
    tesseract -l eng "$x" - >> out.txt
done
```
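
The two steps can also be wrapped into one function so the intermediate images do not clutter the working directory. This is only a sketch under a few assumptions: `pdf_to_txt` is a hypothetical name, the temporary directory and the `page` prefix are illustrative choices, and `-r 300` raises the render resolution from the `pdftoppm` default of 150 DPI because OCR generally does better on higher-resolution input. Unlike the loop above, it overwrites the output file instead of appending to it.

```bash
# pdf_to_txt INPUT.pdf OUTPUT.txt [LANG]
# Hypothetical wrapper: render the PDF pages to PNG, OCR them in page order,
# and collect the text in OUTPUT.txt. LANG defaults to "eng".
pdf_to_txt() {
    local pdf="$1" out="$2" lang="${3:-eng}"
    local tmp img

    # Keep the page images in a throwaway directory.
    tmp=$(mktemp -d) || return 1

    # Render every page at 300 DPI as page-<number>.png inside the temp directory.
    if ! pdftoppm -r 300 -png "$pdf" "$tmp/page"; then
        rm -rf "$tmp"
        return 1
    fi

    # Start from an empty output file (the loop below appends to it).
    : > "$out"

    # pdftoppm zero-pads the page numbers, so the glob expands in page order.
    for img in "$tmp"/page-*.png; do
        [ -e "$img" ] || continue   # no pages were rendered
        tesseract -l "$lang" "$img" - >> "$out"
    done

    rm -rf "$tmp"
}
```

A call would look like `pdf_to_txt book.pdf book.txt`, where both file names are placeholders. The output still needs manual correction.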