lk/data/pdf-to-txt.md
Malin Freeborn ba8026e0c3
2023-06-17 21:28:20 +02:00


---
title: "pdf to txt"
tags: [ "Documentation", "data", "pdf", "ocr" ]
---
How to convert scanned PDF pages (such as book images) to text with OCR. The results are very poor and will need a lot of manual correction.
## Dependencies
Search your package manager for 'tesseract english' (or whichever language the document is in) to find the Tesseract language data.
Arch: `tesseract-data-eng` and `poppler` (which provides `pdftoppm`; Debian and Ubuntu package it as `poppler-utils`).
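
Before running anything, it can help to confirm that both tools are on the `PATH`. The sketch below is only an illustration: `missing_tools` is a hypothetical helper name, not something either package ships.

```bash
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, named here for illustration only.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools pdftoppm tesseract` prints nothing when both are installed. `tesseract --list-langs` then shows which language data is available.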
## Script
```bash
# Render every page of file.pdf as test-<page number>.png
pdftoppm -png file.pdf test
```
```bash
# OCR each page image; "-" sends the text to stdout, which is appended to out.txt
for x in test-*.png; do
    tesseract -l eng "$x" - >> out.txt
done
done
```
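
The two steps can also be wrapped in one function, so that the page images go to a temporary directory and are cleaned up afterwards. This is a minimal sketch, not a tested tool: the function name `pdf_to_txt`, its argument order, and the `page` prefix are assumptions made for illustration.

```bash
# Sketch: OCR every page of a PDF into a single text file.
# pdf_to_txt, its arguments and the "page" prefix are illustrative assumptions.
pdf_to_txt() {
    local pdf="$1" out="$2" lang="${3:-eng}"
    local tmp page
    tmp="$(mktemp -d)" || return 1

    # Render each page as $tmp/page-<number>.png
    if ! pdftoppm -png "$pdf" "$tmp/page"; then
        rm -rf "$tmp"
        return 1
    fi

    : > "$out"    # start from an empty output file
    for page in "$tmp"/page-*.png; do
        [ -e "$page" ] || continue    # the glob matched nothing
        # "-" as the output base makes tesseract write the text to stdout
        tesseract "$page" - -l "$lang" >> "$out"
    done

    rm -rf "$tmp"
}
```

For example, `pdf_to_txt book.pdf book.txt eng` would write the recognised text of `book.pdf` to `book.txt`. The third argument is optional and defaults to `eng`.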