lk/data/pdf-to-txt.md
2026-04-28 23:56:37 +02:00

---
title: Convert a scanned PDF to text
tags:
- data
- pdf
- ocr
---
How to convert scanned PDFs to text with OCR: render each page to an image, then run tesseract on it. Results are often poor and will need a lot of manual correction.
## Dependencies
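You need tesseract with the language data for your document, plus `pdftoppm` from the Poppler utilities. Search your package manager for 'tesseract english' (or whatever language you need).
Arch: `tesseract-data-eng` and `poppler` (the package is called `poppler-utils` on Debian/Ubuntu).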
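A quick way to confirm both tools are on the `PATH` before running anything. This is only a sketch: `missing_tools` is a made-up helper name, not part of either package.

```sh
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper, not something tesseract or poppler ships.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Run it as `missing_tools pdftoppm tesseract`; no output means both are installed. `tesseract --list-langs` then shows which language data is available.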
## Script
```sh
# Render each page to test-<page>.png (replace file.pdf with your PDF)
pdftoppm -png file.pdf test
```
```sh
# OCR the rendered pages in order and append the text to out.txt
for x in test-*.png; do
    tesseract -l eng "$x" - >> out.txt
done
```
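The two steps can also be wrapped in one function, roughly like this. It is a sketch under assumptions: `ocr_pdf` is a hypothetical name, the scratch directory and the defaults for the output file and language are my choices, and `-r 300` (render at 300 DPI instead of the pdftoppm default of 150) is an assumption that may or may not improve recognition on a given scan.

```sh
# Sketch: OCR a scanned PDF into a single text file.
# Usage: ocr_pdf file.pdf [out.txt] [lang]
# ocr_pdf is a hypothetical helper; defaults and the 300 DPI value are assumptions.
ocr_pdf() {
    pdf=$1
    out=${2:-out.txt}
    lang=${3:-eng}

    # Render into a scratch directory so unrelated PNGs are not picked up.
    tmp=$(mktemp -d) || return 1

    # One PNG per page, named page-<number>.png.
    pdftoppm -r 300 -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    # Start from an empty output file, then append each page's text.
    : > "$out"
    for img in "$tmp"/page-*.png; do
        [ -e "$img" ] || continue   # glob matched nothing
        tesseract -l "$lang" "$img" - >> "$out"
    done

    rm -rf "$tmp"
}
```

For example, `ocr_pdf scan.pdf scan.txt deu` would OCR a German scan, provided the matching language data is installed. Unlike the loop above, this overwrites the output file instead of appending to an existing one.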