---
title: "pdf to txt"
tags: [ "data", "pdf", "ocr" ]
---
How to convert PDFs to text using OCR. Results are very poor and will need lots of corrections.
## Dependencies
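Search your package manager for 'tesseract english' (or whatever language you need).

Arch: `tesseract-data-eng` and `poppler`, which provides `pdftoppm` (Debian-based systems package it as `poppler-utils`).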
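
To confirm both tools are installed before running anything, a small check along these lines may help. This is only a sketch: `missing_tools` is a made-up helper name, not something shipped with either package.

```bash
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of poppler or tesseract.
missing_tools() {
    local t
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || echo "$t"
    done
}
```

Running `missing_tools pdftoppm tesseract` prints nothing when both are available. Once tesseract is installed, `tesseract --list-langs` shows which language data it can find.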
## Script
```bash
# Render each page of file.pdf to test-1.png, test-2.png, ...
pdftoppm -png file.pdf test
```
```bash
# OCR each page image and append the recognised text to out.txt
for x in *.png; do
    tesseract "$x" - -l eng >> out.txt
done
```
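
The two steps above can be folded into one function that works in a temporary directory, so the page images do not clutter the current one. This is a rough sketch under some assumptions: `pdf_to_txt` and the `page` prefix are names invented here, and the language defaults to `eng` when no third argument is given.

```bash
# Sketch: render a PDF to page images, OCR each page, collect the text.
# pdf_to_txt is a hypothetical helper name; adjust to taste.
pdf_to_txt() {
    local pdf="$1" out="$2" lang="${3:-eng}"
    local tmp x
    tmp="$(mktemp -d)" || return 1

    # Render every page to "$tmp"/page-N.png
    pdftoppm -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    # Start from an empty output file, then append each page's text
    : > "$out"
    for x in "$tmp"/page-*.png; do
        [ -e "$x" ] || continue   # no pages were produced
        tesseract "$x" - -l "$lang" >> "$out"
    done

    # Remove the intermediate page images
    rm -rf "$tmp"
}
```

Usage would look like `pdf_to_txt file.pdf out.txt eng`. Note that, unlike the loop above, this overwrites the output file rather than appending to it.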