fixed minor bugs in Markdown (#2152)

This commit is contained in:
Ludy
2024-11-03 08:20:10 +01:00
committed by GitHub
parent 2be14788b1
commit a5aac01b4d
10 changed files with 447 additions and 427 deletions

View File

@@ -3,34 +3,36 @@
This document provides instructions on how to add additional language packs for the OCR tab in Stirling-PDF, both inside and outside of Docker.
## My OCR used to work and now doesn't!
The paths have changed for the tessadata locations on new docker images, please use ``/usr/share/tessdata`` (Others should still work for backwards compatibility but might not)
The paths have changed for the tessdata locations on new Docker images. Please use `/usr/share/tessdata` (Others should still work for backward compatibility but might not).
## How does the OCR Work
Stirling-PDF uses [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) which in turn uses tesseract for its text recognition.
All credit goes to them for this awesome work!
Stirling-PDF uses [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF), which in turn uses Tesseract for its text recognition. All credit goes to them for this awesome work!
## Language Packs
Tesseract OCR supports a variety of languages. You can find additional language packs in the Tesseract GitHub repositories:
- [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast): These language packs are smaller and faster to load, but may provide lower recognition accuracy.
- [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast): These language packs are smaller and faster to load but may provide lower recognition accuracy.
- [tessdata](https://github.com/tesseract-ocr/tessdata): These language packs are larger and provide better recognition accuracy, but may take longer to load.
Depending on your requirements, you can choose the appropriate language pack for your use case. By default Stirling-PDF uses the tessdata_fast eng but this can be replaced.
Depending on your requirements, you can choose the appropriate language pack for your use case. By default, Stirling-PDF uses `tessdata_fast` for English, but this can be replaced.
### Installing Language Packs
1. Download the desired language pack(s) by selecting the `.traineddata` file(s) for the language(s) you need.
2. Place the `.traineddata` files in the Tesseract tessdata directory: `/usr/share/tessdata`
# DO NOT REMOVE EXISTING ENG.TRAINEDDATA, IT'S REQUIRED.
**DO NOT REMOVE EXISTING `eng.traineddata`, IT'S REQUIRED.**
#### Docker
### Docker Setup
If you are using Docker, you need to expose the Tesseract tessdata directory as a volume in order to use the additional language packs.
#### Docker Compose
Modify your `docker-compose.yml` file to include the following volume configuration:
#### Docker Compose
Modify your `docker-compose.yml` file to include the following volume configuration:
```yaml
services:
@@ -40,18 +42,19 @@ services:
- /location/of/trainingData:/usr/share/tessdata
```
#### Docker Run
Add the following to your existing Docker run command:
#### Docker run
Add the following to your existing docker run command
```bash
-v /location/of/trainingData:/usr/share/tessdata
```
#### Non-Docker
If you are not using Docker, you need to install the OCR components, including the ocrmypdf app.
You can see [OCRmyPDF install guide](https://ocrmypdf.readthedocs.io/en/latest/installation.html)
### Non-Docker Setup
Debian based systems, install languages with this command:
If you are not using Docker, you need to install the OCR components, including the `ocrmypdf` app. You can see the [OCRmyPDF install guide](https://ocrmypdf.readthedocs.io/en/latest/installation.html).
For Debian-based systems, install languages with this command:
```bash
sudo apt update &&\
@@ -65,7 +68,7 @@ apt search tesseract-ocr-
dpkg-query -W tesseract-ocr- | sed 's/tesseract-ocr-//g'
```
Fedora:
For Fedora:
```bash
# All languages