Optical character recognition (OCR)
Optical character recognition (OCR) is the process of converting images of text into machine-encoded text. GemBox.Pdf supports OCR via the GemBox.Pdf.Ocr.dll assembly.
Language data
The tables below provide quick links for downloading the trained language data that GemBox.Pdf.Ocr needs to recognize languages other than English.
You can also download individual files, or a zip of all files, from the official Tesseract data repository. Alternatively, the tessdata_best repository contains data trained for the highest accuracy at the cost of speed, and the tessdata_fast repository contains data that is faster but less accurate.
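As a rough illustration of how files in those repositories can be addressed, the Python sketch below builds a direct download URL for a language code. It assumes the repositories are hosted under the tesseract-ocr organization on GitHub, use a `main` branch, and name each language file `<code>.traineddata`; none of this is stated on this page. `traineddata_url` is a hypothetical helper, not part of GemBox.Pdf.Ocr, and it does not cover script data files.

```python
# Hypothetical helper: builds a download URL for a Tesseract language data file.
# Assumptions (not stated in this article): GitHub hosting under "tesseract-ocr",
# a "main" branch, and the "<code>.traineddata" file naming convention.
REPOSITORIES = ("tessdata", "tessdata_best", "tessdata_fast")


def traineddata_url(code: str, repository: str = "tessdata") -> str:
    """Return the assumed raw-file URL for a language code such as 'deu'."""
    if repository not in REPOSITORIES:
        raise ValueError(f"unknown repository: {repository}")
    # Language codes are short identifiers such as "eng" or "chi_sim".
    if not code or not code.replace("_", "").isalnum():
        raise ValueError(f"unexpected language code: {code!r}")
    return (
        f"https://github.com/tesseract-ocr/{repository}"
        f"/raw/main/{code}.traineddata"
    )


if __name__ == "__main__":
    print(traineddata_url("deu"))
    print(traineddata_url("chi_sim", "tessdata_best"))
```

Keeping the repository name as a parameter makes it easy to switch between the standard, best, and fast data sets without changing the calling code.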
Languages
Language | Language Data |
---|---|
Afrikaans | Download |
Amharic | Download |
Arabic | Download |
Assamese | Download |
Azerbaijani | Download |
Azerbaijani - Cyrillic | Download |
Belarusian | Download |
Bengali | Download |
Tibetan | Download |
Bosnian | Download |
Breton | Download |
Bulgarian | Download |
Catalan; Valencian | Download |
Cebuano | Download |
Czech | Download |
Chinese - Simplified | Download |
Chinese - Traditional | Download |
Cherokee | Download |
Corsican | Download |
Welsh | Download |
Danish | Download |
German | Download |
Dzongkha | Download |
Greek, Modern (1453-) | Download |
English | Download |
English, Middle (1100-1500) | Download |
Esperanto | Download |
Math / equation detection module | Download |
Estonian | Download |
Basque | Download |
Faroese | Download |
Persian | Download |
Filipino (old - Tagalog) | Download |
Finnish | Download |
French | Download |
German - Fraktur | Download |
French, Middle (ca.1400-1600) | Download |
Western Frisian | Download |
Scottish Gaelic | Download |
Irish | Download |
Galician | Download |
Greek, Ancient (to 1453) | Download |
Gujarati | Download |
Haitian; Haitian Creole | Download |
Hebrew | Download |
Hindi | Download |
Croatian | Download |
Hungarian | Download |
Armenian | Download |
Inuktitut | Download |
Indonesian | Download |
Icelandic | Download |
Italian | Download |
Italian - Old | Download |
Javanese | Download |
Japanese | Download |
Kannada | Download |
Georgian | Download |
Georgian - Old | Download |
Kazakh | Download |
Central Khmer | Download |
Kirghiz; Kyrgyz | Download |
Kurmanji (Kurdish - Latin Script) | Download |
Korean | Download |
Korean (vertical) | Download |
Lao | Download |
Latin | Download |
Latvian | Download |
Lithuanian | Download |
Luxembourgish | Download |
Malayalam | Download |
Marathi | Download |
Macedonian | Download |
Maltese | Download |
Mongolian | Download |
Maori | Download |
Malay | Download |
Burmese | Download |
Nepali | Download |
Dutch; Flemish | Download |
Norwegian | Download |
Occitan (post 1500) | Download |
Oriya | Download |
Panjabi; Punjabi | Download |
Polish | Download |
Portuguese | Download |
Pushto; Pashto | Download |
Quechua | Download |
Romanian; Moldavian; Moldovan | Download |
Russian | Download |
Sanskrit | Download |
Sinhala; Sinhalese | Download |
Slovak | Download |
Slovenian | Download |
Sindhi | Download |
Spanish; Castilian | Download |
Spanish; Castilian - Old | Download |
Albanian | Download |
Serbian | Download |
Serbian - Latin | Download |
Sundanese | Download |
Swahili | Download |
Swedish | Download |
Syriac | Download |
Tamil | Download |
Tatar | Download |
Telugu | Download |
Tajik | Download |
Thai | Download |
Tigrinya | Download |
Tonga | Download |
Turkish | Download |
Uighur; Uyghur | Download |
Ukrainian | Download |
Urdu | Download |
Uzbek | Download |
Uzbek - Cyrillic | Download |
Vietnamese | Download |
Yiddish | Download |
Yoruba | Download |
Scripts
Script | Script Data |
---|---|
Arabic | Download |
Armenian | Download |
Bengali | Download |
Canadian Aboriginal | Download |
Cherokee | Download |
Cyrillic | Download |
Devanagari | Download |
Ethiopic | Download |
Fraktur | Download |
Georgian | Download |
Greek | Download |
Gujarati | Download |
Gurmukhi | Download |
Han simplified | Download |
Han simplified vertical | Download |
Han traditional | Download |
Han traditional vertical | Download |
Hangul | Download |
Hangul vertical | Download |
Hebrew | Download |
Japanese | Download |
Japanese vertical | Download |
Kannada | Download |
Khmer | Download |
Lao | Download |
Latin | Download |
Malayalam | Download |
Myanmar | Download |
Oriya (Odia) | Download |
Sinhala | Download |
Syriac | Download |
Tamil | Download |
Telugu | Download |
Thaana | Download |
Thai | Download |
Tibetan | Download |
Vietnamese | Download |
Troubleshooting
GemBox.Pdf.Ocr uses the Tesseract engine under the hood. When the engine fails, the failure is usually one of two types: Error 1 or Error 2.
Error 1
This error occurs when the Tesseract engine fails during initialization.
Common reasons are:
- The language data path does not exist or doesn't hold language data files for the requested language.
- The language data was built for a different version of Tesseract. This should not happen when you use the Tesseract DLL and the language data provided by GemBox.
- The language data path contains non-ASCII characters.
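Two of the causes above can be checked before the engine is initialized. The following Python sketch is a standalone pre-flight check, not part of GemBox.Pdf.Ocr; `check_language_data` is a hypothetical helper, and it assumes each language is stored as a `<code>.traineddata` file directly in the language data folder. It cannot detect data built for a different Tesseract version.

```python
from pathlib import Path
from typing import List


def check_language_data(data_path: str, languages: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    folder = Path(data_path)

    # Non-ASCII characters in the path are one of the listed failure causes.
    if not data_path.isascii():
        problems.append(f"path contains non-ASCII characters: {data_path}")

    # The folder itself must exist before its contents can be checked.
    if not folder.is_dir():
        problems.append(f"path does not exist or is not a folder: {data_path}")
        return problems

    for code in languages:
        # Assumed naming convention: one "<code>.traineddata" file per language.
        if not (folder / f"{code}.traineddata").is_file():
            problems.append(f"missing language data file: {code}.traineddata")
    return problems


if __name__ == "__main__":
    # "tessdata" is a placeholder folder name used only for this example.
    for problem in check_language_data("tessdata", ["eng", "deu"]):
        print(problem)
```

Returning every problem at once, rather than stopping at the first, is a design choice that suits a deployment check where several files may be missing.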
Error 2
This error occurs when GemBox.Pdf.Ocr fails to load the native Tesseract and Leptonica libraries. Based on the executing CPU architecture, the loading routine tries to locate the matching version of the DLLs, which should reside in the x86 or x64 folder under your bin folder.
Common reasons for failure are:
- The Visual Studio x86 & x64 Runtime is not installed.
- The x86 and x64 versions of Leptonica and Tesseract were not copied to their respective folders in the bin directory.
- The project is running on an unsupported architecture (e.g. ARM).
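The folder and architecture causes above can be inspected with a small standalone script. The Python sketch below is only an approximation of the lookup described here, not the actual loading routine: `native_folder_name` and `find_native_dlls` are hypothetical helpers, the list of x86-family machine names is an assumption, and the script lists whatever DLL files it finds because the exact library file names are not given on this page. It does not check whether the Visual Studio runtime is installed.

```python
import platform
import struct
from pathlib import Path
from typing import List, Optional

# Assumed machine names that identify an x86-family CPU.
X86_FAMILY = {"x86", "i386", "i686", "x86_64", "amd64"}


def native_folder_name(machine: str, pointer_bits: int) -> Optional[str]:
    """Map a machine name and process bitness to 'x86' or 'x64', else None."""
    if machine.lower() not in X86_FAMILY:
        return None  # for example ARM, which is listed as unsupported
    return "x64" if pointer_bits == 64 else "x86"


def find_native_dlls(bin_dir: str, folder: str) -> List[str]:
    """Return the sorted names of DLL files in bin_dir/folder, if it exists."""
    target = Path(bin_dir) / folder
    if not target.is_dir():
        return []
    return sorted(p.name for p in target.iterdir() if p.suffix.lower() == ".dll")


if __name__ == "__main__":
    # struct.calcsize("P") is the pointer size of the running process in bytes.
    folder = native_folder_name(platform.machine(), struct.calcsize("P") * 8)
    if folder is None:
        print("unsupported architecture:", platform.machine())
    else:
        # "bin" is a placeholder for your application's output directory.
        print(folder, find_native_dlls("bin", folder))
```

An empty list for the selected folder suggests the native libraries were not copied to the output directory.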
Further diagnosis
Although the Tesseract engine returns only a success or failure response, it writes much more detail about why an operation failed to the standard output, which you can use to diagnose the error. GemBox.Pdf.Ocr also writes some information to the Tesseract trace source, which may be helpful.
You can use the following diagnostics configuration:
<system.diagnostics>
  <sources>
    <source name="Tesseract" switchValue="Verbose">
      <listeners>
        <clear />
        <add name="console" />
        <!-- Uncomment to log to a file
        <add name="file" />
        -->
      </listeners>
    </source>
  </sources>
  <sharedListeners>
    <add name="console" type="System.Diagnostics.ConsoleTraceListener" />
    <!-- Uncomment to log to a file
    <add name="file"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="c:\log\tesseract.log" />
    -->
  </sharedListeners>
</system.diagnostics>