Optical character recognition (OCR)

The example below shows how to use the GemBox.Pdf OCR feature to load text from images and scanned PDF files into a PdfDocument in C# and VB.NET.

using GemBox.Pdf;
using GemBox.Pdf.Content;
using GemBox.Pdf.Ocr;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Run OCR on the image and load the recognized text into a new PdfDocument.
        using (PdfDocument document = OcrReader.Read("%#BookPage.jpg%"))
        {
            var page = document.Pages[0];
            // Enumerate the content elements of the first page.
            var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

            while (contentEnumerator.MoveNext())
            {
                // Print only the text elements; skip other content.
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    Console.WriteLine(textElement.ToString());
                }
            }
        }
    }
}

Imports GemBox.Pdf
Imports GemBox.Pdf.Content
Imports GemBox.Pdf.Ocr
Imports System

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Run OCR on the image and load the recognized text into a new PdfDocument.
        Using document = OcrReader.Read("%#BookPage.jpg%")

            Dim page = document.Pages(0)
            ' Enumerate the content elements of the first page.
            Dim contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator()

            While contentEnumerator.MoveNext()
                ' Print only the text elements; skip other content.
                If contentEnumerator.Current.ElementType = PdfContentElementType.Text Then
                    Dim textElement = CType(contentEnumerator.Current, PdfTextContent)
                    Console.WriteLine(textElement.ToString())
                End If
            End While

        End Using

    End Sub
End Module

Screenshot: Text extracted from the image with the GemBox.Pdf.Ocr C#/VB.NET library

GemBox.Pdf.Ocr uses Tesseract internally to perform optical character recognition, so leptonica-1.82.0.dll and tesseract50.dll must be present in the x64 or x86 folder of the output directory. These DLLs are distributed with GemBox.Pdf.Ocr and were compiled with Visual Studio 2019, so you'll also need to have the Visual Studio 2019 Runtime installed.
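
If OCR fails at startup, it can help to confirm that the native libraries are where the paragraph above says they should be. The following is a minimal sketch that uses only standard .NET file APIs, not any GemBox API. The class and method names (NativeOcrCheck, FindMissing) are hypothetical, and the folder layout and file names are taken from the text above.

```csharp
using System;
using System.IO;

static class NativeOcrCheck
{
    // File names as stated in the documentation text above.
    static readonly string[] RequiredDlls = { "leptonica-1.82.0.dll", "tesseract50.dll" };

    // Returns the required DLLs that are missing from the architecture
    // subfolder ("x64" or "x86") of the given output directory.
    public static string[] FindMissing(string outputDirectory, bool is64Bit)
    {
        string archFolder = Path.Combine(outputDirectory, is64Bit ? "x64" : "x86");
        return Array.FindAll(RequiredDlls, dll => !File.Exists(Path.Combine(archFolder, dll)));
    }

    static void Main()
    {
        // Check the folder that matches the bitness of the running process.
        string[] missing = FindMissing(AppContext.BaseDirectory, Environment.Is64BitProcess);

        Console.WriteLine(missing.Length == 0
            ? "All native OCR libraries found."
            : "Missing: " + string.Join(", ", missing));
    }
}
```

This check only looks for the files. It does not verify that the Visual Studio 2019 Runtime is installed.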

In many cases, you'll get better OCR results by first improving the quality of the image you pass to GemBox.Pdf.Ocr.

OCR with different languages

Language data is necessary to perform optical character recognition with Tesseract.

GemBox.Pdf.Ocr comes with data for the English language inside the gembox_tesseract_data folder. To support other languages, download their language data and put it inside a dedicated folder that is copied to the output directory.
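
Before running OCR in another language, you may want to confirm that the dedicated folder actually reached the output directory. The sketch below lists the language data files it finds there, using only standard .NET file APIs. It assumes the downloaded files follow Tesseract's usual `<code>.traineddata` naming. The class and method names (LanguageDataCheck, AvailableLanguageCodes) are hypothetical, and the folder name "languagedata" is only an example.

```csharp
using System;
using System.IO;
using System.Linq;

static class LanguageDataCheck
{
    // Returns the language codes for which a "<code>.traineddata" file
    // exists in the given folder (assumed Tesseract naming convention).
    public static string[] AvailableLanguageCodes(string dataPath)
    {
        if (!Directory.Exists(dataPath))
            return Array.Empty<string>();

        return Directory.EnumerateFiles(dataPath, "*.traineddata")
            .Select(file => Path.GetFileNameWithoutExtension(file))
            .OrderBy(code => code, StringComparer.Ordinal)
            .ToArray();
    }

    static void Main()
    {
        // "languagedata" is an example folder name inside the output directory.
        string dataPath = Path.Combine(AppContext.BaseDirectory, "languagedata");
        string[] codes = AvailableLanguageCodes(dataPath);

        Console.WriteLine(codes.Length == 0
            ? "No language data found in " + dataPath
            : "Language data found for: " + string.Join(", ", codes));
    }
}
```

Tesseract identifies languages by short codes, so German data would normally appear as "deu" in this listing.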

The following example shows how to load a scanned PDF file with German text and save it as an editable PDF file. The screenshot of the result also shows the shortcomings of OCR when it reads unclear text.

using GemBox.Pdf;
using GemBox.Pdf.Ocr;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // TesseractDataPath specifies the directory which contains language data.
        // You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };

        // The language of the text.
        readOptions.Languages.Add(OcrLanguages.German);

        using (PdfDocument document = OcrReader.Read("%#GermanDocument.pdf%", readOptions))
        {
            document.Save("GermanDocumentEditable.pdf");
        }
    }
}

Imports GemBox.Pdf
Imports GemBox.Pdf.Ocr
Imports System

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' TesseractDataPath specifies the directory which contains language data.
        ' You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        Dim readOptions As New OcrReadOptions() With {.TesseractDataPath = "languagedata"}

        ' The language of the text.
        readOptions.Languages.Add(OcrLanguages.German)

        Using document = OcrReader.Read("%#GermanDocument.pdf%", readOptions)
            document.Save("GermanDocumentEditable.pdf")
        End Using

    End Sub
End Module

Screenshot: PDF file with recognized text using GemBox.Pdf.Ocr

Next steps

GemBox.Pdf is a .NET component that enables developers to read, merge and split PDF files or execute low-level object manipulations from .NET applications in a simple and efficient way.