Open and read Word files in C# and VB.NET

With GemBox.Document you can open and read many Word file formats (such as DOCX, DOC, RTF, ODT, and HTML) in the same way. You load a document with one of the DocumentModel.Load methods from your C# or VB.NET application. These methods work with a physical file (when you provide the file's path) or with an in-memory file (when you provide the file's Stream).

You can specify the format of your Word file by providing an instance of a class derived from LoadOptions (such as DocxLoadOptions, DocLoadOptions, RtfLoadOptions, or HtmlLoadOptions). Alternatively, you can omit the LoadOptions and let GemBox.Document choose the appropriate options when it opens the file.
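As a minimal sketch of the loading variants described above, the snippet below loads the same document from a path with explicit options, from a Stream, and with the options omitted. The file name "Input.docx" is a placeholder chosen for illustration, and the sketch assumes the DocumentModel.Load overloads that take a LoadOptions argument alongside a path or a Stream.

```csharp
using GemBox.Document;
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Placeholder file name, used only for this sketch.
        string path = "Input.docx";

        // Load from a file path and state the format explicitly.
        var fromPath = DocumentModel.Load(path, new DocxLoadOptions());

        // Load from a stream (an in-memory or otherwise non-path source),
        // again stating the format explicitly.
        using (var stream = File.OpenRead(path))
        {
            var fromStream = DocumentModel.Load(stream, new DocxLoadOptions());
            Console.WriteLine(fromStream.Content.ToString());
        }

        // Omit the options and let GemBox.Document choose them.
        var automatic = DocumentModel.Load(path);
        Console.WriteLine(automatic.Content.ToString());
    }
}
```

Passing the options explicitly is mainly useful when the source does not reveal the format on its own or when you want to set format-specific loading behavior.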

The following example shows the simplest way to read a document's text from a Word file.

Opening and reading Word document's text in C# and VB.NET
Screenshot of read text from input Word document
using GemBox.Document;
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Load Word document from file's path.
        var document = DocumentModel.Load("%InputFileName%");

        // Get Word document's plain text.
        string text = document.Content.ToString();

        // Get Word document's count statistics.
        int charactersCount = text.Replace(Environment.NewLine, string.Empty).Length;
        int wordsCount = document.Content.CountWords();
        int paragraphsCount = document.GetChildElements(true, ElementType.Paragraph).Count();
        int pageCount = document.GetPaginator().Pages.Count;

        // Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}");
        Console.WriteLine($"     Words count: {wordsCount}");
        Console.WriteLine($"Paragraphs count: {paragraphsCount}");
        Console.WriteLine($"     Pages count: {pageCount}");
        Console.WriteLine();

        // Display file's text content.
        Console.WriteLine(text);
    }
}

Imports GemBox.Document
Imports System
Imports System.Linq

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Load Word document from file's path.
        Dim document = DocumentModel.Load("%InputFileName%")

        ' Get Word document's plain text.
        Dim text As String = document.Content.ToString()

        ' Get Word document's count statistics.
        Dim charactersCount As Integer = text.Replace(Environment.NewLine, String.Empty).Length
        Dim wordsCount As Integer = document.Content.CountWords()
        Dim paragraphsCount As Integer = document.GetChildElements(True, ElementType.Paragraph).Count()
        Dim pageCount As Integer = document.GetPaginator().Pages.Count

        ' Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}")
        Console.WriteLine($"     Words count: {wordsCount}")
        Console.WriteLine($"Paragraphs count: {paragraphsCount}")
        Console.WriteLine($"     Pages count: {pageCount}")
        Console.WriteLine()

        ' Display file's text content.
        Console.WriteLine(text)

    End Sub
End Module

Reading Word document's elements

Besides reading the text of the whole document, you can also read just some part of it, like a specific Section element or HeaderFooter element. Each element has a Content property with which you can extract its text via the Content.ToString method.
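A minimal sketch of reading only parts of a document might look like the following. It prints the text of each Section and of each HeaderFooter through their Content property. The file name "Input.docx" is a placeholder, and the loop over Section.HeadersFooters and the HeaderFooterType property are used here on the assumption that they behave as their names suggest.

```csharp
using GemBox.Document;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Placeholder file name, used only for this sketch.
        var document = DocumentModel.Load("Input.docx");

        int sectionNumber = 1;
        foreach (Section section in document.Sections)
        {
            // Read the text of this section only.
            Console.WriteLine($"Section {sectionNumber++}:");
            Console.WriteLine(section.Content.ToString());

            // Read the text of each header and footer defined in this section.
            foreach (HeaderFooter headerFooter in section.HeadersFooters)
            {
                Console.WriteLine($"{headerFooter.HeaderFooterType}:");
                Console.WriteLine(headerFooter.Content.ToString());
            }
        }
    }
}
```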

The following example shows how you can open a document and traverse through all Paragraph elements and their child Run elements, and read their text and formatting. To read more about the visual information of the content elements, see the Formats and Styles help page.

Opening and reading Word document's text and formatting in C# and VB.NET
Screenshot of read elements from input Word document
using GemBox.Document;
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("%InputFileName%");
        using (var writer = File.CreateText("Output.txt"))
        {
            // Iterate through all Paragraph elements in the Word document.
            foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph))
            {
                // Iterate through all Run elements in the Paragraph element.
                foreach (Run run in paragraph.GetChildElements(true, ElementType.Run))
                {
                    string text = run.Text;
                    CharacterFormat format = run.CharacterFormat;

                    // Convert bold text to 'Mathematical Bold Italic' Unicode characters,
                    // for instance "ABC" to "𝑨𝑩𝑪" (the offsets map 'A' to U+1D468 and 'a' to U+1D482).
                    if (format.Bold)
                    {
                        text = string.Concat(text.Select(
                            c => c >= 'A' && c <= 'Z' ? char.ConvertFromUtf32(119847 + c) :
                                 c >= 'a' && c <= 'z' ? char.ConvertFromUtf32(119841 + c) :
                                 c.ToString()));
                    }

                    writer.Write(text);
                }

                writer.WriteLine();
            }
        }
    }
}

Imports GemBox.Document
Imports System
Imports System.IO
Imports System.Linq

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("%InputFileName%")
        Using writer = File.CreateText("Output.txt")

            ' Iterate through all Paragraph elements in the Word document.
            For Each paragraph As Paragraph In document.GetChildElements(True, ElementType.Paragraph)

                ' Iterate through all Run elements in the Paragraph element.
                For Each run As Run In paragraph.GetChildElements(True, ElementType.Run)

                    Dim text As String = run.Text
                    Dim format As CharacterFormat = run.CharacterFormat

                    ' Convert bold text to 'Mathematical Bold Italic' Unicode characters,
                    ' for instance "ABC" to "𝑨𝑩𝑪" (the offsets map "A" to U+1D468 and "a" to U+1D482).
                    If format.Bold Then
                        text = String.Concat(text.Select(
                            Function(c)
                                Return If(c >= "A"c AndAlso c <= "Z"c, Char.ConvertFromUtf32(119847 + AscW(c)),
                                       If(c >= "a"c AndAlso c <= "z"c, Char.ConvertFromUtf32(119841 + AscW(c)),
                                       c.ToString()))
                            End Function))
                    End If

                    writer.Write(text)
                Next

                writer.WriteLine()
            Next
        End Using

    End Sub
End Module

By combining these two examples you can achieve various tasks, like selecting only the Table elements and reading their text content, or selecting only the Picture elements and extracting their images, or reading the Run.Text property of only the highlighted elements (the ones that have CharacterFormat.HighlightColor).
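As one possible illustration of such a combination, the sketch below reads the text of every Table element and then prints the text of only the highlighted Run elements. The file name "Input.docx" is a placeholder, and comparing HighlightColor against Color.Empty is an assumption about how an unset highlight is represented; adjust the check if your version exposes it differently.

```csharp
using GemBox.Document;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Placeholder file name, used only for this sketch.
        var document = DocumentModel.Load("Input.docx");

        // Select only the Table elements and read their text content.
        foreach (Table table in document.GetChildElements(true, ElementType.Table))
        {
            Console.WriteLine("Table text:");
            Console.WriteLine(table.Content.ToString());
        }

        // Select only the Run elements that have a highlight color.
        // Assumption: an unset highlight equals Color.Empty.
        foreach (Run run in document.GetChildElements(true, ElementType.Run))
        {
            if (!run.CharacterFormat.HighlightColor.Equals(Color.Empty))
                Console.WriteLine($"Highlighted: {run.Text}");
        }
    }
}
```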

Reading Word document's pages

Word files (DOCX, DOC, RTF, HTML, etc.) don't have a page concept: they don't store how many pages they occupy or which element is on which page.

They are flow documents, meaning their content is written in a flowable manner. The page concept is specific to the application, such as Microsoft Word, that renders or displays the document.

In contrast, fixed documents (PDF, XPS, etc.) do have a page concept. Their content is fixed: the file defines the exact page and location at which each element is rendered.

GemBox.Document uses its rendering engine to paginate and render the document's content when saving to PDF, XPS, or image format.

The following example shows how you can use GemBox.Document's rendering engine to retrieve each document page as a DocumentModelPage object and read the page's text.

Opening and reading Word document's page in C# and VB.NET
Screenshot of read page from input Word document
using GemBox.Document;
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("%InputFileName%");
        var pages = document.GetPaginator().Pages;

        int pageNumber = 1;
        foreach (DocumentModelPage page in pages)
        {
            Console.WriteLine(new string('-', 50));
            Console.WriteLine($"Page {pageNumber++} of {pages.Count}");
            Console.WriteLine(new string('-', 50));

            using (var stream = new MemoryStream())
            {
                // Save the page as plain text into the memory stream.
                var txtSaveOptions = SaveOptions.TxtDefault;
                page.Save(stream, txtSaveOptions);

                // Display page's extracted text.
                Console.WriteLine(txtSaveOptions.Encoding.GetString(stream.ToArray()));
            }
        }
    }
}

Imports GemBox.Document
Imports System
Imports System.IO

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("%InputFileName%")
        Dim pages = document.GetPaginator().Pages

        Dim pageNumber As Integer = 1
        For Each page As DocumentModelPage In pages

            Console.WriteLine(New String("-"c, 50))
            Console.WriteLine($"Page {pageNumber} of {pages.Count}")
            pageNumber += 1
            Console.WriteLine(New String("-"c, 50))

            Using stream As New MemoryStream()
                ' Save the page as plain text into the memory stream.
                Dim txtSaveOptions = SaveOptions.TxtDefault
                page.Save(stream, txtSaveOptions)

                ' Display page's extracted text.
                Console.WriteLine(txtSaveOptions.Encoding.GetString(stream.ToArray()))
            End Using
        Next
    End Sub
End Module
