Convert between Word files and HTML pages

The following examples show how you can use GemBox.Document to import and export HTML content to and from DOC, DOCX, RTF, and XML formats, in your C# and VB.NET applications.

Convert Word files to HTML or MHTML

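The following example shows the import direction: it loads an HTML file, uses the single Section created on import to set all page margins to zero, and saves the document in the selected output format. Exporting a Word file to HTML or MHTML is controlled through HtmlSaveOptions, as described after the example.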

using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Load input HTML file.
        DocumentModel document = DocumentModel.Load("%InputFileName%");

        // When reading any HTML content a single Section element is created.
        // We can use that Section element to specify various page options.
        Section section = document.Sections[0];
        PageSetup pageSetup = section.PageSetup;
        PageMargins pageMargins = pageSetup.PageMargins;
        pageMargins.Top = pageMargins.Bottom = pageMargins.Left = pageMargins.Right = 0;

        // Save output PDF file.
        document.Save("Output.%OutputFileType%");
    }
}

Imports GemBox.Document

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Load input HTML file.
        Dim document As DocumentModel = DocumentModel.Load("%InputFileName%")

        ' When reading any HTML content a single Section element is created.
        ' We can use that Section element to specify various page options.
        Dim section As Section = document.Sections(0)
        Dim pageSetup As PageSetup = section.PageSetup
        Dim pageMargins As PageMargins = pageSetup.PageMargins
        With pageMargins
            .Left = 0
            .Right = 0
            .Top = 0
            .Bottom = 0
        End With

        ' Save output PDF file.
        document.Save("Output.%OutputFileType%")

    End Sub
End Module

Figure: Input HTML and the converted output PDF.

GemBox.Document creates a well-formed HTML file from the Word document's rich content and images. The images are extracted as separate files to HtmlSaveOptions.FilesDirectoryPath and referenced relative to HtmlSaveOptions.FilesDirectorySrcPath.

Alternatively, you can embed the images directly into the HTML file as base64-encoded data URLs by using the HtmlSaveOptions.EmbedImages property.
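
The two image-handling options described above can be sketched as follows. This is a minimal illustration, not the library's official sample: the file name Input.docx, the output names, and the directory values are hypothetical, and the exact values you need for FilesDirectoryPath and FilesDirectorySrcPath depend on where the HTML file will be hosted.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Hypothetical input Word file.
        DocumentModel document = DocumentModel.Load("Input.docx");

        // Option 1: write images as separate files next to the HTML page.
        // Both directory values are illustrative assumptions.
        var linkedImages = new HtmlSaveOptions()
        {
            FilesDirectoryPath = "Output_files",
            FilesDirectorySrcPath = "Output_files/"
        };
        document.Save("OutputLinked.html", linkedImages);

        // Option 2: embed images into the HTML file as base64 data URLs.
        var embeddedImages = new HtmlSaveOptions()
        {
            EmbedImages = true
        };
        document.Save("OutputEmbedded.html", embeddedImages);
    }
}
```

Embedding produces a single self-contained file at the cost of a larger HTML document, while linked images keep the page small but require the image folder to be deployed with it.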

You can also convert your Word file to the web archive format (MHTML), which bundles the page and its resources into a single file and is useful for creating a self-contained web page or an email message.

By default, GemBox.Document references images within MHTML files using Content-Location headers. However, some MHTML viewers, like Microsoft Outlook, fail to load such resources. In that case, you can switch to Content-ID (CID) references using the HtmlSaveOptions.UseContentIdHeaders property.
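
As a rough sketch of the MHTML case, the snippet below saves a Word file as a web archive and switches to Content-ID references. The file names are hypothetical, and setting HtmlType to HtmlType.Mhtml is how this sketch assumes the MHTML output is selected.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Hypothetical input Word file.
        DocumentModel document = DocumentModel.Load("Input.docx");

        var saveOptions = new HtmlSaveOptions()
        {
            // Produce a single web archive file instead of plain HTML.
            HtmlType = HtmlType.Mhtml,
            // Reference resources by Content-ID for viewers such as Outlook.
            UseContentIdHeaders = true
        };

        document.Save("Output.mhtml", saveOptions);
    }
}
```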

HTML styles and fonts in PDF

GemBox.Document supports inline styles as well as internal and external stylesheets. It uses a subset of CSS properties plus some additional Microsoft Word-specific properties (like mso-pagination and mso-rotate). It also applies print media rules (e.g. @media print { ... }).

To get the most accurate PDF conversion, you should provide printer‑friendly HTML pages to GemBox.Document. In other words, your website's content and structure should ideally be optimized for print.

Styling often differs between screen and print media, which is why it is common practice to add a separate print stylesheet to the HTML after the standard stylesheet (e.g. <link rel="stylesheet" media="print" href="print.css" />). Alternatively, you can use a print media rule in your existing stylesheet.

Note that when converting an HTML page to a PDF document, the fonts used on the website should be installed on the machine that executes the code. If they are not, you can provide them as custom or embedded fonts.
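
One way to supply custom fonts, sketched under the assumption that the font files sit in a local folder, is to point FontSettings.FontsBaseDirectory at that folder before loading the HTML. The folder name Fonts and the file names below are hypothetical.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: "Fonts" is a folder that contains the font files
        // used by the web page and not installed on this machine.
        FontSettings.FontsBaseDirectory = "Fonts";

        // Hypothetical input and output file names.
        DocumentModel document = DocumentModel.Load("Input.html");
        document.Save("Output.pdf");
    }
}
```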

Convert HTML to PDF with headers and footers

GemBox.Document supports reading various page options (like margins, size, and orientation) and page styles (like borders and color) from the HTML content itself, through the @page rule or CSS properties on the <body> element.

Also, GemBox.Document supports creating HeaderFooter elements from HTML content. If <header> is the first element in the HTML file, then its content will be read as a document's default header; if <footer> is the last element in the HTML file, then its content will be read as a document's default footer.

The following example shows how you can create a PDF file from HTML text, with pages that have landscape orientation and repeated headers and footers.

using GemBox.Document;
using System.IO;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var html = @"
<html>
<style>
  @page {
    size: A5 landscape;
    margin: 6cm 1cm 1cm;
    mso-header-margin: 1cm;
    mso-footer-margin: 1cm;
  }

  body {
    background: #EDEDED;
    border: 1pt solid black;
    padding: 20pt;
  }

  br {
    page-break-before: always;
  }

  p { margin: 0; }
  header { color: #FF0000; text-align: center; }
  main { color: #00B050; }
  footer { color: #0070C0; text-align: right; }
</style>

<body>
  <header>
    <p>Header text.</p>
  </header>
  <main>
    <p>First page.</p>
    <br>
    <p>Second page.</p>
    <br>
    <p>Third page.</p>
    <br>
    <p>Fourth page.</p>
  </main>
  <footer>
    <p>Footer text.</p>
    <p>Page <span style='mso-field-code:PAGE'>1</span> of <span style='mso-field-code:NUMPAGES'>1</span></p>
  </footer>
</body>
</html>";

        var htmlLoadOptions = new HtmlLoadOptions();
        using (var htmlStream = new MemoryStream(htmlLoadOptions.Encoding.GetBytes(html)))
        {
            // Load input HTML text as stream.
            var document = DocumentModel.Load(htmlStream, htmlLoadOptions);
            // Save output PDF file.
            document.Save("OutputWithHeaderFooter.%OutputFileType%");
        }
    }
}

Imports GemBox.Document
Imports System.IO

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim html = "
<html>
<style>
  @page {
    size: A5 landscape;
    margin: 6cm 1cm 1cm;
    mso-header-margin: 1cm;
    mso-footer-margin: 1cm;
  }

  body {
    background: #EDEDED;
    border: 1pt solid black;
    padding: 20pt;
  }

  br {
    page-break-before: always;
  }

  p { margin: 0; }
  header { color: #FF0000; text-align: center; }
  main { color: #00B050; }
  footer { color: #0070C0; text-align: right; }
</style>

<body>
  <header>
    <p>Header text.</p>
  </header>
  <main>
    <p>First page.</p>
    <br>
    <p>Second page.</p>
    <br>
    <p>Third page.</p>
    <br>
    <p>Fourth page.</p>
  </main>
  <footer>
    <p>Footer text.</p>
    <p>Page <span style='mso-field-code:PAGE'>1</span> of <span style='mso-field-code:NUMPAGES'>1</span></p>
  </footer>
</body>
</html>"

        Dim htmlLoadOptions As New HtmlLoadOptions()
        Using htmlStream As New MemoryStream(htmlLoadOptions.Encoding.GetBytes(html))

            ' Load input HTML text as stream.
            Dim document = DocumentModel.Load(htmlStream, htmlLoadOptions)
            ' Save output PDF file.
            document.Save("OutputWithHeaderFooter.%OutputFileType%")

        End Using

    End Sub
End Module

Figure: HTML content converted to PDF with headers, footers, and landscape orientation.
