How to load and process a PDF file in C#
You can read PDF files with the GemBox components in several ways. Each option has its own advantages and suits different scenarios:
- Logical loading, which is best for extracting text from tables and paragraphs.
- High-fidelity loading, which produces visually very similar results when converting to DOCX or other file formats.
- Loading using GemBox.Pdf, which gives you low-level control when editing a PDF file.
The sections below explain each option so that you can choose the right one for loading your PDF files.
Logical loading
PDF is a fixed document format: the location of every piece of text, border line, background fill, and so on is specified in page coordinates and may additionally be transformed. The GemBox.Document model, in contrast, is a flow document format, similar to HTML. Therefore, to read a PDF file into GemBox.Document, elements such as tables and paragraphs must be recognized from the positioned text and the lines/paths on the PDF pages.
The recognition of PDF logical structure in GemBox.Document is based on various heuristics that we have implemented and plan to improve and extend over time based on customer feedback. However, note that fully correct recognition is impossible to achieve just by reading the content of PDF pages, because higher-level information is required to disambiguate certain cases.
For example, a PDF page with text in two columns could be a table with a single row and two cells, or a section with two columns. Similarly, a PDF page with a single short line of text in the middle of it could be a paragraph with left alignment and left indentation, with right alignment and right indentation, or with some other combination.
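To make the two-column ambiguity concrete, here is a toy sketch in plain C#. It does not use GemBox and is not GemBox's heuristic; the `Fragment` type, the `LayoutSketch` helpers, and the coordinates are all hypothetical. It assumes PDF-style coordinates in which a larger Y value is higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a piece of text plus the page coordinates where it starts.
public record Fragment(string Text, double X, double Y);

public static class LayoutSketch
{
    // Groups fragments that share a baseline (within a tolerance) into lines,
    // ordered top to bottom, with each line ordered left to right.
    public static List<List<Fragment>> GroupIntoLines(
        IEnumerable<Fragment> fragments, double tolerance = 2.0)
    {
        var lines = new List<List<Fragment>>();
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - fragment.Y) <= tolerance)
                last.Add(fragment);
            else
                lines.Add(new List<Fragment> { fragment });
        }
        foreach (var line in lines)
            line.Sort((a, b) => a.X.CompareTo(b.X));
        return lines;
    }

    // Reading 1: treat every line as a table row and every fragment as a cell.
    public static List<string> ReadAsTable(List<List<Fragment>> lines) =>
        lines.Select(line => string.Join(" | ", line.Select(f => f.Text))).ToList();

    // Reading 2: treat fragments with the same X as one column of flowing text.
    // Exact X equality is good enough for this toy; real data would need a tolerance.
    public static List<string> ReadAsColumns(List<List<Fragment>> lines) =>
        lines.SelectMany(line => line)
             .GroupBy(f => f.X)
             .OrderBy(group => group.Key)
             .Select(group => string.Join(" ", group.Select(f => f.Text)))
             .ToList();
}

public static class Program
{
    public static void Main()
    {
        var fragments = new[]
        {
            new Fragment("Left 1", 72, 700), new Fragment("Right 1", 320, 700),
            new Fragment("Left 2", 72, 680), new Fragment("Right 2", 320, 680),
        };
        var lines = LayoutSketch.GroupIntoLines(fragments);

        Console.WriteLine("As a table:");
        LayoutSketch.ReadAsTable(lines).ForEach(Console.WriteLine);

        Console.WriteLine("As two columns:");
        LayoutSketch.ReadAsColumns(lines).ForEach(Console.WriteLine);
    }
}
```

Both readings are consistent with the same coordinates: the first pairs "Left 1" with "Right 1" as cells of one row, while the second reads "Left 1 Left 2" before moving on to the right column. Nothing in the positions alone says which one the author intended.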
Logical loading is the default option when loading PDF files in GemBox.Document.
var document = DocumentModel.Load("Input.pdf");
This has the same effect as explicitly specifying the load type in PdfLoadOptions:
var document = DocumentModel.Load("Input.pdf", new PdfLoadOptions()
{
    LoadType = PdfLoadType.Logical
});
You can then work with the loaded document. For example, you can print all paragraphs in the document to the console:
foreach (var paragraph in document.GetChildElements(true, ElementType.Paragraph))
    Console.WriteLine(paragraph.Content.ToString());
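Since logical loading also tries to recognize tables, a similar loop can read the text of table cells. The following is a minimal sketch rather than a definitive implementation: it assumes that Input.pdf contains at least one table that the heuristics recognize, and it uses the free-limited key, which restricts the size of the documents you can process.

```csharp
using System;
using System.Linq;
using GemBox.Document;
using GemBox.Document.Tables;

// The free-limited key has size limits; replace it with your own license key.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Logical loading is the default, so no PdfLoadOptions are needed here.
var document = DocumentModel.Load("Input.pdf");

// Visit every recognized table and print each row on its own line.
foreach (Table table in document.GetChildElements(true, ElementType.Table))
{
    foreach (TableRow row in table.Rows)
    {
        // Trim the paragraph breaks that end each cell's content.
        var cells = row.Cells.Select(cell => cell.Content.ToString().Trim());
        Console.WriteLine(string.Join(" | ", cells));
    }

    // Blank line between tables.
    Console.WriteLine();
}
```

If nothing is printed, no table was recognized in the file; the content may then be available only as paragraphs.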
High-fidelity loading
High-fidelity loading uses text frames and text boxes to position the text at the same location where it appeared on the PDF page. The PDF page graphics are converted to shapes or rendered into temporary images that are then inserted into the page.
Although the output of this approach looks very similar or identical to the input PDF, it has the following drawbacks:
- The logical structure of the document is not available. For example, if you have a table in a PDF file and you want to extract the content of a cell in the second row and third column, that is not possible since there is no table.
- Text search is limited. Since logically connected text segments might end up in different text frames, looking for a term that spans two or more text frames is not possible.
- Editing is limited. Since text segments are absolutely positioned on a page using text frames, removing or adding text doesn't reflow the rest of the content; the positions of all text frames are independent of each other.
To load a PDF file using high-fidelity loading, you can use PdfLoadType.HighFidelity:
var document = DocumentModel.Load("Input.pdf", new PdfLoadOptions()
{
    LoadType = PdfLoadType.HighFidelity
});
You can then save the document, for example to the DOCX file format:
document.Save("Output.docx");
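One way to observe the text search limitation described above is to load the same file with both load types and count the matches for a phrase. This is only a sketch under stated assumptions: the search term is hypothetical, the file is assumed to be small enough for the free-limited key, and the actual counts depend entirely on how the text in your file is split into frames.

```csharp
using System;
using System.Linq;
using GemBox.Document;

// The free-limited key has size limits; replace it with your own license key.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Hypothetical search term; pick a phrase that you know appears in your file.
const string term = "annual report";

foreach (var loadType in new[] { PdfLoadType.Logical, PdfLoadType.HighFidelity })
{
    var document = DocumentModel.Load("Input.pdf", new PdfLoadOptions()
    {
        LoadType = loadType
    });

    // Count the occurrences of the term in the loaded document.
    int matches = document.Content.Find(term).Count();
    Console.WriteLine($"{loadType}: {matches} match(es) for \"{term}\"");
}
```

If the high-fidelity count is lower than the logical one, some occurrences of the phrase were probably split across text frames.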
Loading using GemBox.Pdf
Alternatively, you can use the GemBox.Pdf component. First, add a using directive for its namespace:
using GemBox.Pdf;
Then use the component to load the PDF file:
using (var document = PdfDocument.Load("Reading.pdf"))
{
    // Work with the document
}
This gives you lower-level access to the PDF elements, more control over editing the document, and more precise information when extracting the content and properties of PDF elements.
For example, inside the using block you can iterate over all pages and print their content to the console:
foreach (var page in document.Pages)
{
    Console.WriteLine(page.Content.ToString());
}
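Put together, the fragments above could look like the following complete program. It is a minimal sketch that assumes a file named Reading.pdf exists next to the executable and that it fits within the limits of the free-limited key; the page counter is added only for readability.

```csharp
using System;
using GemBox.Pdf;

// The free-limited key has size limits; replace it with your own license key.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// The using statement disposes the document (and its underlying file) when done.
using (var document = PdfDocument.Load("Reading.pdf"))
{
    int pageNumber = 1;
    foreach (var page in document.Pages)
    {
        Console.WriteLine($"--- Page {pageNumber++} ---");

        // ToString() on the page content returns the text found on that page.
        Console.WriteLine(page.Content.ToString());
    }
}
```

Note that GemBox.Document and GemBox.Pdf each define their own ComponentInfo class, so a file that imports both namespaces needs to qualify the call, for example GemBox.Pdf.ComponentInfo.SetLicense.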
To learn more about GemBox.Pdf you can check out its examples.
How to choose the loading approach
The following table can help you choose which option is the best for your use case.
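 | Logical loading with GemBox.Document | High-fidelity loading with GemBox.Document | Loading with GemBox.Pdf |
---|---|---|---|
Summary | The file is loaded by trying to detect the logical structure of the document. | The file is loaded by absolutely positioning paragraphs, shapes, and images on pages. | The loaded model corresponds directly to the (low-level) PDF specification. |
Advantages | Tables and paragraphs are recognized, so their text can be extracted. | The output looks very similar or identical to the input PDF when converting to DOCX or other file formats. | Lower-level access to PDF elements, more control over editing, and more precise information when extracting content and properties. |
Disadvantages | Recognition is based on heuristics and cannot be fully correct in every case. | The logical structure (for example, tables) is not available; text search is limited; editing is limited. | |
When to use | Extracting text from tables and paragraphs. | Converting to DOCX or other file formats when visual similarity matters. | Editing a PDF file with low-level control. |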
Conclusion
In this article, you learned how to load PDF files in various ways, using either GemBox.Document or GemBox.Pdf. With this comparison, you should be able to choose the right component and loading option for working with your PDF documents in C#.
For more information, check the GemBox.Pdf documentation and the GemBox.Document documentation.
If you have any questions regarding this article, refer to our forum page or submit a ticket to our technical support.