File Structure
The PDF file structure determines how objects are stored in a PDF file, how they are accessed, and how they are updated. This structure is independent of the semantics of the objects.
This section describes how GemBox.Pdf loads a PDF document from a PDF file and saves it back, provides efficient random access to objects in a PDF file, performs incremental updates of a PDF file, and supports other PDF file-related features.
The internal implementation of the PDF file structure is currently not exposed through the GemBox.Pdf interface.
Instead, the PDF file structure is accessed in GemBox.Pdf through the following members:
- PdfDocument.Load methods,
- PdfDocument.Save methods,
- PdfLoadOptions type and its members and
- PdfSaveOptions type and its members.
The following subsections give more details about each PDF file-structure-related operation in GemBox.Pdf.
Creating a new PDF document
To create a new in-memory PDF document, use the PdfDocument constructor.
It will create an empty PDF document. To make the PDF document valid, at least one page should be added to it after its creation.
This PDF document is contained entirely in memory and is not associated with any PDF file.
PdfIndirectObjects created with GemBox.Pdf will have an Id equal to Undefined until they are written to a PDF file.
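As a minimal sketch of the steps above, the following snippet creates an empty in-memory document and adds one page to make it valid. The license call and the free-limited key are assumptions about how the component is set up in your project; they are not part of the file-structure interface described here.

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Create an empty PDF document that lives entirely in memory
// and is not associated with any PDF file.
using (var document = new PdfDocument())
{
    // Add one page so that the document is valid.
    document.Pages.Add();

    Console.WriteLine(document.Pages.Count);
}
```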
Loading a PDF document from a PDF file
A PdfDocument can be loaded from a PDF file either by specifying a path to the file via the PdfDocument.Load(String) or PdfDocument.Load(String, PdfLoadOptions) methods, or by specifying a PDF file stream via the PdfDocument.Load(Stream) or PdfDocument.Load(Stream, PdfLoadOptions) methods.
Overloads that do not accept a PdfLoadOptions parameter use PdfLoadOptions.Default.
A PDF document can be loaded in read-only mode to prevent accidental changes to the PDF file. For more information, see the ReadOnly property.
PdfIndirectObjects read from the PDF file will have a unique Id that is different from Undefined. PdfIndirectObjects created with GemBox.Pdf will have an Id equal to Undefined until they are written to a PDF file.
Important
A loaded PDF document is associated with the PDF file from which it was loaded, and the PDF file remains open until the Close() or Dispose() method is called. Any PDF document that is associated with a PDF file should be closed (disposed); otherwise, memory and resource leaks might occur because the PDF file stream might not be closed until the application exits.
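The loading options above can be sketched as follows. The file name "Input.pdf" is hypothetical, and the example assumes that ReadOnly is a settable property on PdfLoadOptions, as the text above suggests. The using blocks ensure that the associated PDF file is closed.

```csharp
using System;
using System.IO;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Load from a path using the default load options.
// The file stays open until the document is disposed.
using (var document = PdfDocument.Load("Input.pdf"))
{
    Console.WriteLine(document.Pages.Count);
}

// Load from a stream in read-only mode.
// Assumption: PdfLoadOptions exposes a settable ReadOnly property.
using (var stream = File.OpenRead("Input.pdf"))
using (var document = PdfDocument.Load(stream, new PdfLoadOptions { ReadOnly = true }))
{
    Console.WriteLine(document.Pages.Count);
}
```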
Loading a PDF document fully into memory
A PDF file associated with the loaded PdfDocument must remain open because GemBox.Pdf reads the PDF file in a lazy fashion (indirect object values are parsed from the PDF file only when they are requested for the first time). This feature enables GemBox.Pdf to perform fast reading and updating of the PDF file.
If you want to dispose of the associated PDF file but still be able to fully use the PdfDocument instance, the PdfDocument instance must first be fully loaded from the associated PDF file into memory by using the Load() instance method. The associated PDF file can then be disposed as explained in the Closing the associated PDF file subsection.
Effectively, this operation will read the PDF file in an eager fashion (values of all indirect objects accessible from the PdfDocument instance will be, if not already, parsed at that point).
If you do not load the PdfDocument instance into memory before disposing of the associated PDF file, an exception might occur when an indirect object's value is requested for the first time and cannot be parsed from the closed PDF file.
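A minimal sketch of this sequence, assuming a hypothetical "Input.pdf": the document is parsed eagerly with the Load() instance method, the associated file is closed, and the in-memory document is used afterwards.

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Lazy load: indirect object values are parsed only when first requested.
var document = PdfDocument.Load("Input.pdf");

// Eager load: parse all indirect objects reachable from the document.
document.Load();

// Close the associated PDF file; the document is no longer tied to it.
document.Close();

// The in-memory document is still usable because it was fully loaded.
Console.WriteLine(document.Pages.Count);
```

Skipping the document.Load() call here risks an exception on the first access to a value that was never parsed while the file was open.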
Unloading a PDF document from memory
A PdfDocument can be reset to the state it was in when it was initially loaded from a PDF file by using the Unload() instance method. This feature enables GemBox.Pdf to work efficiently with large PDF documents. For instance, when reading a very large PDF document (with thousands of pages), you can free the memory held by pages that were already read so that it is available for reading additional pages.
The references to all PdfIndirectObjects whose Values were already parsed from the PDF file associated with this PdfDocument are cleared so they can be parsed again when requested for the first time.
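The following sketch shows one way to use Unload() while walking through a large document. The file name and the batch size of 100 pages are hypothetical, and the example assumes that page.Content.ToString() returns the page's text. No page reference is kept across the Unload() call because parsed values are cleared at that point.

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = PdfDocument.Load("Large.pdf"))
{
    long totalCharacters = 0;

    for (int i = 0; i < document.Pages.Count; i++)
    {
        // Re-fetch the page from the document on every iteration.
        var page = document.Pages[i];

        // Assumption: ToString() on the page content yields its text.
        totalCharacters += page.Content.ToString().Length;

        // Hypothetical batch size: release parsed objects every 100 pages.
        if ((i + 1) % 100 == 0)
            document.Unload();
    }

    Console.WriteLine(totalCharacters);
}
```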
Closing the associated PDF file
If a PdfDocument instance will no longer be used but is associated with a PDF file (either because it was loaded from a PDF file or saved to a PDF file), it must be closed or disposed by calling the Close() or Dispose() method.
This operation will close the associated PDF file.
At this point, all PdfIndirectObjects in a PDF document will have an Id equal to Undefined until they are written to a PDF file.
The PdfDocument is still fully usable at this point if it was loaded into memory beforehand, as explained in the subsection on loading a PDF document fully into memory.
Note
Closing or disposing the PdfDocument instance with the Close() or Dispose() method does not mean that the PdfDocument instance can no longer be used; it only means that the associated PDF file is closed and that the PdfDocument instance is no longer associated with any PDF file.
Saving the PDF document to a new PDF file
A PdfDocument can be saved to a new PDF file by specifying a path to a PDF file via the PdfDocument.Save(String) method or by specifying a PDF file stream via the PdfDocument.Save(Stream) method.
All save operations on the PdfDocument use the same PdfSaveOptions instance, specified in the SaveOptions property, to control the details of the output PDF file structure. If the SaveOptions property is not specified, it will be set to a copy of the current PdfSaveOptions.Default.
Various PDF file structure settings can be specified via PdfSaveOptions. Among them is CrossReferenceType, which specifies whether the output PDF file will be compressed, meaning that the indirect objects and the information about their location in the PDF file are written compactly and compressed. For more details, see the PdfCrossReferenceType enumeration. Note that some settings are applicable only if a PDF document is saved to a new PDF file, while others are applicable only if it is saved to the same PDF file, as explained in the Saving the PDF document to the same PDF file (incremental update) subsection.
After the save operation, PdfIndirectObjects whose Id was Undefined will have a unique Id that is different from Undefined.
Important
A saved PDF document is associated with the PDF file to which it was saved, and the PDF file remains open until the Close() or Dispose() method is called. Any PDF document that is associated with a PDF file should be closed (disposed); otherwise, memory and resource leaks might occur because the PDF file stream might not be closed until the application exits.
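Saving to a new PDF file with a non-default file structure setting might look like the sketch below. The output path is hypothetical, and the enumeration member name PdfCrossReferenceType.Stream is an assumption; check the PdfCrossReferenceType enumeration for the exact members available in your version.

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = new PdfDocument())
{
    document.Pages.Add();

    // Assumption: 'Stream' is the member that selects the compressed
    // cross-reference form described above.
    document.SaveOptions.CrossReferenceType = PdfCrossReferenceType.Stream;

    // Save to a new PDF file; the document is now associated with it
    // and the file stays open until the using block disposes the document.
    document.Save("Output.pdf");
}
```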
Saving the PDF document to another file format
A PdfDocument can be saved to another file format, such as an image (PNG, JPEG, GIF, BMP, TIFF or WMP) or XPS, in one of the following ways:
- by specifying a path to a file with an appropriate file extension via the PdfDocument.Save(String) method,
- by specifying a path to a file and saving options via the PdfDocument.Save(String, SaveOptions) method or
- by specifying a file stream and saving options via the PdfDocument.Save(Stream, SaveOptions) method.
Depending on the file extension, the PdfDocument.Save(String) method uses one of the following SaveOptions instances:
- None or .pdf: the PdfSaveOptions instance specified in the SaveOptions property.
- .png, .jpg, .jpeg, .gif, .bmp, .tif, .tiff or .wdp: the ImageSaveOptions instance returned by the static Image property, with the Format property set to the appropriate ImageSaveFormat value based on the file extension.
- .xps: the XpsSaveOptions instance returned by the static Xps property.
Important
If a PDF document is saved to an image (PNG, JPEG, GIF, BMP, TIFF or WMP) or XPS file, it remains associated with the same PDF file with which it was associated before the save operation. A PDF document can be associated only with a file whose format is PDF.
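As a brief sketch of extension-based saving, the snippet below relies only on the PdfDocument.Save(String) overload. The file names are hypothetical, and the output format is selected from the file extension as described above.

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = PdfDocument.Load("Input.pdf"))
{
    // The ".png" extension selects image output.
    document.Save("Output.png");

    // The ".xps" extension selects XPS output.
    document.Save("Output.xps");

    // The document is still associated with "Input.pdf" here,
    // not with the image or XPS files.
}
```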
Saving the PDF document to the same PDF file (incremental update)
Changes made to the PdfDocument after the load or the last save operation can be saved to the same PDF file by using the Save() method.
GemBox.Pdf is able to automatically determine what objects have been changed, and if no object has been changed, then the PDF file won't be updated.
The incremental save operation on the PdfDocument uses the PdfSaveOptions instance specified in the SaveOptions property to control the details of the output PDF file structure that will be appended with the changed and new objects. If the SaveOptions property is not specified, it will be set to a copy of the current PdfSaveOptions.Default.
Note that some settings are applicable only if a PDF document is saved to a new PDF file, while others are applicable only if it is saved to the same PDF file.
Tip
Using the incremental update is the preferred way of making small changes to (potentially) large PDF files because it utilizes less memory.
After the incremental save operation, PdfIndirectObjects whose Id was Undefined will have a unique Id that is different from Undefined.
Important
A PDF document is associated with the PDF file to which it was incrementally updated, and the PDF file remains open until the Close() or Dispose() method is called. Any PDF document that is associated with a PDF file should be closed (disposed); otherwise, memory and resource leaks might occur because the PDF file stream might not be closed until the application exits.
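An incremental update can be sketched as a load, a small change and a parameterless Save() call. The file name "Input.pdf" is hypothetical, and the file must be writable (that is, not loaded in read-only mode).

```csharp
using System;
using GemBox.Pdf;

// Assumption: the free-limited license key is sufficient for this sketch.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = PdfDocument.Load("Input.pdf"))
{
    // Make a small change: append an empty page.
    document.Pages.Add();

    // Append only the changed and new objects to the same PDF file.
    document.Save();
}
```

Because only changed and new objects are appended, this pattern suits small edits to large files.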
See Also
PDF Specification ISO 32000-1:2008, section '7.5 File Structure'