Font embedding in the PDF export of FR VCL 5

What is font "embedding"?

PDF documents often contain text set in various fonts. For Acrobat Reader or another PDF viewer to display this text, it must have access to the font files used in the document. If the OS on which you open the document lacks the needed fonts, the document may become unreadable. To resolve this issue, the PDF standard allows font files to be copied into a PDF document, which guarantees that the fonts will be available and the document readable wherever it is opened. This copying of font files is called "embedding". Of course, a problem appears here: font files usually take up a lot of space, and the PDF document may become unacceptably large. The PDF export in FR4 can embed fonts, but it does so in the simplest way, by copying all the needed font files into the document in full. Sometimes this increases the size of the PDF document to more than 10 megabytes.

Font embedding in FR VCL 5

It was decided to improve font embedding in FR5: the new PDF export extracts only the needed characters from the fonts used. A document usually uses up to 50 symbols from a font, because it is usually written in one language and uses symbols from one alphabet. But fonts, especially universal fonts such as Arial Unicode MS, contain from 3 to 50 thousand symbols.

For instance, take a simple report with a few paragraphs written in the Arial font. The report uses only 41 symbols, while Arial contains 3415 symbols. Thus, by embedding only these 41 symbols into the PDF file, more than 700 KB of space can be saved (this is the file size of Arial). Another spectacular example: export this report to PDF with embedding enabled and look at the file size of the resulting document: it is more than 30 MB. If the same report is exported with the PDF export in FR5, the resulting file is only 116 KB, and Acrobat Reader even opens it much faster.
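To get a feel for these numbers, the set of characters a report actually uses can be collected in a few lines. This is only an illustrative Python sketch; the function name `used_charcodes` is an assumption made for this example and is not part of FastReport.

```python
def used_charcodes(texts):
    """Return the sorted list of distinct character codes used in the given strings."""
    codes = set()
    for text in texts:
        codes.update(ord(ch) for ch in text)
    return sorted(codes)


if __name__ == "__main__":
    codes = used_charcodes(["Open Type Font"])
    print(len(codes), codes)
```

For the text "Open Type Font" this yields 10 distinct codes, the same ones that appear in the /W array of the PDF listing later in this article.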

A test program that exports reports to PDF with the new PDF export is available in this forum topic.

What's inside a font file

A font is usually stored in a TTF file, and sometimes in a TTC file, which is simply several TTF files put together. A font consists of two main parts:
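The difference between the two containers is visible in the first bytes of the file: a TTC file starts with the tag `ttcf`, followed by a version, the number of fonts and one offset per font. The Python sketch below reads that header; the function name `list_font_offsets` is an assumption made for this illustration and it performs no validation of the data.

```python
import struct


def list_font_offsets(data):
    """Return the byte offsets of the individual fonts in a .ttf or .ttc file image."""
    if data[:4] == b"ttcf":
        # TTC header: tag, major version, minor version, number of fonts,
        # then one 32-bit offset per contained font (all big-endian).
        _major, _minor, num_fonts = struct.unpack(">HHI", data[4:12])
        return list(struct.unpack(">%dI" % num_fonts, data[12:12 + 4 * num_fonts]))
    # A plain TTF file starts directly with its own header: one font at offset 0.
    return [0]


if __name__ == "__main__":
    fake_ttc = b"ttcf" + struct.pack(">HHI", 1, 0, 2) + struct.pack(">II", 20, 300)
    print(list_font_offsets(fake_ttc))
```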

A set of glyphs. Each glyph represents a symbol or a part of a symbol. A glyph in TTF is described with Bezier curves and, possibly, with a small picture that tells how the glyph should look at a very small font size.
The "cmap" table, which defines a mapping from 2-byte charcodes to glyph indices. This table is needed because a font rarely contains glyphs for all possible Unicode values, so it has to state which Unicode values are represented in the font.

Some symbols are drawn with one glyph: symbols of the English alphabet are usually drawn this way. Other symbols can be represented with several glyphs: for example, a symbol with an accent mark consists of a glyph representing the mark and a glyph representing the rest of the symbol. Some fonts define glyphs that represent several characters (so-called ligatures), glyphs that represent a character depending on its position in a word, and other special glyphs. Thus, an algorithm that determines which glyphs are needed to draw a word is not as simple as one might think.
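The case of glyphs built from other glyphs can be sketched in Python with a toy font model. The dictionaries `cmap` and `components` and the function `glyph_closure` are assumptions made for this illustration; ligatures and positional forms are not handled here.

```python
def glyph_closure(text, cmap, components):
    """Collect every glyph index needed to draw `text` in a toy font model.

    cmap:       dict mapping a character code to a glyph index
    components: dict mapping a composite glyph index to the glyph indices it is built from
    """
    needed = {0}  # glyph 0 is conventionally the "missing glyph", so it is always kept
    stack = [cmap[ord(ch)] for ch in text if ord(ch) in cmap]
    while stack:
        gid = stack.pop()
        if gid in needed:
            continue
        needed.add(gid)
        # A composite glyph pulls in the glyphs it refers to (hypothetical data).
        stack.extend(components.get(gid, []))
    return sorted(needed)


if __name__ == "__main__":
    toy_cmap = {ord("e"): 5, 0xE9: 9}   # 0xE9 is "e" with an acute accent
    toy_components = {9: [5, 12]}       # glyph 9 = base letter 5 + accent mark 12
    print(glyph_closure("\u00e9", toy_cmap, toy_components))
```

Even though the text contains a single character, three glyphs besides glyph 0 have to be kept: the composite glyph and both of its parts.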

Font embedding implementation in the PDF export

The whole point of embedding is this: the information that corresponds to the used glyphs is extracted from the glyph set, from the "cmap" table and from many other font tables, and then similar font tables are created and filled in with the extracted information. As a result, a new font file is produced, but with a smaller number of glyphs. An example of such a shortened font file is here.
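The idea can be sketched in Python on a toy model where a font is just a list of glyphs plus a cmap dictionary. This is not the FR5 implementation: the function `subset_font` and the decision to renumber glyphs are assumptions made for this illustration, and a real subsetter also has to rebuild the other font tables (for example `loca` and `hmtx`).

```python
def subset_font(glyphs, cmap, used_codes):
    """Build a reduced glyph list and cmap containing only the glyphs for `used_codes`.

    glyphs:     list of glyph outlines indexed by glyph index (toy model: any objects)
    cmap:       dict mapping a character code to a glyph index
    used_codes: iterable of character codes that occur in the document
    """
    used_codes = [c for c in used_codes if c in cmap]
    # Keep glyph 0 (the "missing glyph") plus every glyph that a used code maps to.
    old_gids = sorted({0} | {cmap[c] for c in used_codes})
    # Assumption of this sketch: surviving glyphs are renumbered densely from 0.
    renumber = {old: new for new, old in enumerate(old_gids)}
    new_glyphs = [glyphs[old] for old in old_gids]
    new_cmap = {c: renumber[cmap[c]] for c in used_codes}
    return new_glyphs, new_cmap


if __name__ == "__main__":
    toy_glyphs = ["notdef", "A", "B", "C", "D"]
    toy_cmap = {65: 1, 66: 2, 67: 3, 68: 4}
    print(subset_font(toy_glyphs, toy_cmap, [68, 66]))
```

In the toy example only "B" and "D" are used, so the new font keeps three glyphs and its cmap points at the new, smaller indices.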

Then the resulting font file is written into the PDF file. Let's look at this for a report that contains a single line of text, "Open Type Font". After exporting to PDF, we get this file.

Code

%PDF-1.5
%ЂЂЂЂ

% This pdf object represents the TfrxMemoView with the text "Open Type Font".
% It chooses a font with the Tf operator and draws the text with the Tj operator.

2 0 obj
<< /Length 257 /Length1 257 >>
stream
...
/F0 10 Tf
...
<004F00700065006E0020005400790070006500200046006F006E0074> Tj
...
endstream
endobj

% The general font description.
% The field /Encoding defines a mapping from charcodes to CIDs.
% Here this mapping is identity, in other words CID of a charcode equals the charcode.

3 0 obj
<<
/Type /Font
/Subtype /Type0
/BaseFont /IJIVDA+Arial
/Encoding /Identity-H
/DescendantFonts [11 0 R]
/ToUnicode 6 0 R
>>
endobj

9 0 obj
<<
/Type /FontDescriptor
/FontName /IJIVDA+Arial
/FontFamily /IJIVDA+Arial
/FontBBox [-1361 -665 4096 2060]
/ItalicAngle 0
/Ascent 1854
/Descent -434
/CapHeight 0
/StemV 0
/Flags 32
/CIDSet 5 0 R
/FontFile2 8 0 R
>>
endobj

11 0 obj
<<
/Type /Font
/Subtype /CIDFontType2
/CIDToGIDMap 10 0 R
/BaseFont /IJIVDA+Arial
/CIDSystemInfo 7 0 R
/FontDescriptor 9 0 R
/W [ 32 [277.8] 70 [610.8] 79 [777.8] 84 [610.8] 101 [556.2] 110 [556.2] 111 [556.2] 112 [556.2] 116 [277.8] 121 [500.0] ]
>>
endobj

% This is the TTF file itself.
% Between "stream" and "endstream" the .ttf file with the needed glyphs is written.

8 0 obj
<< /Length 15148 /Length1 10856 /Filter [ /ASCIIHexDecode /FlateDecode ] >>
stream
7801c57a7b...00175ac7e0
endstream
endobj

% The mapping from CIDs to GIDs.
% GID is a glyph index.

10 0 obj
<< /Length 86 /Length1 244 /Filter [ /ASCIIHexDecode /FlateDecode ] >>
stream
78016360a0103052a81f593b133207cc66868bb0c059b818ac18126c0cec0c1c50514eb82c1700078f0038
endstream
endobj

...
%%EOF

All font parameters are rather simple. The only interesting part here is how Tj draws text.

As an argument it accepts a sequence of 2-byte charcodes. In this example these charcodes are 4f 70 65 6e ...
Each charcode is transformed into a CID according to the field /Encoding /Identity-H. Now Tj has a sequence of CIDs (here it is the same: 4f 70 65 6e ...) and works with these CIDs from then on.
Each CID is mapped to a GID using the field /CIDToGIDMap. A GID is a glyph index. After that, Tj has a sequence of glyphs and can draw the text.
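The three steps above can be sketched in Python. The sketch assumes that the /CIDToGIDMap stream has already been decoded (in the listing it is hex-encoded and compressed) and that it holds two big-endian bytes per CID. All function names are assumptions made for this illustration, and the GID values in the example map are hypothetical.

```python
def decode_tj_operand(hex_string):
    """Step 1: split the hex operand of Tj into 2-byte charcodes."""
    raw = bytes.fromhex(hex_string)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def cid_to_gid(cid, cid_to_gid_map):
    """Step 3: look up a glyph index in a decoded /CIDToGIDMap (2 bytes per CID)."""
    offset = 2 * cid
    if offset + 2 > len(cid_to_gid_map):
        return 0  # choice of this sketch: CIDs past the end of the map get glyph 0
    return int.from_bytes(cid_to_gid_map[offset:offset + 2], "big")


def glyphs_for_tj(hex_string, cid_to_gid_map):
    charcodes = decode_tj_operand(hex_string)
    cids = charcodes  # Step 2: with /Identity-H the CID equals the charcode
    return [cid_to_gid(cid, cid_to_gid_map) for cid in cids]


def build_cid_to_gid_map(pairs, size):
    """Helper for the example only: build a map of `size` CIDs from a {cid: gid} dict."""
    table = bytearray(2 * size)
    for cid, gid in pairs.items():
        table[2 * cid:2 * cid + 2] = gid.to_bytes(2, "big")
    return bytes(table)


if __name__ == "__main__":
    operand = "004F00700065006E0020005400790070006500200046006F006E0074"
    print("".join(chr(c) for c in decode_tj_operand(operand)))
    toy_map = build_cid_to_gid_map({0x4F: 3, 0x70: 8}, 122)  # hypothetical GIDs
    print(glyphs_for_tj("004F0070", toy_map))
```

Decoding the operand from the listing gives back the charcodes of "Open Type Font"; which GIDs they map to depends on the contents of the embedded map.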

It's noteworthy that the PDF standard doesn't use the "cmap" table of the embedded font, which defines the mapping from charcodes to glyphs. One reason for this is that Tj is not limited to ASCII or Unicode values: in effect it accepts a sequence of glyphs, which can include ligatures and other special symbols that have no associated charcodes.