PDF Files | Wordbee

The following file extensions are supported when setting up file format configurations for PDF files: .pdf.

Default PDF Configuration

Every file format has a default configuration to ensure that a file can be translated; however, it does not handle every complex property that a source file may contain. The default configuration for PDF files does the following:

Extracts all text.
Does not show leading and trailing whitespaces.
Converts sequences of multiple whitespaces into markup.
Does not show leading and trailing characters that are neither letters nor digits.
Converts words containing neither letters nor digits into markup.
Uses SRX Rules for text segmentation.
Always splits text at line breaks.
Ignores formatting changes applied to whitespaces.
Ignores Asian font changes if the source language is not Asian.
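
To make the whitespace and symbol rules above more concrete, the following Python sketch imitates their visible effect on one extracted string. It is an illustration only and not Wordbee's implementation: the function name editor_view and the <ws/> placeholder are assumptions made for this example.

```python
import re

WS_TAG = "<ws/>"  # hypothetical placeholder standing in for a markup tag


def editor_view(raw: str) -> str:
    """Approximate how a raw string might look under the default rules."""
    # Keep only the span from the first to the last letter or digit, so that
    # leading and trailing whitespace and symbols are not shown.
    match = re.search(r"[^\W_](?:.*[^\W_])?", raw, flags=re.DOTALL)
    if match is None:
        return ""  # no letters or digits at all: nothing is shown as text
    core = match.group(0)
    # A sequence of two or more whitespace characters becomes a single tag.
    return re.sub(r"\s{2,}", WS_TAG, core)


print(editor_view("  • Total:    42 items.  "))  # Total:<ws/>42 items
```

The single space in "42 items" is left alone because only sequences of multiple whitespaces are converted.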

Custom PDF Configurations

If your PDF files need special handling, a custom file format configuration might be necessary to achieve the right results in your target file. Beyond the default configuration selections, Wordbee Translator offers many additional options to ensure that the extraction is successful. These options are grouped into the tabs described below.

General

The General tab provides options for handling whitespaces, symbols, and text segmentation.

Content
- Text can be extracted from many .pdf files, such as those that are created from word-processing software and are not locked. For such files, you can set up a PDF file format configuration that causes the text to be extracted to the Wordbee Translator editor. The system does not perform optical character recognition (OCR) on the document.
  Restriction: Text is not extracted in the following cases:
  - The text is part of an image.
  - Your Wordbee Translator subscription does not include the optional PDF-to-text converter.
  - The .pdf file is part of a Beebox Regular project.
- For other .pdf files, such as those created by scanning documents, you can set up a PDF file format configuration that does not extract text, but instead creates empty segments in the editor. Translators can then view the source file and enter translations in the editor, in the created segments.
Whitespaces and Symbols - Options in this configuration section make it possible to show or hide leading and trailing whitespaces, convert sequences of whitespaces into markup, etc.
Text Segmentation - Enable SRX rules for text segmentation and elect to always split text at line breaks.
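
The interaction between SRX rules and the option to always split at line breaks can be sketched as follows. SRX rules pair a "before break" and an "after break" pattern and are marked either as a break or as an exception; the first rule that matches at a position decides. The two rules, the function name, and the parameter names below are assumptions chosen for illustration, not the rules shipped with Wordbee Translator.

```python
import re
from typing import List, Tuple

# (is_break, before_break_pattern, after_break_pattern), checked in order.
Rule = Tuple[bool, str, str]

RULES: List[Rule] = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: no break after these
    (True, r"[.?!]", r"\s"),               # break after sentence punctuation
]


def segment(text: str, rules: List[Rule] = RULES,
            split_at_line_breaks: bool = True) -> List[str]:
    """Split text into segments using SRX-style break and exception rules."""
    blocks = text.split("\n") if split_at_line_breaks else [text]
    segments: List[str] = []
    for block in blocks:
        start = 0
        for pos in range(1, len(block)):
            for is_break, before, after in rules:
                before_ok = re.search(rf"(?:{before})\Z", block[:pos])
                after_ok = re.match(after, block[pos:])
                if before_ok and after_ok:
                    if is_break:
                        segments.append(block[start:pos].strip())
                        start = pos
                    break  # the first matching rule wins
        segments.append(block[start:].strip())
    return [s for s in segments if s]


print(segment("Dr. Smith arrived. He was late.\nNext line"))
```

With split_at_line_breaks set to False, text on either side of a line break stays in the same segment unless a break rule applies there.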

Do Not Translate

The Do Not Translate tab may be used to specify which colors, segments, and words within the PDF file should not be translated.

Colors - Select one or more text colors to be excluded when text is extracted for translation.

Segments - Mark specific text segments as translatable (extracted) or not translatable (ignored) when found by the system. Regular expressions may be entered to protect entire segments or just terms within the file. These segments or terms are not extracted for translation and are not taken into account during the word count step. In the translation editor, they appear as tags, which protects parts of the text that should not be translated but should still appear in the translated document. A good example is entering terms or regular expressions to protect brand names or confidential content such as software codes.
Words or Terms - Configure single words, terms, or segments to be excluded from the translation. Any text captured by regular expressions is converted into markup and not modified. A description may be added to avoid confusion. When no description is added, the original text appears upon hovering over the markup. This feature is useful for protecting terms that must not be modified by the translator, such as a company name or technical references.
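
As a rough illustration of how regular expressions can turn protected terms into markup, the Python sketch below replaces each match with a numbered tag and records the hover text. The brand name "Acme Cloud", the part number pattern, the tag format, and the function name are all hypothetical; Wordbee Translator's regular expression dialect and internal tag format may differ.

```python
import re
from typing import Dict, List, Optional, Tuple

# (pattern, optional description); both entries are made-up examples.
PATTERNS: List[Tuple[str, Optional[str]]] = [
    (r"\bAcme Cloud\b", "brand name"),
    (r"\b[A-Z]{2,4}-\d{3,}\b", None),  # part numbers, no description given
]


def protect(text: str,
            patterns: List[Tuple[str, Optional[str]]]
            ) -> Tuple[str, List[Dict[str, str]]]:
    """Replace every match with a numbered tag and collect tag details."""
    combined = "|".join(
        f"(?P<p{i}>{pattern})" for i, (pattern, _) in enumerate(patterns)
    )
    tags: List[Dict[str, str]] = []

    def to_tag(match: "re.Match[str]") -> str:
        index = int(match.lastgroup[1:])  # which pattern produced the match
        description = patterns[index][1]
        original = match.group(0)
        # Without a description, the original text serves as the hover text.
        tags.append({"text": original, "hover": description or original})
        return f"<{len(tags)}/>"

    return re.sub(combined, to_tag, text), tags


protected, tags = protect("Acme Cloud supports part AB-1234.", PATTERNS)
print(protected)  # <1/> supports part <2/>.
```

Combining the patterns into one expression means the text is scanned once, so a later pattern cannot match inside a tag produced by an earlier one.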

Fonts

The Fonts tab may be used to replace fonts for certain language-specific translation scenarios.

Replace font when translating into Japanese, Chinese, Korean, and similar - Some fonts cannot correctly display Asian characters. If they are used in a document translated into Japanese, the final document will be unreadable. This option forces the use of a user-defined font for Asian text when no compatible font is defined.
Replace font when translating into Arabic, Hebrew, Farsi, and other "complex script" languages - Some fonts cannot correctly display complex script characters. If they are used in a document translated into Arabic, for example, the final document will be unreadable. This option forces the use of a user-defined font for complex script text when no compatible font is defined.
Replace all fonts in the translated document - An option for translating between languages that use different scripts and fonts. In some cases, the fonts used are not compatible with the target language, which makes the text unreadable. This option tells the system to replace all fonts so that the translated document is easy to read.

For example, an OCR conversion can create a PDF file containing extensive font changes. As a result, the document in Wordbee will contain a large amount of markup. This option may be used to override the fonts of the document with a single user-defined font to drastically reduce the markup.

Reduce Markup

The Reduce Markup tab may be used to substantially reduce markup for improved translation memory hits and less work for your translators. Please note that removing markup can result in minor differences between fonts and text styles. Formatting changes in a PDF document are represented in Wordbee by markups (aka tags).

Remove irrelevant font or style changes - These options are designed to eliminate font or style changes that are irrelevant and cannot be distinguished visually.

Formatting changes that have been applied to whitespaces - A large amount of markup can complicate the translation work. This option ignores formatting changes applied to spaces.
Formatting changes to the text that does not contain letters or digits - As above, text that contains no letters or digits and carries a formatting change is represented by markup (tags). This option prevents excessive markup for these types of formatting changes.
Ignore Asian font changes if the source language is not Asian - Do not tick if the original text contains Asian characters.
Asian or Arabic/Hebrew/Farsi font changes when the source language is not the same.
OCR (Optical Character Recognition) noise reduction - A document created with an OCR tool may contain a large number of formatting changes. One example is when the OCR tool adjusts the character spacing between every pair of characters. This option ignores the formatting changes typically applied by OCR tools that can be dropped without losing the general appearance of the document.
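
The idea behind the first two options can be sketched in Python by modelling a line of PDF text as a list of (text, style) runs, where each change of style between neighbouring runs would produce one markup tag. This is a simplified model under stated assumptions: the run structure, the style labels, and the function name are hypothetical and do not reflect how Wordbee Translator stores formatting.

```python
import re
from typing import List, Tuple

Run = Tuple[str, str]  # (text, style label such as a font name)


def has_letters_or_digits(text: str) -> bool:
    return re.search(r"[^\W_]", text) is not None


def reduce_markup(runs: List[Run]) -> List[Run]:
    """Drop style changes on runs that contain no letters or digits."""
    merged: List[Run] = []
    for text, style in runs:
        if merged and not has_letters_or_digits(text):
            # Whitespace or symbols only: ignore the formatting change and
            # continue with the style of the preceding run.
            style = merged[-1][1]
        if merged and merged[-1][1] == style:
            merged[-1] = (merged[-1][0] + text, style)  # same style: join
        else:
            merged.append((text, style))
    return merged


runs = [
    ("Hello", "Arial"), (" ", "Times"), ("world", "Arial"),
    (" - ", "Symbol"), ("done", "Arial Bold"),
]
print(reduce_markup(runs))
```

In this example the five runs collapse into two, so four style changes shrink to one, while the visible difference is limited to the font of a space and a dash.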

Visually reduce markup in the translation editor - Preserves the original styles and fonts to reduce markup in the translation editor.
