Web Page Configuration Options

Configure how Wordbee Translator extracts and handles content from web page files. These settings control encoding, HTML tag behavior, attribute translation, content exclusion, and more.

The web page configuration applies to the following file extensions: .htm, .html, .xhtml, .htmls, .php, .php2, .php3, .php4, .php5, .php6, .phtml, .csm, .jsp, .ahtm, .ahtml.

To access web page configurations:

Go to Settings > Customization > Document Formats.
Select Web Pages from the format drop-down menu.
Click on a configuration profile to view it, or click Edit to modify it.

To learn more about working with file format configurations, see Working with file formats under Learn More at the end of this page.

General Tab

The General tab controls encoding, HTML code handling, HTML attribute display, content exclusion, and text segmentation.

Encoding

Setting	Description
Default encoding	The character encoding used to read the file. Defaults to UTF-8. Other options include Windows, Macintosh, and ASCII encodings.
Convert incompatible characters	When enabled, characters not compatible with the target encoding are converted into entity references.
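As a rough illustration of what converting incompatible characters into entity references means, the Python sketch below encodes a string to ASCII and substitutes numeric character references for anything ASCII cannot represent. Whether the product emits numeric or named references is not stated on this page, so the exact output form here is an assumption.

```python
# Illustration only: Python's built-in "xmlcharrefreplace" error handler
# swaps characters the target encoding cannot represent for numeric
# character references. The product's own output form may differ.
text = "Café à 5 €"
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded.decode("ascii"))  # Caf&#233; &#224; 5 &#8364;
```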

HTML Code

These settings control how the HTML markup is presented to translators.

Setting	Description
Hide beginning/ending whitespaces	Hides leading and trailing whitespace characters from the translator view.
Compress sequences of whitespaces	Replaces multiple consecutive whitespace characters with a single space.
Replace &nbsp; by blanks	Converts non-breaking space entities (&nbsp;) into regular blank spaces.
Show preceding/trailing HTML tags	Displays the HTML tags that surround the translatable text.
Entity references display	Controls how HTML entity references are shown to translators.
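The combined effect of the whitespace options above can be sketched in a few lines of Python. The helper name and the order in which the steps run are assumptions made for illustration; the page does not describe how the product implements them.

```python
import re

def normalize_for_display(text: str) -> str:
    """Hypothetical helper mirroring three of the options above."""
    text = text.replace("&nbsp;", " ")   # Replace &nbsp; by blanks
    text = re.sub(r"\s+", " ", text)     # Compress sequences of whitespaces
    return text.strip()                  # Hide beginning/ending whitespaces

print(normalize_for_display("  Hello&nbsp;&nbsp;world,\n\t welcome   back  "))
# Hello world, welcome back
```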

HTML Attributes

These settings control how the content of HTML attributes is displayed to translators when attributes are marked as translatable.

Setting	Description
Show beginning/end whitespaces	Shows leading and trailing whitespace in attribute values.
Compress sequences of whitespaces	Replaces multiple consecutive whitespace characters with a single space in attribute values.
Entity references display	Controls how entity references inside attribute values are shown to translators.

Exclude Content

Use this section to exclude specific content from translation. Enter text segments or regular expressions. When a match is found, you can mark the segment as:

Not translatable — the segment is hidden from translators.
Translatable — the segment is shown for translation.
Potentially not translatable — the segment is shown but flagged for review.
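A minimal Python sketch of how such exclusion rules might classify segments is shown below. The two patterns, the "first matching rule wins" behavior, and the default of "translatable" are all assumptions chosen for the example, not documented product behavior.

```python
import re

# Hypothetical rules as (regular expression, status) pairs.
RULES = [
    (r"^\d+(\.\d+)*$", "not translatable"),                # version numbers such as 2.4.1
    (r"^[A-Z0-9_]{3,}$", "potentially not translatable"),  # codes such as SKU_1234
]

def classify(segment: str) -> str:
    """Return the status of the first rule that matches the segment."""
    for pattern, status in RULES:
        if re.search(pattern, segment):
            return status
    return "translatable"  # assumed default when no rule matches

for segment in ["2.4.1", "SKU_1234", "Contact us"]:
    print(segment, "->", classify(segment))
```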

Text Segmentation

Setting	Description
Enable SRX rules	When enabled, text is segmented using SRX rules.
Split text at line breaks	When enabled, a new segment starts at each line break.

Server and Client Side Code Tab

The Server and Client Side Code tab controls how the system handles code sections (JavaScript, PHP, and other server-side code) embedded in web pages.

Extract Quoted Strings

Web pages often contain JavaScript or server-side code (such as PHP) with quoted strings that may need translation. Enable this option to extract those strings automatically.

Setting	Description
Extract quoted strings	When enabled, quoted strings inside code sections are extracted for translation.
Compress sequences of whitespaces	Replaces multiple whitespace characters with a single space inside extracted strings.
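To make the idea concrete, the Python sketch below pulls quoted strings out of a small script block and then compresses whitespace inside them. The regular expression is a deliberately simplified assumption (it ignores escaped quotes) and is not the product's extraction logic.

```python
import re

# Simplified matcher for single- or double-quoted strings (no escape handling).
QUOTED = re.compile(r"""(["'])(.*?)\1""")

code = """<script>
  var title = "Welcome    back";
  var count = items.length;
  alert('Changes saved');
</script>"""

# Group 2 holds the text between the quotes.
extracted = [m.group(2) for m in QUOTED.finditer(code)]
# Optional step corresponding to "Compress sequences of whitespaces".
compressed = [re.sub(r"\s+", " ", s) for s in extracted]
print(compressed)  # ['Welcome back', 'Changes saved']
```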

Exclude Quoted Strings

Use this section to prevent specific quoted strings from being extracted. Enter text segments or regular expressions. When a match is found, the segment can be marked as translatable or not translatable.

Include or Exclude Additional Content

Use regular expressions to extract text inside code sections that goes beyond quoted strings. The expressions can capture any content.

Note

The regex must contain capture groups named pattern1, pattern2, etc. For example: @(?<pattern1>.*?)@ extracts any text delimited by @.
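The pattern from the note can be tried out in Python as sketched below. Python's re module spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly; the sample code string is made up for the example.

```python
import re

# Same idea as @(?<pattern1>.*?)@, written in Python's named-group syntax.
pattern = re.compile(r"@(?P<pattern1>.*?)@")

code = "var greeting = @Hello there@; var id = 42; var label = @Sign in@;"
extracted = [m.group("pattern1") for m in pattern.finditer(code)]
print(extracted)  # ['Hello there', 'Sign in']
```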

HTML Tags and Attributes Tab

The HTML Tags and Attributes tab controls which HTML attributes are extracted for translation, which tags are treated as inline (non-breaking), and which tags preserve whitespace.

Translatable Attributes

This grid defines which HTML attribute values are extracted for translation. By default, common attributes such as alt, title, placeholder, content, and value are pre-configured.

Each row in the grid specifies a rule with the following columns:

Column	Description
Attribute	The name of the HTML attribute (for example, content, alt, title).
Value	An optional filter for the attribute’s own value. Leave empty to match all values of the attribute. When a value is specified, the rule applies only when the attribute’s value matches it exactly (matching is case-sensitive). Displays (any) when no filter is set.
Parent tag	An optional filter for the parent HTML tag. For example, setting this to meta restricts the rule to attributes within <meta> tags only.
Advanced condition	An optional condition based on a sibling attribute. For example, you can require that a sibling attribute name has the value description for the rule to apply.
Translate	Set to Yes to extract the attribute value for translation, or No to exclude it.
Use regex	When enabled, all text fields in the row (attribute name, value, parent tag, and condition) are interpreted as regular expressions instead of exact matches.

To add a translatable attribute rule:

Click Edit in the upper right corner.
Enter the Attribute name (for example, content).
Optionally enter a Value to filter by (for example, HELP).
Optionally enter a Parent tag (for example, meta).
Set Translate to Yes or No.
Click Save to apply the configuration.

Filtering by Attribute Value

The Value column allows you to target specific attribute values instead of applying a rule to every instance of an attribute. This is useful when your HTML contains the same attribute name with different values that require different handling.

How value matching works:

No value specified (empty): The rule applies to all instances of the attribute, regardless of its value. This is the default behavior.
Value specified: The rule applies only when the attribute’s value matches the specified text exactly. Matching is case-sensitive.
Regex enabled: When Use regex is enabled for the row, the value is treated as a regular expression pattern.

Example: Translating only specific meta tag content

Given the following HTML:

HTML

<meta name="description" content="About us">
<meta name="keywords" content="HELP">

To translate the content attribute of the description meta tag while excluding the keywords value HELP:

Attribute	Value	Parent tag	Translate
content	(empty)	meta	Yes
content	HELP	meta	No

The first row marks all content attributes within <meta> tags as translatable. The second row overrides this for the specific value HELP, excluding it from translation. The result: About us is extracted for translation, while HELP is not.

Value-specific rules take precedence

When both a general rule (no value filter) and a value-specific rule exist for the same attribute, the value-specific rule always wins, regardless of row order in the grid.
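The precedence described above can be sketched in Python as follows. The AttributeRule class and should_translate function are hypothetical names introduced for illustration, and the sketch covers exact matching only (no regex rows and no advanced conditions).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AttributeRule:
    attribute: str
    value: Optional[str]       # None stands for "(any)"
    parent_tag: Optional[str]  # None stands for any parent tag
    translate: bool

def should_translate(rules: List[AttributeRule], tag: str,
                     attribute: str, value: str) -> Optional[bool]:
    """Return the Translate setting of the winning rule, or None if no rule matches."""
    matches = [
        r for r in rules
        if r.attribute == attribute
        and (r.parent_tag is None or r.parent_tag == tag)
        and (r.value is None or r.value == value)  # exact, case-sensitive
    ]
    if not matches:
        return None
    # Value-specific rules win over general ones, regardless of row order.
    specific = [r for r in matches if r.value is not None]
    return (specific or matches)[0].translate

rules = [
    AttributeRule("content", None, "meta", True),
    AttributeRule("content", "HELP", "meta", False),
]
print(should_translate(rules, "meta", "content", "About us"))  # True
print(should_translate(rules, "meta", "content", "HELP"))      # False
```

Reversing the two rows in the list gives the same results, which is the point of the precedence rule.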

Example: Using regex to match a pattern

To translate only title attributes whose values start with translate:

Attribute	Value	Parent tag	Use regex	Translate
title	^translate.*	(empty)	Yes	Yes

This matches <p title="translate-me"> but not <p title="do-not-translate">.
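A quick way to check a value pattern before entering it in the grid is to test it against sample values, as in the Python sketch below. It assumes the product applies the pattern with search semantics; with the ^ anchor the result is the same either way, and the second check shows why the anchor matters.

```python
import re

value_pattern = re.compile(r"^translate.*")

results = {title: bool(value_pattern.search(title))
           for title in ["translate-me", "do-not-translate"]}
print(results)  # {'translate-me': True, 'do-not-translate': False}

# Without the ^ anchor, the pattern would also hit "do-not-translate".
unanchored = bool(re.search(r"translate.*", "do-not-translate"))
print(unanchored)  # True
```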

Non-Breaking Tags

Non-breaking (inline) tags appear within translatable text rather than splitting it into separate segments. These are typically links, images, or text formatting elements.

The following tags are pre-configured as non-breaking: a, acronym, b, big, blink, br, cite, code, dfn, em, font, i, iframe, img, kbd, s, small, span, strike, strong, sub, sup, tt, u, var, ruby, rt, rc, rp, rbc, rtc, asp:label.

You can add additional non-breaking tags if needed for your content. Tag names are case-insensitive.
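The difference between breaking and non-breaking tags can be illustrated with a small Python sketch. It uses a subset of the default list and a simplified tag regex, and it treats every tag outside that subset as a segment boundary; all of these are assumptions for the example, not a description of the actual parser.

```python
import re
from typing import List

NON_BREAKING = {"a", "b", "em", "i", "img", "span", "strong"}  # subset of the defaults
TAG = re.compile(r"<\s*/?\s*([a-zA-Z][\w:-]*)[^>]*>")

def segments(html: str) -> List[str]:
    result, current, pos = [], "", 0
    for m in TAG.finditer(html):
        current += html[pos:m.start()]
        if m.group(1).lower() in NON_BREAKING:  # tag names are case-insensitive
            current += m.group(0)               # inline tag stays inside the segment
        else:
            if current.strip():
                result.append(current.strip())
            current = ""                        # any other tag ends the segment
        pos = m.end()
    current += html[pos:]
    if current.strip():
        result.append(current.strip())
    return result

print(segments("<div><p>Read the <B>full</B> guide.</p><p>Contact us.</p></div>"))
# ['Read the <B>full</B> guide.', 'Contact us.']
```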

Whitespace Preserving Tags

Whitespace is generally collapsed in HTML. Tags listed in this section are exceptions: whitespace inside them is preserved during parsing.

The following tags are pre-configured: pre, script, style.

This section is read-only and cannot be modified.

CMS Specific Settings Tab

The CMS Specific Settings tab controls how the parser handles custom markup used by content management systems such as WordPress or Drupal.

Many CMS platforms use "shortcodes" — special markup enclosed in square brackets — within HTML content. For example: [image title="This is a text"]. Shortcodes are markup and do not need translation.

Setting	Description
Content between double brackets is considered markup	When enabled, text enclosed in square brackets (shortcodes) is treated as non-translatable markup.

Tip

If certain shortcode attributes need translation (for example, the title attribute in [image title="..."]), add those attribute names in the Translatable Attributes grid on the HTML Tags and Attributes tab.
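For a sense of what treating shortcodes as markup involves, the Python sketch below finds bracketed shortcodes and replaces each with a placeholder. The regular expression and the {1} placeholder are assumptions for illustration; the page does not describe how the parser recognizes shortcodes internally.

```python
import re

# Assumed shortcode shape: an opening bracket, anything but brackets, a closing bracket.
SHORTCODE = re.compile(r"\[[^\[\]]+\]")

html = '<p>Our office [image title="This is a text"] is open daily.</p>'
found = SHORTCODE.findall(html)
masked = SHORTCODE.sub("{1}", html)
print(found)   # ['[image title="This is a text"]']
print(masked)  # <p>Our office {1} is open daily.</p>
```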

Post-processing Tab

The Post-processing tab defines regex-based find and replace rules that are applied to the translated output file during reconstruction. Use these rules to adjust markup, inject attributes, or rewrite CSS for specific target languages (for example, to add dir="rtl" and lang="ar" to HTML output when translating into Arabic).

Rules run every time the translated file is generated: both when previewing a download and when creating a delivery. They are applied sequentially, in the order listed.

Figure: The Post-processing tab showing five example RTL rules targeting Arabic, Hebrew, Farsi, and Urdu output.

Post-processing Rules

Each row in the grid defines one rule with the following columns:

Column	Description
On	Enables or disables the rule. Set to Yes to apply the rule, or No to skip it without deleting it.
Language pattern	A regular expression matched against the target language code. Leave empty to apply the rule to all target languages. For example, `^(ar|he|fa|ur)` applies the rule only when the target is Arabic, Hebrew, Farsi, or Urdu.
Search regex	The regular expression pattern to find in the output text. Use capturing groups (parentheses) to reference parts of the match in the replacement.
Replacement	The text that replaces each match. Reference capture groups from the search pattern with `$1`, `$2`, and so on.
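How the four columns interact can be sketched as a small Python loop: disabled rules are skipped, the language pattern gates each rule, and the remaining rules run in listed order. The Rule class, apply_rules function, and sample rules are hypothetical, and Python writes backreferences as \1 where the grid uses $1.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class Rule:
    on: bool
    language_pattern: str  # empty string = all target languages
    search: str
    replacement: str       # Python-style backreference (\1), not $1

def apply_rules(text: str, target_language: str, rules: List[Rule]) -> str:
    for rule in rules:  # sequential, in the order listed
        if not rule.on:
            continue
        if rule.language_pattern and not re.search(rule.language_pattern, target_language):
            continue
        text = re.sub(rule.search, rule.replacement, text)
    return text

rules = [
    Rule(True, "", r"<br\s*/?>", "<br/>"),
    Rule(True, r"^(ar|he|fa|ur)", r"(<body\b)", r'\1 class="rtl"'),
    Rule(False, "", r"text", "REPLACED"),  # switched off, so never applied
]
print(apply_rules("<body><br>text</body>", "ar-SA", rules))
# <body class="rtl"><br/>text</body>
print(apply_rules("<body><br>text</body>", "fr-FR", rules))
# <body><br/>text</body>
```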

To add a post-processing rule:

Click Edit in the upper right corner.
Add a new row to the Post-processing rules grid.
Set On to Yes.
Optionally enter a Language pattern to limit the rule to specific target languages.
Enter the Search regex to match text in the output file.
Enter the Replacement text.
Click Save to apply the configuration.

When Rules Are Applied

Rules run on the fully reconstructed output file, so they can target any part of the document — including CSS declarations inside <style> blocks, inline attributes, or text content. They are applied both when generating a preview download and when creating a delivery.

Note

Post-processing runs after translation and reconstruction. It does not affect the content presented to translators in the Editor, only the final output file.

Example: Right-to-Left (RTL) Output for Arabic and Hebrew

When an HTML file is translated from a left-to-right source language (such as English) into a right-to-left language, the output often needs two kinds of adjustment:

The <html> tag should include dir="rtl" and a matching lang attribute.
Explicit direction: ltr and text-align: left CSS declarations should be flipped to their RTL equivalents.

The ruleset below is a starting point for Arabic, Hebrew, Farsi, and Urdu output. It is not a ready-to-use solution: each target document has its own markup and CSS structure, and the rules may need to be adapted, extended, or removed depending on the template you are translating.

On	Language pattern	Search regex	Replacement
Yes	`^(ar|he|fa|ur)`	`(<html\b)(?![^>]*\sdir\s*=)`	`$1 dir="rtl"`
Yes	`^(ar|he|fa|ur)`	`(<html\b[^>]*?)\slang\s*=\s*["'][^"']*["']`	`$1 lang="ar"`
Yes	`^(ar|he|fa|ur)`	`(<html\b)(?![^>]*\slang\s*=)`	`$1 lang="ar"`
Yes	`^(ar|he|fa|ur)`	`direction\s*:\s*ltr`	`direction:rtl`
Yes	`^(ar|he|fa|ur)`	`text-align\s*:\s*left\s*;`	`text-align:right;`
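Before saving rules like these, it can help to dry-run them on a sample file. The Python sketch below does that under two assumptions: the search patterns are read with `*` quantifiers after `[^>]`, `\s`, and so on (for example `[^>]*` and `\s*`), and `$1` is rewritten as Python's `\1`. The to_rtl helper and the sample markup are made up for the example.

```python
import re

# Assumed reading of the five rules in the table, in Python regex syntax.
RTL_RULES = [
    (r"(<html\b)(?![^>]*\sdir\s*=)", r'\1 dir="rtl"'),
    (r"""(<html\b[^>]*?)\slang\s*=\s*["'][^"']*["']""", r'\1 lang="ar"'),
    (r"(<html\b)(?![^>]*\slang\s*=)", r'\1 lang="ar"'),
    (r"direction\s*:\s*ltr", "direction:rtl"),
    (r"text-align\s*:\s*left\s*;", "text-align:right;"),
]

def to_rtl(html: str) -> str:
    """Apply the rules one after another, in table order."""
    for search, replacement in RTL_RULES:
        html = re.sub(search, replacement, html)
    return html

source = ('<html lang="en"><head><style>'
          'p { direction: ltr; text-align: left; }'
          '</style></head></html>')
print(to_rtl(source))
# <html dir="rtl" lang="ar"><head><style>p { direction:rtl; text-align:right; }</style></head></html>
```

As written, the two lang rules emit lang="ar" for every language the pattern matches, so a Hebrew, Farsi, or Urdu target would need its own replacement value.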

Warning

Always verify the generated output against your actual source files before using post-processing rules in production. Regex replacements run against the entire document, so overly broad patterns can produce unintended changes. Treat the ruleset above as an example to adapt, not as a finished configuration.

Learn More

Web Pages — overview and supported file extensions
Web Page Questions and Answers — common configuration scenarios
HTML Content Configurations — configuring HTML extraction for non-HTML file formats (XLIFF, CSV)
Working with file formats — general guidance on file format configurations