How CAT Tools Divide a Text into Segments

As a rule, one segment = one sentence. However, CAT tools give priority to two invisible non-printable characters.

When you upload a file to a CAT tool, the system extracts the text, divides it into segments, and prepares the project for translation. Note that the CAT software can split one sentence into two or more segments. This complicates the work of translators, editors, and proofreaders, and incorrect segmentation often harms translation quality.

The CAT tool divides the text into segments based on punctuation marks and non-printable characters.

As a rule, one segment = one sentence. The CAT tool treats five punctuation marks as the end of a sentence:

  • Period “.”
  • Exclamation mark “!”
  • Question mark “?”
  • Colon “:”
  • Semicolon “;”
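
The punctuation rule can be illustrated with a minimal Python sketch. It assumes a simplified rule: a segment ends after one of the five marks when whitespace follows it. The function name `split_by_punctuation` and the sample text are made up for illustration. Real CAT tools apply more elaborate rules, so treat this as a rough model only.

```python
import re

# Simplified, hypothetical rule: a segment ends after ., !, ?, : or ;
# when the mark is followed by whitespace.
END_OF_SEGMENT = re.compile(r"(?<=[.!?:;])\s+")

def split_by_punctuation(text):
    """Split text into segments after the five end-of-sentence marks."""
    return [part for part in END_OF_SEGMENT.split(text.strip()) if part]

sample = "Open the file. Is it ready? Yes; start translating!"
segments = split_by_punctuation(sample)

for number, segment in enumerate(segments, start=1):
    print(number, segment)
```

In this sketch the sample yields four segments, because the semicolon ends a segment just as the period does.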

However, CAT tools (also called CAT programs) give priority to two invisible non-printable characters, which also divide the text into segments:

  • Paragraph mark “¶”
  • Line break (carriage return arrow) “↵”
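
The priority of the two non-printable characters can be sketched as a two-step split. This is a rough Python model, not the algorithm of any particular CAT tool. It assumes that `"\n"` stands in for the paragraph mark and `"\x0b"` for the manual line break; the function name `segment` is hypothetical.

```python
import re

# Assumptions for this sketch: "\n" represents a paragraph mark and
# "\x0b" (vertical tab) represents a manual line break.
BREAK_CHARS = re.compile(r"[\n\x0b]")
END_OF_SEGMENT = re.compile(r"(?<=[.!?:;])\s+")

def segment(text):
    """Split on break characters first, then on end punctuation."""
    segments = []
    # Step 1: a break character always ends a segment, even mid-sentence.
    for chunk in BREAK_CHARS.split(text):
        # Step 2: inside each chunk, split after end-of-sentence marks.
        for part in END_OF_SEGMENT.split(chunk.strip()):
            if part:
                segments.append(part)
    return segments

# One sentence interrupted by a paragraph mark, followed by a second one.
broken = "The file is ready\nfor translation. Upload it."
result = segment(broken)
print(result)
```

Here the first sentence ends up in two segments, "The file is ready" and "for translation.", although it contains no end-of-sentence mark at the point of the split.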

When the display of non-printable characters is turned off, the text layout looks correct, and the file seems ready for translation. Once we turn the display on, the picture may change: we might find paragraph marks or line breaks inside a sentence. The CAT tool will split such a sentence into two or more separate segments. This often leads to incorrect translation, because the translator cannot tell how many segments the sentence was split into, which segments belong to it, or in what order they follow. The paragraph and line break characters are shown in blue in the screenshot:

The text in the example is the result of automatic recognition in FineReader. The CAT tool will create a separate segment for each line, although the screenshot contains only three sentences. The result is nine segments instead of three. Such fragments are hard to translate in a way that yields a correct translation of the original sentence.
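
The nine-versus-three effect can be reproduced with a short Python sketch. The sample text below is invented for illustration and is not the text from the screenshot. The counting rule is the same simplified assumption: line breaks split first, then end-of-sentence punctuation.

```python
import re

END_OF_SEGMENT = re.compile(r"(?<=[.!?:;])\s+")

def count_segments(text):
    """Count segments: line breaks split first, then end punctuation."""
    count = 0
    for line in text.split("\n"):
        parts = END_OF_SEGMENT.split(line.strip())
        count += sum(1 for part in parts if part)
    return count

# Hypothetical OCR output: three sentences spread over nine lines.
ocr_text = "\n".join([
    "The contractor shall",
    "deliver the goods within",
    "thirty days.",
    "Payment is due",
    "after the customer",
    "signs the acceptance act.",
    "Either party may",
    "terminate the agreement",
    "in writing.",
])

# The same text with the in-sentence line breaks replaced by spaces.
clean_text = ocr_text.replace("\n", " ")

with_breaks = count_segments(ocr_text)
without_breaks = count_segments(clean_text)
print(with_breaks, without_breaks)
```

With the line breaks in place the sketch counts nine segments; with the breaks replaced by spaces it counts three, one per sentence.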

Not sure which files are ready for translation and which ones require recognition? Read the article about file formats for translation.

Yevhen Venherenko
