Regular expressions in Natural Language Processing

Units : TRADITAL : Research centre for Translation, interpretation, language didactics, and natural language processing | ULB778



Description :


A regular expression is a character string or formula (also known as a pattern) that represents a class of character strings. For
example, the expression “\d{2}-\d{2}-\d{4}” represents any date of the type 11-05-2024 or 07-12-2023. The much more complex
expression “^[^\s@]+@[^\s@]+\.[^\s@]+$” can be used to check that a string has the basic shape of an e-mail address. These regular
expressions, often abbreviated as REGEX, are available in many tools used in translation and NLP, from the most common, such as word
processors, to the most sophisticated, such as translation memory systems or corpus tools, and are, of course, supported by
programming languages like Python.
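
As a minimal sketch, assuming Python's standard re module, the date and e-mail patterns can be written as raw strings so that the
backslashes in \d (digit) and \s (whitespace) reach the regex engine intact. The function names looks_like_date and
looks_like_email are hypothetical and chosen only for this illustration. The e-mail pattern checks the general shape of an
address, not whether the address actually exists.

```python
import re

# Raw strings (r"...") keep the backslashes in \d, \s and \. intact.
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{4}")
EMAIL_PATTERN = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")


def looks_like_date(text: str) -> bool:
    """Return True if the whole string is two digits, two digits, four digits."""
    # Shape check only: "99-99-9999" also passes, since no calendar logic is applied.
    return DATE_PATTERN.fullmatch(text) is not None


def looks_like_email(text: str) -> bool:
    """Return True if the whole string looks like something@domain.tld."""
    return EMAIL_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["11-05-2024", "1-5-2024"]:
        print(sample, looks_like_date(sample))
    for sample in ["name@example.org", "name@example"]:
        print(sample, looks_like_email(sample))
```

Using fullmatch rather than search means the whole string has to fit the pattern, which is usually what a validity check needs.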

Often ignored, poorly mastered or under-utilized, they facilitate the editor's task, whether in the word processor or in the
translation environment, for example by transforming dates such as those above or by checking the presence or absence of spaces
(possibly non-breaking spaces) in front of a number. In web programming, they are regularly used to check the conformity of an
e-mail address or to filter out potentially fraudulent e-mail addresses or URLs. In NLP, they can be used to perform advanced searches
(and replacements) on both forms (“all French adverbs in -ment or -mment”) and syntactic structures (“all French noun
phrases where two adjectives precede the noun”). They thus form one of the foundations of automatic language processing. In
translation, they can be used to parameterize the translation memory system to handle non-standard files. They can also be used to
transform a file from one format to another, to hide all untranslatable strings at once, or to extract relevant information. Properly
used, they spare the translator tedious, time-consuming file preparation, revision and ad hoc modifications.
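
Three of the uses mentioned above can be sketched in Python with the standard re module. This is an illustration under stated
assumptions, not a description of any particular tool: the function names are hypothetical, the dates are assumed to be in
day-month-year order, and the search for forms in -ment is purely orthographic, so it also returns nouns such as “gouvernement”
unless the text has been part-of-speech tagged beforehand.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def to_iso_dates(text: str) -> str:
    """Rewrite DD-MM-YYYY as YYYY-MM-DD (day-month-year order is an assumption)."""
    # Three capture groups, reordered in the replacement string.
    return re.sub(r"\b(\d{2})-(\d{2})-(\d{4})\b", r"\3-\2-\1", text)


def protect_space_before_number(text: str) -> str:
    """Replace an ordinary space between a word character and a digit with a non-breaking space."""
    # Look-behind and look-ahead test the context without consuming it.
    return re.sub(r"(?<=\w) (?=\d)", NBSP, text)


def find_ment_forms(text: str) -> list[str]:
    """Return every word ending in -ment (this also covers forms in -mment)."""
    return re.findall(r"\b\w+ment\b", text)


if __name__ == "__main__":
    print(to_iso_dates("Delivered on 11-05-2024."))
    print(repr(protect_space_before_number("page 12")))
    print(find_ment_forms("Elle parle lentement et évidemment trop vite."))
```

Searching for syntactic structures, such as two adjectives preceding a noun, would normally be run over tagged text rather than raw
text, which is why it is left out of this sketch.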

The aim of this research is to describe the uses of REGEX in the various activities linked to translation (translation proper,
file preparation, revision, corpus management, Python programming for NLP) and in different tools (text processing and editing
software, translation memories, corpora, Python).

It also aims to provide translation and NLP professionals with an overview of REGEX syntax (and its variants) and its use in the
various tools at their disposal. This description is intended to be practical, educational and rich in concrete examples.

List of persons in charge :


  • MERTEN Pascaline