Text processing functions

Text processing functions make text data cleaner and more semantically consistent for vector embedding. They keep the words that carry the meaning of the text and reduce variability, which helps natural language processing (NLP). In retrieval-augmented generation (RAG) use cases, text processing ensures that text is clean, consistent, and easily comparable to user queries.

Text processing functions can clean text by removing noise such as whitespace and diacritics, and they can convert text to a standard format by lemmatizing words to their base forms.

You can use the following text processing functions:

Cleanse text: Cleanses the text by removing redundant whitespace and sets of dots and by converting letters to lowercase.
Remove diacritics: Removes diacritics including accents and other marks that change a letter's pronunciation. For example, café becomes cafe.
Check spelling: Checks for spelling errors based on the context of the data and corrects them.
Lemmatize: Converts words to their base form. For example, better becomes good and running becomes run. Lemmatization preserves the semantic accuracy of the data, so it's useful for sentiment analysis and machine translation.
Remove stop words: Removes common stop words like pronouns, articles, prepositions, and conjunctions. For example, This is a sample text becomes sample text.
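
To make the first two functions in the list concrete, the following is a minimal Python sketch of what cleansing text and removing diacritics could look like. It is not the product's implementation. The function names cleanse_text and remove_diacritics are hypothetical, and treating "sets of dots" as runs of two or more dots is an assumption made for illustration.

```python
import re
import unicodedata


def cleanse_text(text: str) -> str:
    """Drop runs of dots, collapse whitespace, and lowercase the text.

    Assumption: "sets of dots" means two or more consecutive dots.
    """
    text = re.sub(r"\.{2,}", " ", text)  # replace each run of dots with a space
    text = re.sub(r"\s+", " ", text)     # collapse redundant whitespace
    return text.strip().lower()


def remove_diacritics(text: str) -> str:
    """Strip combining marks such as accents from each letter."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("café"))               # cafe
print(cleanse_text("  Hello...   World  "))    # hello world
```

Checking spelling and lemmatizing usually depend on dictionaries or language models, so they are left out of this sketch.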

Converting words to lowercase and removing stop words are simple, effective ways to reduce data complexity, and they apply to most NLP tasks.
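
As a rough illustration of lowercasing combined with stop word removal, the sketch below filters tokens against a small stop-word set. The STOP_WORDS set and the remove_stop_words function are hypothetical and deliberately tiny. A real stop-word list is much longer and language-specific, and the list that the product uses is not shown here.

```python
import re

# Illustrative stop-word set only (assumption); real lists are far longer.
STOP_WORDS = {
    "a", "an", "the", "this", "that", "is", "are",
    "of", "in", "on", "and", "or", "to",
}


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(remove_stop_words("This is a sample text"))  # sample text
```

Lowercasing first means that This and this match the same stop-word entry, which is why the two steps are often applied together.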