Text processing functions

Text processing functions make text data cleaner and more semantically consistent for vector embedding. They keep the words that carry the meaning of the text and reduce surface variability that complicates natural language processing (NLP). In retrieval-augmented generation (RAG) use cases, text processing ensures that text is clean, consistent, and easily comparable to user queries.
Text processing functions can clean text by removing noise such as whitespace and diacritics, and they can convert text to a standard format by lemmatizing words to their base forms.
You can use the following text processing functions:
Cleanse text
Cleanses the text by removing redundant whitespace and sequences of dots and by converting letters to lowercase.
Remove diacritics
Removes diacritics, including accents and other marks that change a letter's pronunciation. For example, café becomes cafe.
Check spelling
Checks for spelling errors based on the context of the data and corrects them.
Lemmatize
Converts words to their base form. For example, better becomes good and running becomes run.
Lemmatization preserves the semantic accuracy of the data, so it's useful for sentiment analysis and machine translation.
Remove stop words
Removes common stop words like pronouns, articles, prepositions, and conjunctions. For example, "This is a sample text" becomes "sample text".
Converting words to lowercase and removing stop words are simple and effective ways to reduce data complexity, and they apply to most NLP tasks.
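
To make the behavior of these functions more concrete, the following Python sketch approximates three of them (cleanse text, remove diacritics, and remove stop words) using only the standard library. It is an illustration, not the product's implementation: the function names, the tiny STOP_WORDS set, and the choice to replace a sequence of dots with a single space are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately small stop-word list for illustration only.
# A real stop-word list is much larger and language-specific.
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "are", "and", "or", "of", "in", "to"}


def cleanse_text(text: str) -> str:
    """Drop sequences of dots, collapse whitespace, and lowercase the text."""
    # Assumption: a sequence of two or more dots is replaced with one space.
    text = re.sub(r"\.{2,}", " ", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def remove_diacritics(text: str) -> str:
    """Strip combining marks so that, for example, 'café' becomes 'cafe'."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def remove_stop_words(text: str) -> str:
    """Remove words found in STOP_WORDS, comparing case-insensitively."""
    return " ".join(word for word in text.split() if word.lower() not in STOP_WORDS)


def preprocess(text: str) -> str:
    """Apply the three steps in order: cleanse, remove diacritics, remove stop words."""
    return remove_stop_words(remove_diacritics(cleanse_text(text)))


if __name__ == "__main__":
    print(preprocess("This   is a sample text"))
    print(remove_diacritics("café"))
```

The order matters in this sketch: cleansing first means the later steps see lowercase, single-spaced text, so the stop-word comparison can rely on simple whitespace splitting. Spell checking and lemmatization are omitted because they depend on language models or dictionaries that the standard library does not provide.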