In advanced mode, the Source transformation can read text from PDF files.
The Source transformation extracts the full structure of the document, including text, tables, headings, and metadata. You can extract text from documents with different layouts, such as invoices and reports, while preserving the order of the text.
To read a PDF, select Document on the Source tab. Data Integration sets the input type to PDF automatically.
To read a directory of PDFs, change the Source Type in the advanced properties to Directory. For the File Name Override, enter *.pdf.
The Fields tab displays fields to store the text, file path, file type, and file name for each PDF.
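As a rough illustration of what a *.pdf file name override selects and what one row per PDF might look like, here is a minimal Python sketch. It is not Data Integration code. The function name list_pdf_records, the dictionary keys, and the case-sensitive matching are assumptions made for this example, and the text field is left empty because the sketch does not parse PDFs.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def list_pdf_records(directory, pattern="*.pdf"):
    """Return one record per file in `directory` whose name matches `pattern`.

    Hypothetical helper: the keys mirror the four fields described above
    (text, file path, file type, file name), but the names are illustrative.
    """
    records = []
    for path in sorted(Path(directory).iterdir()):
        # Case-sensitive wildcard match on the file name only (an assumption).
        if path.is_file() and fnmatchcase(path.name, pattern):
            records.append({
                "text": None,  # placeholder: extracted text is out of scope here
                "file_path": str(path),
                "file_type": path.suffix.lstrip(".").lower(),
                "file_name": path.name,
            })
    return records
```

Calling list_pdf_records on a folder that contains a.pdf, b.txt, and c.pdf returns two records, one for each PDF, sorted by file name.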
You can pass the text to downstream Chunking and Vector Embedding transformations to build a retrieval-augmented generation (RAG) ingestion pipeline. Alternatively, you can process the text, create structured data from it, and write the result to a JSON file.
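To make the downstream steps concrete, the following Python sketch splits extracted text into overlapping chunks and serializes the result as JSON. It is only an illustration under stated assumptions: it uses fixed-size character chunks, which may differ from how the Chunking transformation splits text, and the names chunk_text, records_to_json, chunk_size, and overlap are hypothetical.

```python
import json


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # so no redundant tail chunk is produced.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def records_to_json(records, chunk_size=200, overlap=50):
    """Turn per-file records into one JSON row per chunk.

    Each input record is assumed to be a dict with "file_name" and "text" keys.
    """
    rows = []
    for record in records:
        for index, chunk in enumerate(chunk_text(record["text"], chunk_size, overlap)):
            rows.append({
                "file_name": record["file_name"],
                "chunk_index": index,
                "text": chunk,
            })
    return json.dumps(rows, indent=2)
```

The overlap repeats the end of each chunk at the start of the next one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk, provided it is no longer than the overlap. An embedding step would then take each row's text as input.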