NVIDIA has unveiled a blueprint for building an enterprise-scale multimodal document retrieval pipeline. The blueprint uses the company’s NeMo Retriever and NIM microservices to help businesses extract and use large volumes of data from complex documents, according to the NVIDIA Technical Blog.
Harnessing Untapped Data
Every year, trillions of PDF files are generated, containing a wealth of information in various formats such as text, images, charts, and tables. Traditionally, extracting meaningful data from these documents has been a labor-intensive process. However, with the advent of generative AI and retrieval-augmented generation (RAG), this untapped data can now be efficiently utilized to uncover valuable business insights, thereby enhancing employee productivity and reducing operational costs.
The multimodal PDF data extraction blueprint introduced by NVIDIA combines the power of the NeMo Retriever and NIM microservices with reference code and documentation. This combination allows for accurate extraction of knowledge from massive volumes of enterprise data, enabling employees to make informed decisions swiftly.
Building the Pipeline
The process of building a multimodal retrieval pipeline on PDFs involves two key steps: ingesting documents with multimodal data and retrieving relevant context based on user queries.
Ingesting Documents
Ingestion begins by parsing PDFs to separate the different modalities: text, images, charts, and tables. Text is parsed as structured JSON, while pages are rendered as images. Textual metadata is then extracted from these images using several NIM microservices:
- nv-yolox-structured-image: Detects charts, plots, and tables in PDFs.
- DePlot: Generates descriptions of charts.
- CACHED: Identifies various elements in graphs.
- PaddleOCR: Transcribes text from tables and charts.
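The division of labor in the list above can be sketched as a small routing table. This is only an illustration: the element labels ("chart", "table"), the `DetectedElement` type, and the `plan_extraction` helper are hypothetical names, and no NIM endpoint is actually called.

```python
from dataclasses import dataclass

# Which extractors handle which detected element type, following the list above.
# The labels "chart" and "table" are assumed for illustration only.
EXTRACTORS_BY_TYPE = {
    "chart": ["DePlot", "CACHED", "PaddleOCR"],
    "table": ["PaddleOCR"],
}


@dataclass
class DetectedElement:
    page: int
    kind: str  # label assumed to come from the detection model, e.g. "chart"


def plan_extraction(elements):
    """Return (element, extractor_name) pairs in detection order."""
    plan = []
    for element in elements:
        # Element types with no registered extractor are skipped.
        for name in EXTRACTORS_BY_TYPE.get(element.kind, []):
            plan.append((element, name))
    return plan


if __name__ == "__main__":
    detected = [DetectedElement(1, "chart"), DetectedElement(2, "table")]
    for element, name in plan_extraction(detected):
        print(f"page {element.page}: {element.kind} -> {name}")
```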
The extracted information is then filtered, chunked, and stored in a vector store. The NeMo Retriever embedding NIM microservice converts the chunks into embeddings for efficient retrieval.
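As a rough illustration of the filter, chunk, embed, and store sequence, here is a minimal sketch in plain Python. It does not use the NeMo Retriever embedding microservice: `toy_embed` is a stand-in that builds a normalized term-frequency vector, the in-memory list stands in for a real vector store, and the chunk size, overlap, and minimum-length filter are arbitrary assumptions.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding service: L2-normalized term-frequency vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def chunk_words(text, size=40, overlap=10):
    """Split text into word windows of `size` words that overlap by `overlap`."""
    words = text.split()
    step = size - overlap  # assumes size > overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def ingest(extracted_texts, min_words=3):
    """Filter out very short fragments, then chunk, embed, and store the rest."""
    store = []  # in-memory stand-in for a vector store
    for text in extracted_texts:
        if len(text.split()) < min_words:
            continue
        for chunk in chunk_words(text):
            store.append({"text": chunk, "embedding": toy_embed(chunk)})
    return store


if __name__ == "__main__":
    records = ingest(["too short", "revenue grew in the third quarter of the year"])
    print(len(records), "chunk(s) stored")
```

Overlapping windows are one common way to avoid cutting a sentence's context at a chunk boundary; the article does not say which chunking strategy the blueprint uses.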
Retrieving Relevant Context
When a user submits a query, the NeMo Retriever embedding NIM microservice embeds the query and retrieves the most relevant chunks using vector similarity search. The NeMo Retriever reranking NIM microservice then refines the results to ensure accuracy. Finally, the LLM NIM microservice generates a contextually relevant response.
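The query path described above (embed, similarity search, rerank, generate) can be sketched the same way. Everything here is a stand-in under stated assumptions: `toy_embed` replaces the embedding microservice, `rerank` uses simple query-term overlap rather than a reranking model, and `build_prompt` only assembles the text that would be passed to an LLM; no NVIDIA API is called.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text):
    """Stand-in for an embedding service: L2-normalized term-frequency vector."""
    counts = Counter(tokens(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def retrieve(query, chunks, k=3):
    """Vector similarity search: return the k chunks closest to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def rerank(query, candidates, top_n=2):
    """Stand-in reranker: score by the fraction of query terms a chunk contains."""
    q = set(tokens(query))

    def score(chunk):
        return len(q & set(tokens(chunk))) / len(q) if q else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_n]


def build_prompt(query, context):
    """Assemble the text that would be sent to the LLM for generation."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    chunks = [
        "The chart shows quarterly revenue by region.",
        "Employee onboarding checklist and badge policy.",
        "Table 2 lists revenue and operating costs per quarter.",
    ]
    query = "quarterly revenue chart"
    candidates = retrieve(query, chunks, k=2)
    print(build_prompt(query, rerank(query, candidates, top_n=1)))
```

Retrieving a wider candidate set and then reranking a smaller one is the usual reason for the two-stage design: the first stage is cheap, and the second spends more effort on fewer items.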
Cost-Effective and Scalable
NVIDIA’s blueprint offers significant benefits in terms of cost and stability. The NIM microservices are designed for ease of use and scalability, allowing enterprise application developers to focus on application logic rather than infrastructure. These microservices are containerized solutions that come with industry-standard APIs and Helm charts for easy deployment.
Moreover, the full suite of NVIDIA AI Enterprise software accelerates model inference, maximizing the value enterprises derive from their models and reducing deployment costs. Performance tests have shown significant improvements in retrieval accuracy and ingestion throughput when using NIM microservices compared to open-source alternatives.
Collaborations and Partnerships
NVIDIA is partnering with several data and storage platform providers, including Box, Cloudera, Cohesity, DataStax, Dropbox, and Nexla, to enhance the capabilities of the multimodal document retrieval pipeline.
Cloudera
Cloudera’s integration of NVIDIA NIM microservices in its AI Inference service aims to combine the exabytes of private data managed in Cloudera with high-performance models for RAG use cases, offering best-in-class AI platform capabilities for enterprises.
Cohesity
Cohesity’s collaboration with NVIDIA aims to add generative AI intelligence to customers’ data backups and archives, enabling quick and accurate extraction of valuable insights from millions of documents.
DataStax
DataStax aims to leverage NVIDIA’s NeMo Retriever data extraction workflow for PDFs to enable customers to focus on innovation rather than data integration challenges.
Dropbox
Dropbox is evaluating the NeMo Retriever multimodal PDF extraction workflow to potentially bring new generative AI capabilities to help customers unlock insights across their cloud content.
Nexla
Nexla aims to integrate NVIDIA NIM in its no-code/low-code platform for Document ETL, enabling scalable multimodal ingestion across various enterprise systems.
Getting Started
Developers interested in building a RAG application can experience the multimodal PDF extraction workflow through NVIDIA’s interactive demo available in the NVIDIA API Catalog. Early access to the workflow blueprint, along with open-source code and deployment instructions, is also available.