Project Link: https://dev.azure.com/clarity-ventures/AI-Projects/_git/AI-DataFormatter
At Clarity Ventures, we understand the importance of clean, consistent, and well-structured technical documentation. As our internal Nuclino has grown in complexity and scale, ensuring the quality and coherence of the content has become increasingly challenging. To address this, we've developed AI-DataFormatter, a data processing tool that intelligently formats and prepares our wiki data for training a custom chatbot. This article explores the key features and technical implementation of AI-DataFormatter.
Processing large volumes of technical documentation requires significant computational resources. AI-DataFormatter addresses this challenge by leveraging parallel processing across multiple GPUs. The system is built around a cluster of vLLM servers, each associated with a specific GPU. When a processing request is received, AI-DataFormatter intelligently distributes the workload across these servers, ensuring optimal utilization of computational resources.
The parallel processing workflow follows these steps:

Figure 1: Sample response and parsing
Maintaining contextual coherence when splitting large documents into smaller chunks is a key challenge. Naively splitting by a fixed word count or character limit can lead to fragmented sentences, broken code snippets, and loss of semantic continuity. AI-DataFormatter addresses this with an intelligent text chunking algorithm that considers the semantic structure of the document:
MaxChunkSize, MinChunkSize, and tuning OverlapSize for better coherence between chunks. To maintain context between chunks, a certain amount of overlap is retained from the previous chunk, chosen to be a complete semantic unit such as a full sentence or code block.To improve the chatbot's knowledge and adaptability, AI-DataFormatter automatically generates diverse question-answer pairs from the cleaned wiki content. It creates various question types, allows customization of generation parameters, integrates with the data pipeline to populate the QA database, and utilizes concurrent generation for efficiency.
For example, from this cleansed text:
"AI-DataFormatter is a powerful tool designed to format and prepare company data for training a custom chatbot. It leverages intelligent text processing techniques, parallel computing, and seamless database integration to ensure the chatbot is trained on high-quality, contextually relevant data. The modular architecture and use of standard libraries and frameworks make the tool maintainable, scalable, and easily extensible."
The module could generate these synthetic QA pairs:
What does AI-DataFormatter do? Formats company data for chatbot training using intelligent techniques and parallel computing.
How does AI-DataFormatter ensure high-quality chatbot training data? Leverages text processing, parallel computing and database integration for contextually relevant data.
What makes AI-DataFormatter maintainable and scalable? Modular architecture and use of standard libraries and frameworks.
Exposing the chatbot to a wide range of potential queries during training through synthetic data generation significantly improves its ability to provide relevant, context-aware responses without manual effort. This intelligent approach to optimizing data preparation is a key innovation of AI-DataFormatter.
To further enhance the quality and accessibility of our documentation, AI-DataFormatter includes Multimedia Content Processing. This feature automatically detects and processes images embedded in markdown files, converting them into detailed textual descriptions using vision models like MiniCPM-V. This ensures that visual content is accurately represented and accessible, improving the overall user experience.
By integrating secure Nuclino API access and vision model processing, AI-DataFormatter preserves the context of images, generates automatic alt text when needed, and enriches documentation with meaningful image descriptions. This advancement makes it easier for the chatbot to understand and relay visual information, providing users with more complete and context-aware responses.
For instance, when processing the image: clarity-logo.png
The generated description is:
"The image features a logo that is predominantly in shades of blue and black. The central element of the logo is an uppercase letter "A", which is stylized to resemble a mountain peak. Atop this "A" stands a silhouette of a person, suggesting themes of achievement or reaching new heights. To the right of the "A", there's a smaller triangle, also in blue, adding balance to the design. Below these graphic elements, the word "CLARITY" is written in lowercase letters. The overall impression is one of ascension, clarity, and perhaps a journey towards enlightenment or success."
We provide a specific prompt while dissecting the cleaning rules and JSON result format, identifying technical elements to preserve, and specifying content cleanup and formatting rules. This involves analyzing prompt/completion examples, iterating on prompt design for optimal results, fine-tuning vLLM parameters such as temperature and top_p, and evaluating the impact of prompt changes on real datasets.
AI-DataFormatter incorporates a robust error handling mechanism to deal with potential issues during processing. Key components include:

Figure 2: Console logging
We see in this figure how the tool keeps track of everything during processing. It logs real-time updates like how many items were processed, success rates, and any skipped items. You can also see how it splits large tasks into chunks, uses GPUs for speed, and handles errors if something goes wrong. It’s a clear way to monitor progress and quickly spot issues, making the whole process smooth and efficient.
As we continue to develop AI-DataFormatter, we will focus on the following key areas:
AI-DataFormatter is a powerful and efficient tool designed to format and prepare our company's Nuclino data for training a custom chatbot. By leveraging intelligent text processing techniques, parallel computing, and seamless database integration, it ensures that the chatbot is trained on high-quality, contextually relevant data. The modular architecture and use of standard libraries and frameworks make the tool maintainable, scalable, and easily extensible.
Developing AI-DataFormatter in-house represents a significant step towards building a highly accurate and efficient chatbot that can aid employees in accessing and navigating the vast knowledge stored in our company wiki. This tool lays the foundation for more advanced AI applications and showcases our commitment to leveraging cutting-edge technologies to drive productivity and innovation.