# Data Lakes

## Overview

Data Lakes are document repositories optimized for AI retrieval. They store, process, and index your content using embedding models, enabling RAG (Retrieval-Augmented Generation) capabilities in your chatbots.

## Key Concepts

### Embeddings

Mathematical representations of text that capture semantic meaning. Similar concepts have similar embeddings, enabling semantic search.

### Chunking

Documents are split into smaller pieces (chunks) for processing. Chunk size and overlap affect retrieval quality:

- Smaller chunks: More precise retrieval, may lose context
- Larger chunks: Better context preservation, less precise
- Overlap: Ensures continuity between consecutive chunks

### RAG (Retrieval-Augmented Generation)

A technique where AI retrieves relevant information from your documents before generating responses, enabling accurate answers based on your specific content.

## Creating a Data Lake

1) Click New Data Lake
2) Enter a name
3) Select an embedding model (determines how content is vectorized)
4) Configure chunking:
   - Chunk size: 800-1600 characters (use slider)
   - Overlap: 100-400 characters (use slider)
   - Watch the visual preview to understand the effect
5) Add optional metadata:
   - Category name
   - Group name
   - Tags
   - Security level
6) Choose content source:
   - File Upload: Drag and drop or use file picker
   - Web Crawler: Enter URL, set depth and max pages
7) Click Create Data Lake

## Adding Content

To add documents to an existing Data Lake:

1) Click Add Documents on the Data Lake card
2) Choose your method:
   - File upload: Drag and drop or browse for files
   - Video URL: Paste YouTube or direct video links (transcripts are extracted)
3) Click Upload

### Supported Formats

- PDF documents
- Word documents (.doc, .docx)
- Text files (.txt)
- YouTube videos (transcript extraction)
- And more

## Document Groups

Organize content within Data Lakes using groups:

- Click Add Groups to attach existing document groups
- View file counts per group on the badges
- Click the X on any group badge to remove it

## Monitoring Ingestion

During document processing:

- Status badges update in real time
- Progress bars show completion percentage
- The page auto-refreshes during active ingestion
- Click View Details for comprehensive management options

## Chunking Best Practices

| Use Case | Chunk Size (characters) | Overlap (characters) |
| ----------------------- | ---------- | ------- |
| Technical documentation | 1200-1600 | 200-300 |
| Q&A / FAQ content | 800-1000 | 100-150 |
| Long-form articles | 1400-1600 | 300-400 |
| Mixed content | 1000-1200 | 200 |

---

Related: Projects | Permission Levels | Chatbots | Deidentification
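
To make the chunk size and overlap settings from Chunking Best Practices more concrete, here is a minimal Python sketch of character-based chunking with overlap. It is an illustration only, not the product's actual implementation: the function name `chunk_text` and its parameters are hypothetical, and a real ingestion pipeline may split on sentence or token boundaries instead of raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper for illustration; sizes are measured in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one,
    # so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: a 2500-character document with "Mixed content" style settings.
document = "x" * 2500
pieces = chunk_text(document, chunk_size=1000, overlap=200)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

With these settings each chunk starts 800 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next. Raising the overlap produces more chunks for the same document, which is the trade-off between continuity and index size that the table reflects.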