Initial commit

2025-12-11 14:43:16 +01:00
commit 7ef62972f3
36 changed files with 23641 additions and 0 deletions
--- a/docs/Library/_category_.json
+++ b/docs/Library/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "Library",
+  "position": 2,
+  "link": {
+    "type": "generated-index",
+    "description": "Learn about the core components of Evie's Library system"
+  }
+}
--- a/docs/Library/library_basics.md
+++ b/docs/Library/library_basics.md
@@ -0,0 +1,177 @@
+---
+id: library_basics
+title: Evie's Library Basics
+description: Understanding catalogs, documents, and document versions in Evie's Library
+sidebar_label: Library Basics
+sidebar_position: 1
+---
+
+# Evie's Library Basics: Catalogs, Documents & DocumentVersions
+
+## Overview
+Evie's Library is an intelligent information storage and retrieval system. It organizes your business information semantically, making it easily accessible and searchable. The library can handle various types of content including HTML pages, PDF documents, and other document formats, storing them in a way that preserves their meaning and context.
+
+## Library Structure
+The library is organized into sections called Catalogs, which help group related information together. Each catalog contains Documents, which can have multiple versions to track changes over time.
+
+```mermaid
+classDiagram
+    class Catalog {
+        +name
+        +description
+        +type (Standard/Dossier)
+        +min_chunk_size
+        +max_chunk_size
+        +user_metadata
+    }
+    class Document {
+        +name
+        +valid_from
+        +valid_to
+    }
+    class DocumentVersion {
+        +url
+        +file_type
+        +language
+        +user_context
+        +processing_status
+    }
+    
+    Catalog "1" -- "*" Document : contains
+    Document "1" -- "*" DocumentVersion : has versions
+    note for Catalog "Configurable based on type, e.g. Dossier has tagging fields"
+    note for DocumentVersion "Processed asynchronously & Generates semantic chunks"
+```
+
+### Catalogs
+A catalog is a container for related documents. You can create different catalogs to organize your information in ways that make sense for your business. Each catalog has the following key features:
+
+- **Name and Description**: Helps identify the catalog's purpose
+- **Type**: Determines how information is organized within the catalog
+- **Chunk Size Settings**: Controls how documents are processed for optimal retrieval
+- **Custom Metadata**: Allows adding business-specific information
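+
+As a rough illustration, the catalog fields shown in the diagram above could be filled in as follows for a Standard catalog. This is only a sketch: the values, the exact spelling of the `type` value, the unit of the chunk sizes, and the contents of `user_metadata` are assumptions rather than documented defaults.
+
+```json
+{
+  "name": "Product Documentation",
+  "description": "Manuals and release notes for our products",
+  "type": "Standard",
+  "min_chunk_size": 2000,
+  "max_chunk_size": 3000,
+  "user_metadata": {
+    "owner": "support-team"
+  }
+}
+```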
+
+#### Catalog Types
+
+**Standard Catalog**
+
+- Basic catalog type for general document storage
+- All stored information is treated as a unified collection
+- Suitable for most general knowledge bases
+
+**Dossier Catalog**
+
+- Advanced catalog type with tagging capabilities
+- Allows organizing documents with custom tags
+- Requires configuration of tagging fields during catalog creation
+
+Example tagging field configuration:
+
+```json
+{
+  "tagging_fields": {
+    "company": {
+      "type": "string",
+      "required": true,
+      "description": "Company name"
+    },
+    "year": {
+      "type": "integer",
+      "required": false,
+      "max_value": 2100,
+      "min_value": 1900,
+      "description": "Document year"
+    },
+    "document_type": {
+      "type": "enum",
+      "required": false,
+      "description": "Type of document",
+      "allowed_values": [
+        "quarterly_report",
+        "annual_report",
+        "presentation",
+        "press_release"
+      ]
+    },
+    "confidentiality": {
+      "type": "enum",
+      "required": false,
+      "description": "Document confidentiality level",
+      "allowed_values": [
+        "public",
+        "internal",
+        "confidential"
+      ]
+    }
+  }
+}
+```
+
+This configuration defines:
+
+- A required text field for the company name
+- An optional year field (between 1900 and 2100)
+- An optional document type selection from predefined options
+- An optional confidentiality level selection
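+
+For illustration, tag values for a single document that satisfy this configuration might look like the object below. Where and how tag values are supplied is not described here, so the object shape and the company name are assumptions; only the field names and allowed values come from the configuration above.
+
+```json
+{
+  "company": "Acme Corp",
+  "year": 2024,
+  "document_type": "quarterly_report",
+  "confidentiality": "internal"
+}
+```
+
+A document without a `company` value would not satisfy this configuration, since that field is marked as required. The same applies to a `year` outside the 1900 to 2100 range.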
+
+### Documents and Versioning
+Documents in Evie's Library are managed with version control to ensure information stays current:
+
+- **Basic Document Properties**:
+  - Name
+  - Validity period (optional)
+  - Associated metadata
+
+- **Document Versions**:
+  - Track changes in document content
+  - Store the actual content and its processing state
+  - Support multiple file formats
+  - Can be automatically updated for URL-based sources
+  - Include language information
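+
+The relationship between a document and its versions can be sketched as follows, using the field names from the class diagram above. The nesting under `versions`, the date format, and all values (including the `processing_status` value) are illustrative assumptions, not the actual storage format.
+
+```json
+{
+  "name": "Q1 2024 Quarterly Report",
+  "valid_from": "2024-04-01",
+  "valid_to": null,
+  "versions": [
+    {
+      "url": "https://example.com/reports/q1-2024.pdf",
+      "file_type": "pdf",
+      "language": "en",
+      "user_context": "Published investor report",
+      "processing_status": "processing"
+    }
+  ]
+}
+```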
+
+### Multilingual Support
+Evie's Library has built-in multilingual capabilities:
+- Documents in different languages can coexist in the same catalog
+- Information can be retrieved regardless of the language it was stored in
+- Questions can be asked in any supported language
+- No need to store multiple translations of the same document
+
+### Document Processing
+When documents are added to the library:
+
+1. They are automatically processed to understand their content
+2. Processing happens in the background without interrupting your work
+3. Documents are split into semantic chunks for optimal understanding
+4. The latest version becomes available once processing is complete
+
+### Using the Library
+To make the most of Evie's Library:
+
+1. **Organize Your Information**:
+   - Create catalogs based on your business needs
+   - Choose between Standard and Dossier catalogs based on whether you need tagging
+   - Add relevant metadata to help organize information
+
+2. **Add Documents**:
+   - Provide URLs for documents whenever possible (recommended method)
+     - URLs allow automatic document refreshing
+     - Ensures your library stays up-to-date with source changes
+     - Maintains version history automatically
+   - File uploads are supported as an alternative
+     - Use when URL access isn't available
+     - Note: Updates will require manual re-upload
+
+3. **Maintain Your Library**:
+   - URL-based documents can be automatically refreshed to stay current
+   - Monitor processing status for new additions
+   - Manage document validity periods if needed
+   - For uploaded files, consider periodically checking if updates are needed
+
+## Best Practices
+- Group related documents in the same catalog
+- Use meaningful names and descriptions for catalogs and documents
+- Add relevant metadata to make information more discoverable
+- For Dossier catalogs, establish consistent tagging conventions
+- Prefer URL-based documents over file uploads to enable automatic updates
--- a/docs/Library/processors.md
+++ b/docs/Library/processors.md
@@ -0,0 +1,230 @@
+---
+id: processors
+title: Understanding Processors
+description: Learn how processors handle different types of files in Evie's Library
+sidebar_label: Processors
+sidebar_position: 2
+---
+
+# Understanding Processors
+
+## Overview
+
+Processors are essential components in Evie's Library that determine how different types of files (HTML, PDF, Markdown,
+Docx, audio) are processed and stored. Each processor is specifically configured for a catalog and handles one or more
+file types, ensuring that content is properly extracted, chunked, and stored according to your business needs.
+
+This chapter uses the terms *chunk* and *chunking* throughout. Chunking is the process of splitting large files into
+smaller pieces for storage. It lets Evie capture the semantics of each piece more precisely and return smaller, more
+specific pieces of content in response to a question.
+
+Evie uses proprietary algorithms to process and chunk files intelligently, optimizing the knowledge stored in her Library.
+
+The relationship between Catalogs and Processors is shown in this diagram:
+
+```mermaid
+classDiagram
+    class Catalog {
+        +id: Integer
+        +name: String
+        +description: Text
+        +type: String
+        +min_chunk_size: Integer
+        +max_chunk_size: Integer
+        +user_metadata: JSON
+        +system_metadata: JSON
+        +configuration: JSON
+    }
+    class Processor {
+        +id: Integer
+        +name: String
+        +description: Text
+        +type: String
+        +sub_file_type: String
+        +tuning: Boolean
+        +user_metadata: JSON
+        +system_metadata: JSON
+        +configuration: JSON
+    }
+    
+    Catalog "1" -- "*" Processor : has
+    note for Processor "Unique per catalog\nConfigured for specific file types"
+```
+
+## Key Concepts
+
+### Processor Basics
+
+- **Catalog-Specific**: Each processor is uniquely defined for a specific catalog and cannot be shared across catalogs
+- **File Type Support**: Currently supports HTML, PDF, Markdown, Docx and audio files
+- **Optional Usage**: If no processor is specified for a file type, files of that type will not be processed
+- **File Size Limits**: Maximum file size of 50MB for all file types
+
+### Sub-File Types
+
+Sub-file types allow you to process different document types within the same catalog using different rules. For example:
+- Quarterly reports might have a different structure than project definitions
+- News articles might need different processing than product documentation
+- Internal memos might require different handling than public announcements
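+
+As a sketch, two processors in the same catalog could handle the same file type but different sub-file types. The field names follow the class diagram above; the `type` and `sub_file_type` values and the processor names are hypothetical, chosen only to illustrate the idea.
+
+```json
+[
+  {
+    "name": "Quarterly report processor",
+    "type": "HTML",
+    "sub_file_type": "quarterly_report",
+    "configuration": { "html_included_elements": "article" }
+  },
+  {
+    "name": "News article processor",
+    "type": "HTML",
+    "sub_file_type": "news_article",
+    "configuration": { "html_included_elements": "main" }
+  }
+]
+```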
+
+### General Configuration Options
+General configuration options, together with processor-specific options, let you tailor how documents are processed
+and chunked. They combine your knowledge of your documents' structure with Evie's chunking logic, so that knowledge
+is stored in the Library as effectively as possible.
+
+These general options are available in the configuration of most processors and add extra rules that govern
+where chunks are split.
+
+1. **Chunking Heading Level**
+   - Defines how many heading levels (starting from level 1) are used to split a file into meaningful chunks
+   - Default value: 2
+   - You can specify a value between 1 and 6
+2. **Chunking Patterns**
+   - Defines 'business sections' as regular expressions, allowing you to force a split whenever a section heading matches one of them
+   - A practical example: `\bProfile\b` matches any heading that contains the word 'Profile'. This is useful if you have a lot of structured content describing your customers: each customer's profile will then be placed in a separate chunk.
+   - You can specify any number of patterns, one per line
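+
+Expressed as a configuration fragment, the two general options might look like the sketch below. The key names `chunking_heading_level` and `chunking_patterns` are hypothetical (the actual keys are not given here), and the second pattern is an invented example. Note that when a regular expression is written inside a JSON string, each backslash has to be doubled.
+
+```json
+{
+  "chunking_heading_level": 2,
+  "chunking_patterns": [
+    "\\bProfile\\b",
+    "\\bContract\\b"
+  ]
+}
+```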
+
+## Types of Processors
+
+### HTML Processor
+The HTML processor provides detailed control over how HTML content is processed through various configuration options.
+These options are applied in a fixed order:
+
+1. **Included Elements** (`html_included_elements`)
+   - Defines the main sections of the document to process
+   - Only content within these elements will be considered
+   - Example: `article, main`
+
+2. **Excluded Elements** (`html_excluded_elements`)
+   - Elements to remove from included sections
+   - Useful for navigation, headers, footers, etc.
+   - Example: `header, footer, nav, script`
+
+3. **HTML Tags** (`html_tags`)
+   - Specific content tags to process
+   - Determines what content is actually stored
+   - Example: `p, h1, h2, h3, table`
+
+4. **Excluded Classes** (`html_excluded_classes`)
+   - Classes of elements to exclude
+   - Useful for removing specific sections like sidebars
+   - Example: `sidebar, advertisement`
+
+This is graphically shown in this process flow:
+
+```mermaid
+flowchart TD
+    A[HTML Document] --> B[Check Included Elements]
+    B --> C[Remove Excluded Elements]
+    C --> D[Process HTML Tags]
+    D --> E[Remove Excluded Classes]
+    
+    style A fill:#9c2d66,stroke:#333,stroke-width:2px
+    style B fill:#423372,stroke:#333,stroke-width:2px
+    style C fill:#423372,stroke:#333,stroke-width:2px
+    style D fill:#423372,stroke:#333,stroke-width:2px
+    style E fill:#423372,stroke:#333,stroke-width:2px
+
+    note1[Only process content within<br>included elements]
+    note2[Remove specified elements<br>from included content]
+    note3[Include only specified<br>HTML tags]
+    note4[Remove content with<br>excluded classes]
+
+    B --- note1
+    C --- note2
+    D --- note3
+    E --- note4
+```
+
+Additional HTML processing settings:
+- **End Tags** (`html_end_tags`): Defines where chunks can logically end
+- **Chunk Sizing**: Respects catalog's min/max chunk sizes while maintaining content coherence
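+
+Putting the options together, an HTML processor configuration could be sketched as follows. The key names and most values are taken from the option descriptions above. Writing each value as a comma-separated string is an assumption, and the `html_end_tags` value is purely illustrative.
+
+```json
+{
+  "html_included_elements": "article, main",
+  "html_excluded_elements": "header, footer, nav, script",
+  "html_tags": "p, h1, h2, h3, table",
+  "html_excluded_classes": "sidebar, advertisement",
+  "html_end_tags": "p, li, table"
+}
+```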
+
+### PDF Processor
+- Handles PDF document processing
+- No additional configuration required
+- Automatically extracts text and maintains document structure
+
+### Markdown Processor
+- Handles Markdown document processing 
+- No additional configuration required
+- Automatically extracts text and maintains document structure
+
+### Docx Processor
+The Docx processor provides detailed control over how Microsoft Word (Docx) documents are handled through various
+configuration options:
+
+1. **Extract Comments**
+   - Whether to include document comments in the markdown
+   - Default: False
+2. **Extract Headers/Footers**
+   - Whether to include headers and footers in the markdown
+   - Default: False
+3. **Preserve Formatting**
+   - Whether to preserve bold, italic, and other text formatting
+   - Default: True
+4. **List Style**
+   - How to format lists
+   - One of ["dash", "asterisk", "plus"]
+   - Default value: "dash"
+5. **Image Handling**
+   - How to handle embedded images
+   - One of ["skip", "extract", "placeholder"]
+   - Default value: "skip"
+6. **Table Alignment**
+   - How to align table contents
+   - One of ["left", "center", "preserve"]
+   - Default value: "left"
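+
+The defaults listed above can be summarized in a single fragment. The key names below are hypothetical, derived from the option labels; only the default values come from the list.
+
+```json
+{
+  "extract_comments": false,
+  "extract_headers_footers": false,
+  "preserve_formatting": true,
+  "list_style": "dash",
+  "image_handling": "skip",
+  "table_alignment": "left"
+}
+```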
+
+### Audio Processor
+- Supports mp3, mp4, and ogg formats
+- Automatically transcribes audio content
+- No additional configuration required
+
+## Managing Processors
+
+### Creating Processors
+
+When creating a processor, consider:
+1. Which catalog it will serve
+2. What file type it needs to handle
+3. Whether you need a sub-file type
+4. Any specific configuration needed (especially for HTML)
+
+### Modifying Processors
+
+- Processors can be modified after creation
+- Changes don't automatically apply to already processed documents
+- Documents must be manually reprocessed to apply new processor settings
+
+### Processor Tuning
+
+The tuning feature is available for implementation teams to:
+- Generate additional processing logs
+- Gain insights into exact processing behavior
+- Fine-tune processor behavior for specific contexts
+
+Note: Tuning logs are only accessible to implementation teams, not end users.
+
+## Best Practices
+
+1. **Plan Your Structure**
+   - Define clear sub-file types based on document characteristics
+   - Consider document structure when configuring HTML processors
+
+2. **HTML Processing**
+   - Start with broader included elements and narrow down
+   - Test configuration with representative documents
+   - Use excluded classes for fine-grained control
+
+3. **General Tips**
+   - Keep file sizes under the 50MB limit
+   - Document your processor configurations
+   - Plan for periodic review of processor settings
+
+## Future Extensions
+
+The processor framework is designed for extensibility:
+- Additional processor types planned
+- Metadata fields ready for future features
+- Configuration options may be expanded
--- a/docs/Library/retrievers.md
+++ b/docs/Library/retrievers.md
@@ -0,0 +1,185 @@
+---
+id: retrievers
+title: Understanding Retrievers
+description: Learn how retrievers find and extract relevant information from your documents
+sidebar_label: Retrievers
+sidebar_position: 3
+---
+
+# Understanding Retrievers
+
+## Overview
+
+Retrievers are essential components in Evie's Library that help find and extract relevant information from your documents. 
+Think of retrievers as intelligent search engines that understand the meaning behind your questions and find the most 
+relevant content from your stored documents.
+
+```mermaid
+classDiagram
+    class Catalog {
+        +id: Integer
+        +name: String
+        +description: Text
+        +type: String
+        +min_chunk_size: Integer
+        +max_chunk_size: Integer
+        +user_metadata: JSON
+    }
+    
+    class Retriever {
+        +id: Integer
+        +name: String
+        +description: Text
+        +catalog_id: Integer
+        +type: String
+        +tuning: Boolean
+        +configuration: JSON
+        +arguments: JSON
+    }
+    
+    class StandardRAGRetriever {
+        +configuration
+        es_k: Integer
+        es_similarity_threshold: Float
+        +arguments
+        query: String
+    }
+    
+    class DossierRetriever {
+        +configuration
+        es_k: Integer
+        es_similarity_threshold: Float
+        tag_conditions: JSON
+        +arguments
+        query: String
+    }
+    
+    Catalog "1" -- "*" Retriever : has
+    Retriever <|-- StandardRAGRetriever
+    Retriever <|-- DossierRetriever
+    
+    note for StandardRAGRetriever "Default similarity threshold: 0.3<br>Default es_k: 8"
+    note for DossierRetriever "Coming soon<br>Specialized for Dossier catalogs"
+```
+
+## Key Concepts
+
+### What is a Retriever?
+
+A retriever is responsible for:
+- Understanding the meaning of your questions
+- Searching through document chunks in your catalog
+- Finding the most relevant information based on semantic similarity
+- Providing context for Evie's responses
+
+```mermaid
+flowchart LR
+    A[User Question] --> B[Retriever]
+    B --> C[Document Chunks]
+    C --> D[Most Relevant Information]
+    D --> E[Evie's Response]
+    
+    style A fill:#9c2d66,stroke:#333,stroke-width:2px
+    style B fill:#423372,stroke:#333,stroke-width:2px
+    style C fill:#423372,stroke:#333,stroke-width:2px
+    style D fill:#423372,stroke:#333,stroke-width:2px
+    style E fill:#9c2d66,stroke:#333,stroke-width:2px
+```
+
+### How Retrievers Work
+
+When you ask Evie a question, the retriever:
+1. Analyzes your question to understand its meaning
+2. Compares it with stored document chunks
+3. Assigns similarity scores to each chunk
+4. Returns the most relevant chunks based on configuration settings
+
+## Types of Retrievers
+
+### Standard RAG Retriever
+
+The Standard RAG (Retrieval-Augmented Generation) Retriever is the default option suitable for most use cases. It 
+searches through all documents in a catalog to find relevant information.
+
+Configuration options include:
+- **Maximum Results (es_k)**: Controls how many document chunks to retrieve (default: 8)
+- **Similarity Threshold (es_similarity_threshold)**: Determines how closely chunks must match your question (default: 0.3)
+  - Lower threshold = stricter matching
+  - Higher threshold = more permissive matching
+
+### Dossier Retriever (Coming Soon)
+
+A specialized retriever for Dossier catalogs that will allow:
+- Filtering by document tags
+- Creating specific "viewpoints" based on tag combinations
+- Combining semantic search with tag-based filtering
+
+## Setting Up Retrievers
+
+### Creating a New Retriever
+
+To create a retriever:
+1. Enter standard values such as name and description
+2. Select the target catalog
+3. Choose the retriever type
+4. After saving, you can set the configuration options specific to the chosen retriever type
+
+### Configuration Best Practices
+
+1. **Similarity Threshold Tuning**:
+   - Start with the default 0.3 threshold
+   - If receiving too much information: Lower the threshold
+   - If receiving too little information: Raise the threshold
+
+2. **Multiple Retrievers**:
+   You can create multiple retrievers for the same catalog to serve different purposes. For example:
+   - A broad retriever with higher threshold for general questions
+   - A strict retriever with lower threshold for specific queries
+   - Different retrievers for different document subsets (in Dossier catalogs)
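+
+As a sketch of the multiple-retriever setup, the two definitions below target the same catalog with different settings. They follow this page's convention that a higher threshold is more permissive and a lower one is stricter. The names and the specific `es_k` and threshold values are illustrative assumptions, not recommended settings.
+
+```json
+[
+  {
+    "name": "Broad Overview Retriever",
+    "type": "STANDARD_RAG",
+    "configuration": {
+      "es_k": 12,
+      "es_similarity_threshold": 0.5
+    }
+  },
+  {
+    "name": "Strict Lookup Retriever",
+    "type": "STANDARD_RAG",
+    "configuration": {
+      "es_k": 5,
+      "es_similarity_threshold": 0.2
+    }
+  }
+]
+```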
+
+## Practical Examples
+
+### Standard RAG Retriever Example
+
+```json
+{
+  "name": "General Knowledge Retriever",
+  "type": "STANDARD_RAG",
+  "configuration": {
+    "es_k": 8,
+    "es_similarity_threshold": 0.3
+  }
+}
+```
+
+### Future Dossier Retriever Example
+
+```json
+{
+  "name": "Quarterly Reports 2024",
+  "type": "DOSSIER_RAG",
+  "configuration": {
+    "es_k": 8,
+    "es_similarity_threshold": 0.3,
+    "tag_conditions": {
+      "document_type": "quarterly_report",
+      "year": 2024
+    }
+  }
+}
+```
+
+## Tips for Optimal Retrieval
+
+1. **Name Retrievers Clearly**:
+   Use descriptive names that indicate their purpose and configuration
+
+2. **Monitor Performance**:
+   - If answers are missing important information, consider:
+     - Increasing the similarity threshold
+     - Increasing the maximum results (es_k)
+   - If answers contain irrelevant information, consider:
+     - Decreasing the similarity threshold
+     - Decreasing the maximum results
+
+3. **Use Multiple Retrievers**:
+   Create specialized retrievers for different use cases within the same catalog