Initial commit

2025-12-11 14:43:16 +01:00
commit 7ef62972f3
36 changed files with 23641 additions and 0 deletions
--- a/docs/Library/processors.md
+++ b/docs/Library/processors.md
@@ -0,0 +1,230 @@
+---
+id: processors
+title: Understanding Processors
+description: Learn how processors handle different types of files in Evie's Library
+sidebar_label: Processors
+sidebar_position: 2
+---
+
+# Understanding Processors
+
+## Overview
+
+Processors are the components in Evie's Library that determine how different types of files (HTML, PDF, Markdown, Docx,
+audio) are processed and stored. Each processor is configured for a specific catalog and handles one or more file types,
+ensuring that content is extracted, chunked, and stored according to your business needs.
+
+The terms *chunk* and *chunking* are used throughout the remainder of this chapter. Chunking is the process of
+splitting large files into smaller pieces for storage. This lets Evie capture the semantics of each piece more precisely
+and return smaller, more specific pieces of content in response to a question.
+
+Evie uses proprietary algorithms to process and chunk files, with the aim of optimizing how knowledge is stored in her Library.
+
+The relationship between Catalogs and Processors is shown in this diagram:
+
+```mermaid
+classDiagram
+    class Catalog {
+        +id: Integer
+        +name: String
+        +description: Text
+        +type: String
+        +min_chunk_size: Integer
+        +max_chunk_size: Integer
+        +user_metadata: JSON
+        +system_metadata: JSON
+        +configuration: JSON
+    }
+    class Processor {
+        +id: Integer
+        +name: String
+        +description: Text
+        +type: String
+        +sub_file_type: String
+        +tuning: Boolean
+        +user_metadata: JSON
+        +system_metadata: JSON
+        +configuration: JSON
+    }
+    
+    Catalog "1" -- "*" Processor : has
+    note for Processor "Unique per catalog\nConfigured for specific file types"
+```
+
+## Key Concepts
+
+### Processor Basics
+
+- **Catalog-Specific**: Each processor is uniquely defined for a specific catalog and cannot be shared across catalogs
+- **File Type Support**: Currently supports HTML, PDF, Markdown, Docx and audio files
+- **Optional Usage**: If no processor is specified for a file type, files of that type will not be processed
+- **File Size Limits**: Maximum file size of 50MB for all file types
+
+### Sub-File Types
+
+Sub-file types allow you to process different document types within the same catalog using different rules. For example:
+- Quarterly reports might have a different structure than project definitions
+- News articles might need different processing than product documentation
+- Internal memos might require different handling than public announcements
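A minimal sketch of how a processor could be selected for an incoming file, under stated assumptions. The fields `name`, `type` and `sub_file_type` come from the class diagram above; the function names, the exact-match rule on sub-file type, and reading "50MB" as 50 MiB are all hypothetical and not confirmed by Evie's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: the documented 50MB limit is interpreted as 50 MiB.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


@dataclass(frozen=True)
class Processor:
    name: str
    type: str                 # file type, e.g. "HTML" or "PDF"
    sub_file_type: str = ""   # empty string means "no sub-file type"


def find_processor(processors: Iterable[Processor], file_type: str,
                   sub_file_type: str = "") -> Optional[Processor]:
    """Return the processor matching file type and sub-file type, or None."""
    for processor in processors:
        if processor.type == file_type and processor.sub_file_type == sub_file_type:
            return processor
    return None


def can_process(processors: Iterable[Processor], file_type: str,
                size_bytes: int, sub_file_type: str = "") -> bool:
    """A file is processed only if it fits the size limit and a processor exists."""
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False
    return find_processor(processors, file_type, sub_file_type) is not None


# Two PDF processors in one catalog, distinguished by sub-file type.
catalog_processors = [
    Processor("Default PDF", "PDF"),
    Processor("Quarterly reports", "PDF", "quarterly_report"),
]
```

With this setup, a PDF tagged `quarterly_report` is routed to its own processor, while an HTML file finds no processor and is therefore skipped.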
+
+### General Configuration Options
+General configuration options, together with processor-specific options, let you, the end user, tailor the way
+documents are processed and chunked. They combine your knowledge of document structure with ours, so that knowledge is
+stored optimally in Evie's Library.
+
+The general configuration options below are available in most processors and add extra rules that control where
+chunking takes place.
+
+1. **Chunking Heading Level**
+   - Defines the heading level used to split a file into meaningful chunks
+   - Default value: 2
+   - You can specify a value between 1 and 6
+2. **Chunking Patterns**
+   - Defines 'business sections' as regular expressions, allowing you to force a split wherever a section heading matches one of them
+   - A practical example: `\bProfile\b` checks whether the word 'Profile' is present in the heading, and could be used if you have a lot of structured content describing your customers. Each customer profile is then stored in a separate chunk.
+   - You can specify any number of patterns, one per line
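The two general options above can be illustrated with a short sketch. The function `is_split_point` and its parameter names are hypothetical, and the rule that headings at or above the configured level start a new chunk is an assumption about how the heading level is interpreted, not a description of Evie's proprietary chunking logic.

```python
import re
from typing import Sequence


def is_split_point(heading_level: int, heading_text: str,
                   chunking_heading_level: int = 2,
                   chunking_patterns: Sequence[str] = ()) -> bool:
    """Decide whether a heading starts a new chunk (assumed logic)."""
    if not 1 <= chunking_heading_level <= 6:
        raise ValueError("chunking_heading_level must be between 1 and 6")
    # Assumption: headings at or above the configured level always split.
    if heading_level <= chunking_heading_level:
        return True
    # A heading matching any chunking pattern forces a split at any level.
    return any(re.search(pattern, heading_text) for pattern in chunking_patterns)


# Patterns are configured one per line, so split the raw setting on newlines.
patterns = r"\bProfile\b".splitlines()

headings = [
    (1, "Customers"),
    (2, "Acme Corp"),
    (3, "Profile of Acme"),   # matches \bProfile\b, so it forces a split
    (3, "Contact details"),   # level 3, no pattern match
    (3, "Profiles"),          # no word boundary after "Profile", no match
]
splits = [text for level, text in headings
          if is_split_point(level, text, 2, patterns)]
print(splits)  # ['Customers', 'Acme Corp', 'Profile of Acme']
```

Note that `\b` is a word boundary, so the pattern matches "Profile" as a whole word but not "Profiles".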
+
+## Types of Processors
+
+### HTML Processor
+The HTML processor provides detailed control over how HTML content is processed through various configuration options.
+These options are applied in a fixed order:
+
+1. **Included Elements** (`html_included_elements`)
+   - Defines the main sections of the document to process
+   - Only content within these elements will be considered
+   - Example: `article, main`
+
+2. **Excluded Elements** (`html_excluded_elements`)
+   - Elements to remove from included sections
+   - Useful for navigation, headers, footers, etc.
+   - Example: `header, footer, nav, script`
+
+3. **HTML Tags** (`html_tags`)
+   - Specific content tags to process
+   - Determines what content is actually stored
+   - Example: `p, h1, h2, h3, table`
+
+4. **Excluded Classes** (`html_excluded_classes`)
+   - Classes of elements to exclude
+   - Useful for removing specific sections like sidebars
+   - Example: `sidebar, advertisement`
+
+The following process flow shows this order:
+
+```mermaid
+flowchart TD
+    A[HTML Document] --> B[Check Included Elements]
+    B --> C[Remove Excluded Elements]
+    C --> D[Process HTML Tags]
+    D --> E[Remove Excluded Classes]
+    
+    style A fill:#9c2d66,stroke:#333,stroke-width:2px
+    style B fill:#423372,stroke:#333,stroke-width:2px
+    style C fill:#423372,stroke:#333,stroke-width:2px
+    style D fill:#423372,stroke:#333,stroke-width:2px
+    style E fill:#423372,stroke:#333,stroke-width:2px
+
+    note1[Only process content within<br>included elements]
+    note2[Remove specified elements<br>from included content]
+    note3[Include only specified<br>HTML tags]
+    note4[Remove content with<br>excluded classes]
+
+    B --- note1
+    C --- note2
+    D --- note3
+    E --- note4
+```
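The four filtering steps in the flow above can be sketched with Python's standard `html.parser` module. This is an illustration under assumptions, not Evie's implementation: the class `FilteringExtractor` and the helpers `parse_list` and `extract_content` are hypothetical, and the settings are assumed to be comma-separated strings as in the examples given for each option.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def parse_list(value):
    """Turn a setting such as 'article, main' into a set of names."""
    return {item.strip() for item in value.split(",") if item.strip()}


class FilteringExtractor(HTMLParser):
    def __init__(self, included, excluded, tags, excluded_classes):
        super().__init__(convert_charrefs=True)
        self.included, self.excluded = included, excluded
        self.tags, self.excluded_classes = tags, excluded_classes
        self.stack = []    # (tag, classes) for every currently open element
        self.results = []  # (tag, text) pairs that pass all four steps

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        names = [name for name, _ in self.stack]
        if not any(n in self.included for n in names):   # 1. included elements
            return
        if any(n in self.excluded for n in names):       # 2. excluded elements
            return
        kept = [n for n in names if n in self.tags]      # 3. HTML tags
        if not kept:
            return
        if any(c & self.excluded_classes for _, c in self.stack):  # 4. classes
            return
        self.results.append((kept[-1], text))


def extract_content(html, included, excluded, tags, excluded_classes):
    parser = FilteringExtractor(parse_list(included), parse_list(excluded),
                                parse_list(tags), parse_list(excluded_classes))
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """<html><body>
<nav><p>Menu</p></nav>
<main>
<header><h1>Site banner</h1></header>
<h2>Pricing</h2>
<p>Plans start small.</p>
<div class="sidebar"><p>Related links</p></div>
<span>Untracked text</span>
</main>
</body></html>"""

chunks = extract_content(SAMPLE, "article, main", "header, footer, nav, script",
                         "p, h1, h2, h3, table", "sidebar, advertisement")
print(chunks)  # [('h2', 'Pricing'), ('p', 'Plans start small.')]
```

Only the heading and paragraph inside `main` survive: the navigation sits outside the included element, the banner is inside an excluded `header`, the `span` is not a listed tag, and the related links carry an excluded class.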
+
+Additional HTML processing settings:
+- **End Tags** (`html_end_tags`): Defines where chunks can logically end
+- **Chunk Sizing**: Respects catalog's min/max chunk sizes while maintaining content coherence
+
+### PDF Processor
+- Handles PDF document processing
+- No additional configuration required
+- Automatically extracts text and maintains document structure
+
+### Markdown Processor
+- Handles Markdown document processing 
+- No additional configuration required
+- Automatically extracts text and maintains document structure
+
+### Docx Processor
+The Docx processor provides detailed control over how Microsoft Word (`.docx`) documents are handled through various
+configuration options:
+
+1. **Extract Comments**
+   - Whether to include document comments in the Markdown output
+   - Default: False
+2. **Extract Headers/Footers**
+   - Whether to include headers and footers in the Markdown output
+   - Default: False
+3. **Preserve Formatting**
+   - Whether to preserve bold, italic, and other text formatting
+   - Default: True
+4. **List Style**
+   - How to format lists
+   - One of ["dash", "asterisk", "plus"]
+   - Default value: "dash"
+5. **Image Handling**
+   - How to handle embedded images
+   - One of ["skip", "extract", "placeholder"]
+   - Default value: "skip"
+6. **Table Alignment**
+   - How to align table contents
+   - One of ["left", "center", "preserve"]
+   - Default value: "left"
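The Docx options and their defaults can be captured in a small validated structure. This is a sketch only: the class `DocxProcessorConfig` and the snake_case field names are hypothetical, since the source lists the options by display name and does not give their configuration keys.

```python
from dataclasses import dataclass

# Allowed values as listed for each option above.
LIST_STYLES = ("dash", "asterisk", "plus")
IMAGE_HANDLING = ("skip", "extract", "placeholder")
TABLE_ALIGNMENT = ("left", "center", "preserve")


@dataclass
class DocxProcessorConfig:
    # Field names are assumptions; defaults follow the documented values.
    extract_comments: bool = False
    extract_headers_footers: bool = False
    preserve_formatting: bool = True
    list_style: str = "dash"
    image_handling: str = "skip"
    table_alignment: str = "left"

    def __post_init__(self):
        # Reject values outside the documented choices.
        choices = (
            ("list_style", LIST_STYLES),
            ("image_handling", IMAGE_HANDLING),
            ("table_alignment", TABLE_ALIGNMENT),
        )
        for field_name, allowed in choices:
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name} must be one of {allowed}, got {value!r}"
                )


# Defaults only, then a configuration that keeps images as placeholders.
default_config = DocxProcessorConfig()
custom_config = DocxProcessorConfig(image_handling="placeholder",
                                    extract_comments=True)
```

Validating at construction time means a typo such as `list_style="dashes"` fails immediately instead of surfacing later during processing.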
+
+### Audio Processor
+- Supports mp3, mp4, and ogg formats
+- Automatically transcribes audio content
+- No additional configuration required
+
+## Managing Processors
+
+### Creating Processors
+
+When creating a processor, consider:
+1. Which catalog it will serve
+2. What file type it needs to handle
+3. Whether you need a sub-file type
+4. Any specific configuration needed (especially for HTML)
+
+### Modifying Processors
+
+- Processors can be modified after creation
+- Changes don't automatically apply to already processed documents
+- Documents must be manually reprocessed to apply new processor settings
+
+### Processor Tuning
+
+The tuning feature is available for implementation teams to:
+- Generate additional processing logs
+- Gain insights into exact processing behavior
+- Fine-tune processor behavior for specific contexts
+
+Note: Tuning logs are only accessible to implementation teams, not end users.
+
+## Best Practices
+
+1. **Plan Your Structure**
+   - Define clear sub-file types based on document characteristics
+   - Consider document structure when configuring HTML processors
+
+2. **HTML Processing**
+   - Start with broader included elements and narrow down
+   - Test configuration with representative documents
+   - Use excluded classes for fine-grained control
+
+3. **General Tips**
+   - Keep file sizes under the 50MB limit
+   - Document your processor configurations
+   - Plan for periodic review of processor settings
+
+## Future Extensions
+
+The processor framework is designed for extensibility:
+- Additional processor types planned
+- Metadata fields ready for future features
+- Configuration options may be expanded