---
id: processors
title: Understanding Processors
description: Learn how processors handle different types of files in Evie's Library
sidebar_label: Processors
sidebar_position: 2
---
# Understanding Processors
## Overview
Processors are essential components in Evie's Library that determine how different types of files (for example HTML, PDF,
or audio) are processed and stored. Each processor is configured for a specific catalog and handles one or more file
types, ensuring that content is properly extracted, chunked, and stored according to your business needs.
The terms *chunk* and *chunking* are used throughout the remainder of this chapter. Chunking is the process of
splitting large files into smaller pieces for storage. This lets Evie capture the semantics of each piece more
precisely and return smaller, more specific pieces of content in response to a question.
Evie uses proprietary algorithms to process and chunk files intelligently, optimizing the knowledge in her Library.
The relationship between Catalogs and Processors is shown in this diagram:
```mermaid
classDiagram
class Catalog {
+id: Integer
+name: String
+description: Text
+type: String
+min_chunk_size: Integer
+max_chunk_size: Integer
+user_metadata: JSON
+system_metadata: JSON
+configuration: JSON
}
class Processor {
+id: Integer
+name: String
+description: Text
+type: String
+sub_file_type: String
+tuning: Boolean
+user_metadata: JSON
+system_metadata: JSON
+configuration: JSON
}
Catalog "1" -- "*" Processor : has
note for Processor "Unique per catalog\nConfigured for specific file types"
```
## Key Concepts
### Processor Basics
- **Catalog-Specific**: Each processor is uniquely defined for a specific catalog and cannot be shared across catalogs
- **File Type Support**: Currently supports HTML, PDF, Markdown, Docx and audio files
- **Optional Usage**: If no processor is specified for a file type, files of that type will not be processed
- **File Size Limits**: Maximum file size of 50MB for all file types
### Sub-File Types
Sub-file types allow you to process different document types within the same catalog using different rules. For example:
- Quarterly reports might have a different structure than project definitions
- News articles might need different processing than product documentation
- Internal memos might require different handling than public announcements
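The rules above (one processor per catalog and file type, optional sub-file types, no processing without a matching processor, and the 50 MB limit) can be sketched as a small lookup. This is only an illustration, not Evie's implementation: the `select_processor` function, the type labels, and the simplified `Processor` record are hypothetical, and the sketch assumes that a file with no exactly matching processor is left unprocessed and that 50 MB means 50 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# 50 MB limit from this page; binary megabytes are an assumption.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


@dataclass(frozen=True)
class Processor:
    # Hypothetical, simplified stand-in for a processor record.
    catalog_id: int
    name: str
    type: str                # illustrative labels such as "HTML" or "PDF"
    sub_file_type: str = ""  # empty string means "no sub-file type"


def select_processor(
    processors: Iterable[Processor],
    catalog_id: int,
    file_type: str,
    sub_file_type: str = "",
    size_bytes: int = 0,
) -> Optional[Processor]:
    """Return the matching processor, or None if the file is not processed."""
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return None  # over the size limit
    for processor in processors:
        if (
            processor.catalog_id == catalog_id
            and processor.type == file_type
            and processor.sub_file_type == sub_file_type
        ):
            return processor
    return None  # no processor for this file type: the file is skipped


processors = [
    Processor(1, "Quarterly reports", "PDF", "quarterly_report"),
    Processor(1, "Generic PDF", "PDF"),
    Processor(2, "Website pages", "HTML"),
]
chosen = select_processor(processors, 1, "PDF", "quarterly_report")
```

In this sketch, a quarterly report uploaded to catalog 1 is handled by the "Quarterly reports" processor, while an HTML file uploaded to the same catalog is skipped because catalog 1 has no HTML processor.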
### General Configuration Options
General configuration options, along with processor-specific configuration options, let you, the end user, tailor the
way documents are processed and chunked. This combines your knowledge of your documents' structure with Evie's
processing logic, so that knowledge is stored optimally in Evie's Library.
These general configuration options are available in the configuration of most processors and introduce additional
rules for chunking.
1. **Chunking Heading Level**
   - Defines the heading level down to which headings are used to split a file into meaningful chunks
   - Default value: 2
   - You can specify a value between 1 and 6
2. **Chunking Patterns**
   - Defines 'business sections' as regular expressions, allowing you to force a split whenever a section heading matches a given regular expression
   - A practical example: `\bProfile\b` matches headings that contain the word 'Profile'. This is useful if you have a lot of structured content describing your customers: each customer's profile then ends up in a separate chunk.
   - You can specify any number of patterns (one per line)
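To make the two options above concrete, here is a minimal sketch of how they could interact on Markdown input. It is not Evie's proprietary chunking algorithm. The function name `split_into_chunks` is hypothetical, and the sketch assumes that a heading level of 2 means "split at level 1 and level 2 headings" and that a pattern match forces a split at any heading level. It also ignores fenced code blocks and chunk size limits.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_into_chunks(markdown_text, heading_level=2, patterns=()):
    """Split text at headings up to heading_level, or at any heading
    whose title matches one of the given regular expressions."""
    compiled = [re.compile(pattern) for pattern in patterns]
    chunks, current = [], []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            forced = any(regex.search(title) for regex in compiled)
            # Start a new chunk at this heading, closing the previous one.
            if (level <= heading_level or forced) and current:
                chunks.append("\n".join(current))
                current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


document = "\n".join([
    "# Customers",
    "Intro",
    "## Acme",
    "About Acme",
    "### Profile",
    "Acme profile",
    "### History",
    "Founded 1990",
])
chunks = split_into_chunks(document, heading_level=2, patterns=[r"\bProfile\b"])
```

With the pattern in place, the sample document yields three chunks, the last one starting at `### Profile`. Without the pattern, the level 3 headings stay inside the `## Acme` chunk and only two chunks are produced.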
## Types of Processors
### HTML Processor
The HTML processor provides detailed control over how HTML content is processed through various configuration options.
These options are applied in a fixed order:
1. **Included Elements** (`html_included_elements`)
   - Defines the main sections of the document to process
   - Only content within these elements will be considered
   - Example: `article, main`
2. **Excluded Elements** (`html_excluded_elements`)
   - Elements to remove from included sections
   - Useful for navigation, headers, footers, etc.
   - Example: `header, footer, nav, script`
3. **HTML Tags** (`html_tags`)
   - Specific content tags to process
   - Determines what content is actually stored
   - Example: `p, h1, h2, h3, table`
4. **Excluded Classes** (`html_excluded_classes`)
   - Classes of elements to exclude
   - Useful for removing specific sections like sidebars
   - Example: `sidebar, advertisement`
The following process flow shows these steps:
```mermaid
flowchart TD
A[HTML Document] --> B[Check Included Elements]
B --> C[Remove Excluded Elements]
C --> D[Process HTML Tags]
D --> E[Remove Excluded Classes]
style A fill:#9c2d66,stroke:#333,stroke-width:2px
style B fill:#423372,stroke:#333,stroke-width:2px
style C fill:#423372,stroke:#333,stroke-width:2px
style D fill:#423372,stroke:#333,stroke-width:2px
style E fill:#423372,stroke:#333,stroke-width:2px
note1[Only process content within<br>included elements]
note2[Remove specified elements<br>from included content]
note3[Include only specified<br>HTML tags]
note4[Remove content with<br>excluded classes]
B --- note1
C --- note2
D --- note3
E --- note4
```
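The effect of the four filters can be approximated with Python's standard `html.parser` module. The sketch below is an illustration under stated assumptions, not the processor's actual code: the `FilteringExtractor` class and `extract_text` function are hypothetical, the input is assumed to be well-formed HTML, and the filters are evaluated together in a single pass, which gives the same result as applying them one after another for this simple case.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never go on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class FilteringExtractor(HTMLParser):
    def __init__(self, included, excluded, tags, excluded_classes):
        super().__init__()
        self.included = set(included)
        self.excluded = set(excluded)
        self.tags = set(tags)
        self.excluded_classes = set(excluded_classes)
        # One (tag, in_included, dropped, in_content_tag) entry per open element.
        self.stack = []
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        parent = self.stack[-1] if self.stack else (None, False, False, False)
        classes = set((dict(attrs).get("class") or "").split())
        in_included = parent[1] or tag in self.included
        dropped = (parent[2] or tag in self.excluded
                   or bool(classes & self.excluded_classes))
        in_content = parent[3] or tag in self.tags
        self.stack.append((tag, in_included, dropped, in_content))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            return
        _, in_included, dropped, in_content = self.stack[-1]
        text = data.strip()
        if text and in_included and in_content and not dropped:
            self.pieces.append(text)


def extract_text(html, included, excluded, tags, excluded_classes):
    """Return the text pieces that survive all four filters."""
    parser = FilteringExtractor(included, excluded, tags, excluded_classes)
    parser.feed(html)
    parser.close()
    return parser.pieces


sample = (
    "<html><body><header><p>Site header</p></header>"
    "<main><h1>Title</h1><nav><p>Menu</p></nav><p>Body text</p>"
    '<div class="sidebar"><p>Ad</p></div><span>Loose</span></main>'
    "<p>Outside</p></body></html>"
)
pieces = extract_text(
    sample,
    included=["article", "main"],
    excluded=["header", "footer", "nav", "script"],
    tags=["p", "h1", "h2", "h3", "table"],
    excluded_classes=["sidebar", "advertisement"],
)
```

For the sample page, only `Title` and `Body text` are kept: the header and the final paragraph fall outside `main`, the `nav` element and the `sidebar` class are excluded, and the `span` is not one of the configured HTML tags.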
Additional HTML processing settings:
- **End Tags** (`html_end_tags`): Defines where chunks can logically end
- **Chunk Sizing**: Respects the catalog's minimum and maximum chunk sizes while maintaining content coherence
### PDF Processor
- Handles PDF document processing
- No additional configuration required
- Automatically extracts text and maintains document structure
### Markdown Processor
- Handles Markdown document processing
- No additional configuration required
- Automatically extracts text and maintains document structure
### Docx Processor
The Docx processor provides detailed control over how Microsoft Word (`.docx`) documents are converted to Markdown
through various configuration options:
1. **Extract Comments**
   - Whether to include document comments in the markdown
   - Default: False
2. **Extract Headers/Footers**
   - Whether to include headers and footers in the markdown
   - Default: False
3. **Preserve Formatting**
   - Whether to preserve bold, italic, and other text formatting
   - Default: True
4. **List Style**
   - How to format lists
   - One of ["dash", "asterisk", "plus"]
   - Default value: "dash"
5. **Image Handling**
   - How to handle embedded images
   - One of ["skip", "extract", "placeholder"]
   - Default value: "skip"
6. **Table Alignment**
   - How to align table contents
   - One of ["left", "center", "preserve"]
   - Default value: "left"
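The options above can be summarized as a small configuration object with validation. The defaults and allowed values come from the list above. The field names and the `DocxProcessorConfig` class are hypothetical, since this page gives only the display names of the options, so treat this as a sketch and not as the actual configuration schema.

```python
from dataclasses import dataclass

LIST_STYLES = ("dash", "asterisk", "plus")
IMAGE_HANDLING = ("skip", "extract", "placeholder")
TABLE_ALIGNMENT = ("left", "center", "preserve")


@dataclass
class DocxProcessorConfig:
    # Field names are hypothetical; defaults mirror the documented ones.
    extract_comments: bool = False
    extract_headers_footers: bool = False
    preserve_formatting: bool = True
    list_style: str = "dash"
    image_handling: str = "skip"
    table_alignment: str = "left"

    def __post_init__(self):
        # Reject values outside the documented choices.
        choices = (
            ("list_style", LIST_STYLES),
            ("image_handling", IMAGE_HANDLING),
            ("table_alignment", TABLE_ALIGNMENT),
        )
        for field_name, allowed in choices:
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name} must be one of {allowed}, got {value!r}"
                )


default_config = DocxProcessorConfig()
custom_config = DocxProcessorConfig(extract_comments=True, image_handling="placeholder")
```

Creating the object with an unsupported value, such as `list_style="bullet"`, raises a `ValueError` in this sketch.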
### Audio Processor
- Supports mp3, mp4, and ogg formats
- Automatically transcribes audio content
- No additional configuration required
## Managing Processors
### Creating Processors
When creating a processor, consider:
1. Which catalog it will serve
2. What file type it needs to handle
3. Whether you need a sub-file type
4. Any specific configuration needed (especially for HTML)
### Modifying Processors
- Processors can be modified after creation
- Changes don't automatically apply to already processed documents
- Documents must be manually reprocessed to apply new processor settings
### Processor Tuning
The tuning feature is available for implementation teams to:
- Generate additional processing logs
- Gain insights into exact processing behavior
- Fine-tune processor behavior for specific contexts
Note: Tuning logs are only accessible to implementation teams, not end users.
## Best Practices
1. **Plan Your Structure**
   - Define clear sub-file types based on document characteristics
   - Consider document structure when configuring HTML processors
2. **HTML Processing**
   - Start with broader included elements and narrow down
   - Test configuration with representative documents
   - Use excluded classes for fine-grained control
3. **General Tips**
   - Keep file sizes under the 50MB limit
   - Document your processor configurations
   - Plan for periodic review of processor settings
## Future Extensions
The processor framework is designed for extensibility:
- Additional processor types planned
- Metadata fields ready for future features
- Configuration options may be expanded