Initial commit
docs/Library/processors.md
---
id: processors
title: Understanding Processors
description: Learn how processors handle different types of files in Evie's Library
sidebar_label: Processors
sidebar_position: 2
---

# Understanding Processors

## Overview

Processors are essential components in Evie's Library that determine how different types of files (HTML, PDF, Markdown,
Docx, and audio) are processed and stored. Each processor is configured for a specific catalog and handles one or more
file types, ensuring that content is properly extracted, chunked, and stored according to your business needs.

The terms *chunk* and *chunking* are used throughout the remainder of this chapter. Chunking is the process of
splitting large files into smaller pieces for storage. This lets Evie capture the semantics of each piece more
precisely and return smaller, more specific pieces of content in response to a question.

Evie uses proprietary algorithms to process and chunk files intelligently, optimizing the knowledge stored in her Library.

The relationship between Catalogs and Processors is shown in this diagram:


```mermaid
classDiagram
    class Catalog {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +min_chunk_size: Integer
        +max_chunk_size: Integer
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }
    class Processor {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +sub_file_type: String
        +tuning: Boolean
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }

    Catalog "1" -- "*" Processor : has
    note for Processor "Unique per catalog\nConfigured for specific file types"
```
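
The diagram can also be read as a simple data model. The sketch below mirrors it with Python dataclasses. The field names come from the diagram; the `processors` list, the optional `sub_file_type`, and the uniqueness rule in `add_processor` (one processor per `type` and `sub_file_type` pair within a catalog) are assumptions made for illustration and do not describe Evie's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Processor:
    id: int
    name: str
    type: str                             # file type, e.g. "HTML" or "PDF"
    sub_file_type: Optional[str] = None   # assumption: empty when no sub-file type is used
    description: str = ""
    tuning: bool = False
    user_metadata: dict[str, Any] = field(default_factory=dict)
    system_metadata: dict[str, Any] = field(default_factory=dict)
    configuration: dict[str, Any] = field(default_factory=dict)


@dataclass
class Catalog:
    id: int
    name: str
    min_chunk_size: int
    max_chunk_size: int
    description: str = ""
    type: str = ""
    user_metadata: dict[str, Any] = field(default_factory=dict)
    system_metadata: dict[str, Any] = field(default_factory=dict)
    configuration: dict[str, Any] = field(default_factory=dict)
    # Hypothetical in-memory stand-in for the "1 to many" relation in the diagram.
    processors: list[Processor] = field(default_factory=list)

    def add_processor(self, processor: Processor) -> None:
        """Attach a processor, rejecting duplicates for the same file type pair."""
        key = (processor.type, processor.sub_file_type)
        if any((p.type, p.sub_file_type) == key for p in self.processors):
            raise ValueError(f"catalog {self.name!r} already has a processor for {key}")
        self.processors.append(processor)
```
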

## Key Concepts

### Processor Basics

- **Catalog-Specific**: Each processor is defined for exactly one catalog and cannot be shared across catalogs
- **File Type Support**: Currently supports HTML, PDF, Markdown, Docx, and audio files
- **Optional Usage**: Processors are optional per file type; if no processor is specified for a file type, files of that type are not processed
- **File Size Limits**: Maximum file size of 50 MB for all file types

### Sub-File Types

Sub-file types allow you to process different kinds of documents within the same catalog using different rules. For example:

- Quarterly reports might have a different structure than project definitions
- News articles might need different processing than product documentation
- Internal memos might require different handling than public announcements

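
To make the rules above concrete, here is a minimal sketch of how a file could be matched to a processor. Everything in it is hypothetical: the `PROCESSORS` registry, the processor names, the `select_processor` function, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are assumptions for illustration, not Evie's API.

```python
from typing import Optional

MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB

# Hypothetical registry for a single catalog: (file type, sub-file type) -> processor name.
PROCESSORS = {
    ("PDF", None): "generic-pdf",
    ("PDF", "quarterly-report"): "quarterly-report-pdf",
    ("HTML", "news"): "news-html",
}


def select_processor(file_type: str, sub_file_type: Optional[str], size_bytes: int) -> Optional[str]:
    """Return the name of the matching processor, or None if the file is not processed."""
    if size_bytes > MAX_FILE_SIZE_BYTES:
        raise ValueError("file exceeds the 50 MB limit")
    # No entry for this combination means the file is skipped.
    return PROCESSORS.get((file_type, sub_file_type))
```

In this sketch a quarterly report in PDF form is routed to its own processor, while an audio file is skipped because the catalog has no audio processor configured.
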

### General Configuration Options

General configuration options, together with processor-specific options, let you, the end user, specialise the way
documents are processed and chunked. This allows us to combine your knowledge of document structure with ours, so that
knowledge is stored optimally in Evie's Library.

These general options are available in the configuration of most processors and introduce additional rules for chunking.

1. **Chunking Heading Level**
   - Defines how many heading levels are used to split a file into meaningful chunks
   - Default value: 2
   - You can specify a value between 1 and 6
2. **Chunking Patterns**
   - Defines 'business sections' as regular expressions, allowing you to force a split whenever a section heading matches one of them
   - A practical example: `\bProfile\b` checks whether the word 'Profile' is present in the title, and could be used if you have a lot of structured content describing your customers. Their profile will then be placed in a separate chunk.
   - You can specify any number of patterns, one per line

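
The two options can be illustrated with a small sketch. It is not Evie's proprietary algorithm; it assumes Markdown-style `#` headings, reads a heading level of 2 as "start a new chunk at every level 1 or level 2 heading", and lets any heading whose title matches a pattern force a split regardless of its level. The function name `chunk_markdown` and its parameters are hypothetical.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, heading_level: int = 2, patterns: tuple[str, ...] = ()) -> list[str]:
    """Split text at headings up to heading_level, or at headings matching a pattern."""
    compiled = [re.compile(p) for p in patterns]
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            forced = any(p.search(title) for p in compiled)
            # Close the running chunk before a qualifying heading.
            if (level <= heading_level or forced) and current:
                chunks.append("\n".join(current).strip())
                current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

With the default level of 2, a `### Profile` subsection stays inside its parent chunk; passing `patterns=(r"\bProfile\b",)` moves it into a chunk of its own. The sketch ignores edge cases such as `#` characters inside code blocks.
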

## Types of Processors

### HTML Processor

The HTML processor provides detailed control over how HTML content is processed through various configuration options.
These options are applied in a strict order:

1. **Included Elements** (`html_included_elements`)
   - Defines the main sections of the document to process
   - Only content within these elements is considered
   - Example: `article, main`

2. **Excluded Elements** (`html_excluded_elements`)
   - Elements to remove from the included sections
   - Useful for navigation, headers, footers, etc.
   - Example: `header, footer, nav, script`

3. **HTML Tags** (`html_tags`)
   - Specific content tags to process
   - Determines what content is actually stored
   - Example: `p, h1, h2, h3, table`

4. **Excluded Classes** (`html_excluded_classes`)
   - Classes of elements to exclude
   - Useful for removing specific sections such as sidebars
   - Example: `sidebar, advertisement`

The following process flow shows this order:


```mermaid
flowchart TD
    A[HTML Document] --> B[Check Included Elements]
    B --> C[Remove Excluded Elements]
    C --> D[Process HTML Tags]
    D --> E[Remove Excluded Classes]

    style A fill:#9c2d66,stroke:#333,stroke-width:2px
    style B fill:#423372,stroke:#333,stroke-width:2px
    style C fill:#423372,stroke:#333,stroke-width:2px
    style D fill:#423372,stroke:#333,stroke-width:2px
    style E fill:#423372,stroke:#333,stroke-width:2px

    note1[Only process content within<br>included elements]
    note2[Remove specified elements<br>from included content]
    note3[Include only specified<br>HTML tags]
    note4[Remove content with<br>excluded classes]

    B --- note1
    C --- note2
    D --- note3
    E --- note4
```
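
The four filters can be approximated with Python's standard `html.parser` module. This is a rough sketch under stated assumptions, not the processor's real implementation: the class `HtmlFilter` and the function `extract_blocks` are hypothetical names, all four checks are combined into a single pass for brevity rather than run as separate stages, and the output is simply the text of each kept element.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enter the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlFilter(HTMLParser):
    def __init__(self, included, excluded, tags, excluded_classes):
        super().__init__()
        self.included, self.excluded = set(included), set(excluded)
        self.tags, self.excluded_classes = set(tags), set(excluded_classes)
        self.stack = []    # (tag, in_included, in_excluded, in_kept_tag) per open element
        self.buffer = []   # text pieces of the kept element currently being read
        self.blocks = []   # finished text blocks

    def _state(self):
        return self.stack[-1][1:] if self.stack else (False, False, False)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inc, exc, kept = self._state()  # flags are inherited from the parent element
        classes = set((dict(attrs).get("class") or "").split())
        inc = inc or tag in self.included
        exc = exc or tag in self.excluded or bool(classes & self.excluded_classes)
        kept = kept or tag in self.tags
        self.stack.append((tag, inc, exc, kept))

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return  # stray closing tag
        was_kept = self._state()[2]
        while self.stack.pop()[0] != tag:
            pass  # also closes elements that were left open
        if was_kept and not self._state()[2]:
            self._flush()

    def handle_data(self, data):
        inc, exc, kept = self._state()
        if inc and not exc and kept:
            self.buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(text)
        self.buffer = []


def extract_blocks(html, included, excluded, tags, excluded_classes):
    parser = HtmlFilter(included, excluded, tags, excluded_classes)
    parser.feed(html)
    parser.close()
    return parser.blocks
```
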

Additional HTML processing settings:

- **End Tags** (`html_end_tags`): Defines the tags at which a chunk can logically end
- **Chunk Sizing**: Respects the catalog's minimum and maximum chunk sizes while maintaining content coherence

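
One way to picture chunk sizing is as a greedy merge of consecutive text blocks. The sketch below is an assumption-laden illustration, not Evie's algorithm: sizes are measured in characters (the real unit is not stated here), the function name `size_chunks` is hypothetical, and a single block longer than `max_size` is passed through unsplit.

```python
def size_chunks(blocks: list[str], min_size: int, max_size: int) -> list[str]:
    """Merge consecutive blocks until min_size is reached, never merging past max_size."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n{block}" if current else block
        if current and len(candidate) > max_size:
            # Merging would overshoot the maximum: close the running chunk first.
            chunks.append(current)
            current = block
        else:
            current = candidate
        if len(current) >= min_size:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # a trailing remainder may stay below min_size
    return chunks
```

In this sketch the maximum takes priority over the minimum: a short block is emitted on its own when merging it with the next block would exceed `max_size`.
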

### PDF Processor

- Handles PDF document processing
- No additional configuration required
- Automatically extracts text and maintains document structure

### Markdown Processor

- Handles Markdown document processing
- No additional configuration required
- Automatically extracts text and maintains document structure

### Docx Processor

The Docx processor provides detailed control over how Microsoft Word (`.docx`) documents are handled, through the
following configuration options:

1. **Extract Comments**
   - Whether to include document comments in the generated Markdown
   - Default: False
2. **Extract Headers/Footers**
   - Whether to include headers and footers in the generated Markdown
   - Default: False
3. **Preserve Formatting**
   - Whether to preserve bold, italic, and other text formatting
   - Default: True
4. **List Style**
   - How to format lists
   - One of ["dash", "asterisk", "plus"]
   - Default value: "dash"
5. **Image Handling**
   - How to handle embedded images
   - One of ["skip", "extract", "placeholder"]
   - Default value: "skip"
6. **Table Alignment**
   - How to align table contents
   - One of ["left", "center", "preserve"]
   - Default value: "left"

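
The options and defaults above can be captured in a small validated structure. The allowed values and defaults are taken from the list; the snake_case key names and the `DocxOptions` class are hypothetical, since the actual configuration keys are not given here.

```python
from dataclasses import dataclass

LIST_STYLES = ("dash", "asterisk", "plus")
IMAGE_HANDLING = ("skip", "extract", "placeholder")
TABLE_ALIGNMENT = ("left", "center", "preserve")


@dataclass(frozen=True)
class DocxOptions:
    # Key names are assumptions; defaults follow the documented values.
    extract_comments: bool = False
    extract_headers_footers: bool = False
    preserve_formatting: bool = True
    list_style: str = "dash"
    image_handling: str = "skip"
    table_alignment: str = "left"

    def __post_init__(self) -> None:
        # Reject values outside the documented choices.
        checks = (
            ("list_style", LIST_STYLES),
            ("image_handling", IMAGE_HANDLING),
            ("table_alignment", TABLE_ALIGNMENT),
        )
        for name, allowed in checks:
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
```

Validating at construction time means a typo such as `list_style="dashes"` fails immediately instead of surfacing later during document processing.
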

### Audio Processor

- Supports MP3, MP4, and Ogg formats
- Automatically transcribes audio content
- No additional configuration required

## Managing Processors

### Creating Processors

When creating a processor, consider:

1. Which catalog it will serve
2. What file type it needs to handle
3. Whether you need a sub-file type
4. Any specific configuration needed (especially for HTML)

### Modifying Processors

- Processors can be modified after creation
- Changes don't automatically apply to already processed documents
- Documents must be manually reprocessed to apply new processor settings

### Processor Tuning

The tuning feature is available for implementation teams to:

- Generate additional processing logs
- Gain insights into exact processing behavior
- Fine-tune processor behavior for specific contexts

Note: Tuning logs are only accessible to implementation teams, not end users.

## Best Practices

1. **Plan Your Structure**
   - Define clear sub-file types based on document characteristics
   - Consider document structure when configuring HTML processors

2. **HTML Processing**
   - Start with broader included elements and narrow down
   - Test your configuration with representative documents
   - Use excluded classes for fine-grained control

3. **General Tips**
   - Keep file sizes under the 50 MB limit
   - Document your processor configurations
   - Plan for periodic review of processor settings

## Future Extensions

The processor framework is designed for extensibility:

- Additional processor types are planned
- Metadata fields are in place for future features
- Configuration options may be expanded