---
id: processors
title: Understanding Processors
description: Learn how processors handle different types of files in Evie's Library
sidebar_label: Processors
sidebar_position: 2
---

# Understanding Processors

## Overview

Processors are the components in Evie's Library that determine how different types of files (such as HTML, PDF, or
audio) are processed and stored. Each processor is configured for a specific catalog and handles one or more file types,
ensuring that content is properly extracted, chunked, and stored according to your business needs.

The terms *chunk* and *chunking* are used throughout the remainder of this chapter. Chunking is the process of
splitting large files into smaller pieces for storage. This lets Evie capture the semantics of each piece more
precisely and return smaller, more specific pieces of content in response to a question.

Evie uses proprietary algorithms to process and chunk files intelligently, optimizing the knowledge in her Library.

The relationship between Catalogs and Processors is shown in this diagram:

```mermaid
classDiagram
    class Catalog {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +min_chunk_size: Integer
        +max_chunk_size: Integer
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }
    class Processor {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +sub_file_type: String
        +tuning: Boolean
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }

    Catalog "1" -- "*" Processor : has
    note for Processor "Unique per catalog\nConfigured for specific file types"
```

## Key Concepts

### Processor Basics

- **Catalog-Specific**: Each processor is defined for exactly one catalog and cannot be shared across catalogs
- **File Type Support**: Processors currently exist for HTML, PDF, Markdown, Docx, and audio files
- **Optional Usage**: If no processor is specified for a file type, files of that type will not be processed
- **File Size Limits**: Maximum file size of 50 MB for all file types

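
The two rules above (no processor means no processing, and the 50 MB limit) can be sketched as a simple pre-check. This is a minimal illustration, not Evie's implementation: the function name `can_process`, the lowercase file type labels, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are all assumptions made for the example.

```python
# Assumption: "50 MB" is read as 50 * 1024 * 1024 bytes; the real limit may be defined differently.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def can_process(file_type: str, size_bytes: int, configured_types: set[str]) -> bool:
    """Return True if a file of this type and size would be picked up (hypothetical helper)."""
    # Optional usage: without a processor for this file type, the file is not processed.
    if file_type not in configured_types:
        return False
    # The size limit applies to every file type.
    return size_bytes <= MAX_FILE_SIZE_BYTES


# File types that have a processor in a hypothetical catalog.
configured = {"html", "pdf"}

print(can_process("html", 2_000_000, configured))         # small HTML file with a processor
print(can_process("audio", 2_000_000, configured))        # no audio processor configured
print(can_process("pdf", 60 * 1024 * 1024, configured))   # PDF above the size limit
```

Only the first call returns `True`; the other two files would be skipped, one for lack of a processor and one for exceeding the size limit.
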
### Sub-File Types

Sub-file types allow you to process different document types within the same catalog using different rules. For example:
- Quarterly reports might have a different structure than project definitions
- News articles might need different processing than product documentation
- Internal memos might require different handling than public announcements

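
As an illustration of how sub-file types separate processing rules within one catalog, the sketch below selects a processor by the pair (file type, sub-file type). The `type` and `sub_file_type` fields come from the diagram in the Overview; the `find_processor` helper, the exact-match lookup, and the sample values are assumptions for this example and may differ from how Evie resolves processors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Processor:
    name: str
    type: str                 # file type, for example "pdf"
    sub_file_type: str = ""   # empty string stands for "no sub-file type" in this sketch


def find_processor(processors: list[Processor], file_type: str,
                   sub_file_type: str = "") -> Optional[Processor]:
    """Return the processor matching both values, or None (assumed exact-match lookup)."""
    for processor in processors:
        if processor.type == file_type and processor.sub_file_type == sub_file_type:
            return processor
    return None


# Hypothetical processors defined for a single catalog.
catalog_processors = [
    Processor("Quarterly reports", "pdf", "quarterly_report"),
    Processor("Project definitions", "pdf", "project_definition"),
]

match = find_processor(catalog_processors, "pdf", "quarterly_report")
print(match.name if match else "not processed")

missing = find_processor(catalog_processors, "pdf", "internal_memo")
print(missing.name if missing else "not processed")
```
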
### General Configuration Options
General configuration options, together with processor-specific options, let you, the end user, tailor the way
documents are processed and chunked. This combines your knowledge of document structure with ours, so that knowledge
is stored optimally in Evie's Library.

These general options are available in the configuration of most processors and introduce additional rules for
chunking.

1. **Chunking Heading Level**
   - Defines how many heading levels, counted from the top level, are used to split a file into meaningful chunks
   - Default value: 2
   - You can specify a value between 1 and 6
2. **Chunking Patterns**
   - Defines 'business sections' as regular expressions, allowing you to force a split whenever a section heading matches a given regular expression
   - A practical example: `\bProfile\b` checks whether the word 'Profile' is present in the title. This is useful if you have a lot of structured content describing your customers: each profile is then placed in a separate chunk.
   - You can specify any number of patterns, one per line

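
To make the two options concrete, here is a minimal sketch of heading-based splitting on Markdown-style text. It is not Evie's proprietary algorithm. It assumes that the heading level means "split at every heading of this level or higher", that a heading matching any pattern forces a split regardless of its level, and that patterns arrive as one regular expression per line. The function name `split_into_chunks` is hypothetical.

```python
import re

# Matches Markdown headings such as "## Title" and captures the hashes and the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_into_chunks(text: str, heading_level: int = 2, patterns: str = "") -> list[str]:
    """Split text at headings up to heading_level, or at headings matching a pattern."""
    compiled = [re.compile(p) for p in patterns.splitlines() if p.strip()]
    chunks: list[list[str]] = [[]]
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            forced = any(p.search(title) for p in compiled)
            # Start a new chunk unless the current one is still empty.
            if (level <= heading_level or forced) and chunks[-1]:
                chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks if chunk]


sample = """# Customers
Intro text.
## Acme Corp
### Profile
Acme builds rockets.
### Contacts
Jane Doe.
"""

chunks = split_into_chunks(sample, heading_level=2, patterns=r"\bProfile\b")
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

With the default heading level alone, the sample splits into two chunks. Adding the `\bProfile\b` pattern forces a third chunk that starts at the level 3 "Profile" heading, while the "Contacts" heading stays in that same chunk because it matches neither rule.
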
## Types of Processors

### HTML Processor
The HTML processor provides detailed control over how HTML content is processed through various configuration options.
These options are applied in a strict order:

1. **Included Elements** (`html_included_elements`)
   - Defines the main sections of the document to process
   - Only content within these elements will be considered
   - Example: `article, main`

2. **Excluded Elements** (`html_excluded_elements`)
   - Elements to remove from included sections
   - Useful for navigation, headers, footers, etc.
   - Example: `header, footer, nav, script`

3. **HTML Tags** (`html_tags`)
   - Specific content tags to process
   - Determines what content is actually stored
   - Example: `p, h1, h2, h3, table`

4. **Excluded Classes** (`html_excluded_classes`)
   - Classes of elements to exclude
   - Useful for removing specific sections like sidebars
   - Example: `sidebar, advertisement`

This order is shown graphically in the following process flow:

```mermaid
flowchart TD
    A[HTML Document] --> B[Check Included Elements]
    B --> C[Remove Excluded Elements]
    C --> D[Process HTML Tags]
    D --> E[Remove Excluded Classes]

    style A fill:#9c2d66,stroke:#333,stroke-width:2px
    style B fill:#423372,stroke:#333,stroke-width:2px
    style C fill:#423372,stroke:#333,stroke-width:2px
    style D fill:#423372,stroke:#333,stroke-width:2px
    style E fill:#423372,stroke:#333,stroke-width:2px

    note1[Only process content within<br>included elements]
    note2[Remove specified elements<br>from included content]
    note3[Include only specified<br>HTML tags]
    note4[Remove content with<br>excluded classes]

    B --- note1
    C --- note2
    D --- note3
    E --- note4
```

Additional HTML processing settings:
- **End Tags** (`html_end_tags`): Defines where chunks can logically end
- **Chunk Sizing**: Respects the catalog's min/max chunk sizes while maintaining content coherence

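
The four filtering options can be approximated with Python's standard `html.parser` module. The sketch below is only an illustration of the combined effect of the options, using the example values from this section. It evaluates all four rules together for each piece of text rather than in four separate passes, and the `HtmlFilter` class and its attributes are hypothetical names, not part of Evie's Library.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are ignored when tracking open elements.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlFilter(HTMLParser):
    """Collects (tag, text) pairs that pass the four HTML filtering options."""

    def __init__(self, included, excluded, tags, excluded_classes):
        super().__init__()
        self.included, self.excluded = set(included), set(excluded)
        self.tags, self.excluded_classes = set(tags), set(excluded_classes)
        self.stack = []   # one (tag, is_dropped) pair per currently open element
        self.kept = []    # text that survives all rules

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Excluded elements and excluded classes both drop the whole subtree.
        dropped = tag in self.excluded or bool(classes & self.excluded_classes)
        self.stack.append((tag, dropped))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Close the nearest matching open element (tolerates unclosed children).
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        open_tags = [tag for tag, _ in self.stack]
        in_included = any(tag in self.included for tag in open_tags)
        is_dropped = any(dropped for _, dropped in self.stack)
        content_tag = next((t for t in reversed(open_tags) if t in self.tags), None)
        if in_included and not is_dropped and content_tag:
            self.kept.append((content_tag, text))


page = """
<html><body>
<nav><p>Site menu</p></nav>
<main>
  <h1>Release notes</h1>
  <header><p>Breadcrumb</p></header>
  <p>Version 2 adds chunking patterns.</p>
  <div class="sidebar"><p>Related links</p></div>
  <span>Loose text</span>
</main>
</body></html>
"""

html_filter = HtmlFilter(
    included=["article", "main"],
    excluded=["header", "footer", "nav", "script"],
    tags=["p", "h1", "h2", "h3", "table"],
    excluded_classes=["sidebar", "advertisement"],
)
html_filter.feed(page)
print(html_filter.kept)
```

In this sample only the `h1` and the plain paragraph inside `main` are kept. The menu sits outside the included elements, the breadcrumb is inside an excluded element, the related links carry an excluded class, and the `span` is not one of the configured content tags.
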
### PDF Processor
- Handles PDF document processing
- No additional configuration required
- Automatically extracts text and maintains document structure

### Markdown Processor
- Handles Markdown document processing
- No additional configuration required
- Automatically extracts text and maintains document structure

### Docx Processor
The Docx processor provides detailed control over how Microsoft Word (`.docx`) documents are handled through various
configuration options:

1. **Extract Comments**
   - Whether to include document comments in the Markdown output
   - Default: False
2. **Extract Headers/Footers**
   - Whether to include headers and footers in the Markdown output
   - Default: False
3. **Preserve Formatting**
   - Whether to preserve bold, italic, and other text formatting
   - Default: True
4. **List Style**
   - How to format lists
   - One of ["dash", "asterisk", "plus"]
   - Default: "dash"
5. **Image Handling**
   - How to handle embedded images
   - One of ["skip", "extract", "placeholder"]
   - Default: "skip"
6. **Table Alignment**
   - How to align table contents
   - One of ["left", "center", "preserve"]
   - Default: "left"

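
The defaults and allowed values listed above can be summarized as a small validation sketch. The option keys used here (such as `extract_comments` and `list_style`) are hypothetical snake_case spellings of the option names, and `build_docx_config` is an illustrative helper. Check the actual configuration screen or API for the real identifiers.

```python
# Hypothetical keys; the defaults are the ones documented in the list above.
DOCX_DEFAULTS = {
    "extract_comments": False,
    "extract_headers_footers": False,
    "preserve_formatting": True,
    "list_style": "dash",
    "image_handling": "skip",
    "table_alignment": "left",
}

# Options restricted to a fixed set of values.
ALLOWED_VALUES = {
    "list_style": {"dash", "asterisk", "plus"},
    "image_handling": {"skip", "extract", "placeholder"},
    "table_alignment": {"left", "center", "preserve"},
}


def build_docx_config(overrides: dict) -> dict:
    """Merge overrides into the defaults and reject unknown keys or invalid values."""
    unknown = set(overrides) - set(DOCX_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    config = {**DOCX_DEFAULTS, **overrides}
    for key, value in config.items():
        if key in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[key]:
                raise ValueError(f"{key} must be one of {sorted(ALLOWED_VALUES[key])}")
        elif not isinstance(value, bool):
            raise ValueError(f"{key} must be True or False")
    return config


config = build_docx_config({"list_style": "asterisk", "extract_comments": True})
print(config)
```

Options that are not overridden keep their documented defaults, so the resulting configuration still skips images and left-aligns tables.
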
### Audio Processor
- Supports mp3, mp4, and ogg formats
- Automatically transcribes audio content
- No additional configuration required

## Managing Processors

### Creating Processors

When creating a processor, consider:
1. Which catalog it will serve
2. What file type it needs to handle
3. Whether you need a sub-file type
4. Any specific configuration needed (especially for HTML)

### Modifying Processors

- Processors can be modified after creation
- Changes don't automatically apply to already processed documents
- Documents must be manually reprocessed to apply new processor settings

### Processor Tuning

The tuning feature is available for implementation teams to:
- Generate additional processing logs
- Gain insights into exact processing behavior
- Fine-tune processor behavior for specific contexts

Note: Tuning logs are only accessible to implementation teams, not end users.

## Best Practices

1. **Plan Your Structure**
   - Define clear sub-file types based on document characteristics
   - Consider document structure when configuring HTML processors

2. **HTML Processing**
   - Start with broader included elements and narrow down
   - Test configuration with representative documents
   - Use excluded classes for fine-grained control

3. **General Tips**
   - Keep file sizes under the 50 MB limit
   - Document your processor configurations
   - Plan for periodic review of processor settings

## Future Extensions

The processor framework is designed for extensibility:
- Additional processor types are planned
- Metadata fields are ready for future features
- Configuration options may be expanded