eveai_docs/docs/Library/processors.md
2025-12-11 14:43:16 +01:00

id: processors
title: Understanding Processors
description: Learn how processors handle different types of files in Evie's Library
sidebar_label: Processors
sidebar_position: 2

Understanding Processors

Overview

Processors are the components in Evie's Library that determine how different types of files (HTML, PDF, Markdown, Docx, and audio) are processed and stored. Each processor is configured for one specific catalog and handles one or more file types, ensuring that content is extracted, chunked, and stored according to your business needs.

The terms chunk and chunking are used throughout the rest of this chapter. Chunking is the process of splitting large files into smaller pieces for storage. It lets Evie capture the semantics of each piece more precisely and return smaller, more specific pieces of content in response to a question.

Evie uses proprietary algorithms to process and chunk files, so that the knowledge in her Library is stored as effectively as possible.

The relationship between Catalogs and Processors is shown in this diagram:

classDiagram
    class Catalog {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +min_chunk_size: Integer
        +max_chunk_size: Integer
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }
    class Processor {
        +id: Integer
        +name: String
        +description: Text
        +type: String
        +sub_file_type: String
        +tuning: Boolean
        +user_metadata: JSON
        +system_metadata: JSON
        +configuration: JSON
    }
    
    Catalog "1" -- "*" Processor : has
    note for Processor "Unique per catalog\nConfigured for specific file types"
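
As a rough illustration of the diagram, the one-to-many relationship could be modelled as below. This is a minimal sketch, not Evie's actual data model: the field names come from the diagram, while the `add_processor` helper, the example values, and the rule that a catalog holds at most one processor per (type, sub-file type) pair are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Processor:
    id: int
    name: str
    type: str                       # file type handled, e.g. "HTML" (illustrative value)
    sub_file_type: str = ""         # empty string stands for "no sub-file type"
    description: str = ""
    tuning: bool = False
    user_metadata: dict[str, Any] = field(default_factory=dict)
    system_metadata: dict[str, Any] = field(default_factory=dict)
    configuration: dict[str, Any] = field(default_factory=dict)


@dataclass
class Catalog:
    id: int
    name: str
    type: str
    min_chunk_size: int
    max_chunk_size: int
    description: str = ""
    user_metadata: dict[str, Any] = field(default_factory=dict)
    system_metadata: dict[str, Any] = field(default_factory=dict)
    configuration: dict[str, Any] = field(default_factory=dict)
    processors: list[Processor] = field(default_factory=list)

    def add_processor(self, processor: Processor) -> None:
        """Attach a processor to this catalog.

        Assumed rule: at most one processor per (type, sub_file_type) pair.
        """
        key = (processor.type, processor.sub_file_type)
        if any((p.type, p.sub_file_type) == key for p in self.processors):
            raise ValueError(f"catalog {self.name!r} already has a processor for {key}")
        self.processors.append(processor)
```

Because each `Processor` instance lives in exactly one catalog's `processors` list, the sketch mirrors the "1 to many" association: a catalog has many processors, and a processor is never shared.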

Key Concepts

Processor Basics

  • Catalog-Specific: Each processor is uniquely defined for a specific catalog and cannot be shared across catalogs
  • File Type Support: Currently supports HTML, PDF, Markdown, Docx and audio files
  • Optional Usage: If no processor is specified for a file type, files of that type will not be processed
  • File Size Limits: Maximum file size of 50MB for all file types
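
The rules above can be read as a simple pre-check that runs before any processing starts. The sketch below is hypothetical: the function name, the return shape, and the interpretation of 50MB as 50 × 1024 × 1024 bytes are assumptions, not Evie's actual behaviour.

```python
# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def can_process(file_type: str, size_bytes: int, configured_types: set[str]) -> tuple[bool, str]:
    """Return (accepted, reason) for a file offered to a catalog.

    `configured_types` holds the file types for which the catalog has a processor.
    """
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False, "file exceeds the 50MB limit"
    if file_type not in configured_types:
        # No processor for this file type: the file is not processed.
        return False, f"no processor configured for {file_type}"
    return True, "ok"
```

For example, a catalog with only HTML and PDF processors would accept a 1 MB PDF but skip an audio file of any size.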

Sub-File Types

Sub-file types allow you to process different document types within the same catalog using different rules. For example:

  • Quarterly reports might have a different structure than project definitions
  • News articles might need different processing than product documentation
  • Internal memos might require different handling than public announcements
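
One way to picture sub-file types is as a second lookup key next to the file type. The sketch below is illustrative only: the processor names and sub-file type values are invented, and falling back to a processor without a sub-file type when no exact match exists is an assumption, not documented behaviour.

```python
from typing import Optional

# (file type, sub-file type) -> processor name. All entries are hypothetical.
PROCESSORS = {
    ("PDF", ""): "Generic PDF",
    ("PDF", "quarterly-report"): "Quarterly report PDF",
    ("HTML", "news"): "News article HTML",
}


def select_processor(file_type: str, sub_file_type: str = "") -> Optional[str]:
    """Pick the processor for a document, preferring an exact sub-file type match."""
    exact = PROCESSORS.get((file_type, sub_file_type))
    if exact is not None:
        return exact
    # Assumed fallback: use the processor that has no sub-file type, if any.
    return PROCESSORS.get((file_type, ""))
```

With this table, a quarterly report in PDF form gets its own rules, any other PDF uses the generic processor, and an HTML memo is not processed because no matching HTML processor exists.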

General Configuration Options

General configuration options, together with processor-specific options, let you, the end user, tailor how documents are processed and chunked. This combines your knowledge of document structure with ours, so that knowledge is stored optimally in Evie's Library.

These general options are available in the configuration of most processors and add extra rules that govern chunking.

  1. Chunking Heading Level
    • Defines how many heading levels are used to split a file into meaningful chunks
    • Default value: 2
    • You can specify a value between 1 and 6
  2. Chunking Patterns
    • Defines 'business sections' as regular expressions, allowing you to force a split whenever a section heading matches one of them
    • A practical example: '\bProfile\b' matches headings that contain the word 'Profile'. This could be useful if you have a lot of structured content describing your customers: each customer profile is then stored in a separate chunk.
    • You can specify any number of patterns, one per line
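
To make the two options concrete, here is a minimal sketch of how they might interact on Markdown-style text. It is not Evie's proprietary chunking algorithm: the function name is hypothetical, and the sketch assumes that a heading level of N means "start a new chunk at every heading of level N or higher in the hierarchy", while a heading that matches a pattern forces a new chunk regardless of its level.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_into_chunks(markdown: str, heading_level: int = 2, patterns: str = "") -> list[str]:
    """Split text into chunks at headings, honouring level and pattern rules."""
    if not 1 <= heading_level <= 6:
        raise ValueError("heading_level must be between 1 and 6")
    # One regular expression per non-empty line, as in the configuration field.
    compiled = [re.compile(line.strip()) for line in patterns.splitlines() if line.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            forced = any(pattern.search(title) for pattern in compiled)
            # Close the running chunk before a heading that starts a new one.
            if (level <= heading_level or forced) and current:
                chunks.append("\n".join(current).strip())
                current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

With the default heading level of 2, a level-3 "Profile" heading would normally stay inside its parent section. Adding the pattern `\bProfile\b` makes that heading start a chunk of its own.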

Types of Processors

HTML Processor

The HTML processor provides detailed control over how HTML content is processed through various configuration options. These options are applied in a fixed order:

  1. Included Elements (html_included_elements)

    • Defines the main sections of the document to process
    • Only content within these elements will be considered
    • Example: article, main
  2. Excluded Elements (html_excluded_elements)

    • Elements to remove from included sections
    • Useful for navigation, headers, footers, etc.
    • Example: header, footer, nav, script
  3. HTML Tags (html_tags)

    • Specific content tags to process
    • Determines what content is actually stored
    • Example: p, h1, h2, h3, table
  4. Excluded Classes (html_excluded_classes)

    • Classes of elements to exclude
    • Useful for removing specific sections like sidebars
    • Example: sidebar, advertisement

The following process flow shows these steps graphically:

flowchart TD
    A[HTML Document] --> B[Check Included Elements]
    B --> C[Remove Excluded Elements]
    C --> D[Process HTML Tags]
    D --> E[Remove Excluded Classes]
    
    style A fill:#9c2d66,stroke:#333,stroke-width:2px
    style B fill:#423372,stroke:#333,stroke-width:2px
    style C fill:#423372,stroke:#333,stroke-width:2px
    style D fill:#423372,stroke:#333,stroke-width:2px
    style E fill:#423372,stroke:#333,stroke-width:2px

    note1[Only process content within<br>included elements]
    note2[Remove specified elements<br>from included content]
    note3[Include only specified<br>HTML tags]
    note4[Remove content with<br>excluded classes]

    B --- note1
    C --- note2
    D --- note3
    E --- note4

Additional HTML processing settings:

  • End Tags (html_end_tags): Defines where chunks can logically end
  • Chunk Sizing: Respects catalog's min/max chunk sizes while maintaining content coherence
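
The four filtering steps can be approximated with Python's standard `html.parser` module. This is a simplified sketch under stated assumptions, not the HTML processor's actual implementation: the class and function names are hypothetical, all four checks are applied together per text node rather than as separate passes, and end tags and chunk sizing are left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as "open".
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class HtmlFilter(HTMLParser):
    def __init__(self, included, excluded, tags, excluded_classes):
        super().__init__(convert_charrefs=True)
        self.included, self.excluded = set(included), set(excluded)
        self.tags, self.excluded_classes = set(tags), set(excluded_classes)
        self.open = []      # one dict per currently open element
        self.results = []   # (tag, text) pairs, appended when the tag closes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.open.append({
            "tag": tag,
            "included": tag in self.included,
            "excluded": tag in self.excluded or bool(classes & self.excluded_classes),
            "text": [] if tag in self.tags else None,  # buffer only for content tags
        })

    def handle_endtag(self, tag):
        # Close the nearest matching element, together with any unclosed children.
        for index in range(len(self.open) - 1, -1, -1):
            if self.open[index]["tag"] == tag:
                closed = self.open[index:]
                del self.open[index:]
                for element in reversed(closed):
                    self._emit(element)
                return

    def handle_data(self, data):
        inside_included = any(e["included"] for e in self.open)
        inside_excluded = any(e["excluded"] for e in self.open)
        if not inside_included or inside_excluded:
            return
        # Store the text in the innermost open content tag, if there is one.
        for element in reversed(self.open):
            if element["text"] is not None:
                element["text"].append(data)
                return

    def _emit(self, element):
        if element["text"]:
            text = " ".join("".join(element["text"]).split())
            if text:
                self.results.append((element["tag"], text))


def extract(html, included, excluded, tags, excluded_classes):
    """Return the (tag, text) pairs that survive all four filters."""
    parser = HtmlFilter(included, excluded, tags, excluded_classes)
    parser.feed(html)
    parser.close()
    return parser.results
```

Called with `main` as the included element, `header` and `nav` as excluded elements, `p` and `h1` as HTML tags, and `sidebar` as an excluded class, the sketch keeps headings and paragraphs from the main content and drops navigation, banners, sidebars, and text in tags that are not listed.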

PDF Processor

  • Handles PDF document processing
  • No additional configuration required
  • Automatically extracts text and maintains document structure

Markdown Processor

  • Handles Markdown document processing
  • No additional configuration required
  • Automatically extracts text and maintains document structure

Docx Processor

The Docx processor provides detailed control over how Microsoft Word (.docx) documents are handled through various configuration options:

  1. Extract Comments
    • Whether to include document comments in the markdown
    • Default: False
  2. Extract Headers/Footers
    • Whether to include headers and footers in the markdown
    • Default: False
  3. Preserve Formatting
    • Whether to preserve bold, italic, and other text formatting
    • Default: True
  4. List Style
    • How to format lists
    • One of ["dash", "asterisk", "plus"]
    • Default value: "dash"
  5. Image Handling
    • How to handle embedded images
    • One of ["skip", "extract", "placeholder"]
    • Default value: "skip"
  6. Table Alignment
    • How to align table contents
    • One of ["left", "center", "preserve"]
    • Default value: "left"
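
The options and defaults above can be summarised as a small configuration object. The allowed values and defaults are taken from the list. The class name, the field names, and the validation helper are hypothetical, since the text only gives the display labels.

```python
from dataclasses import dataclass

LIST_STYLES = ("dash", "asterisk", "plus")
IMAGE_HANDLING = ("skip", "extract", "placeholder")
TABLE_ALIGNMENT = ("left", "center", "preserve")


@dataclass(frozen=True)
class DocxProcessorConfig:
    # Field names are assumptions; defaults mirror the documented ones.
    extract_comments: bool = False
    extract_headers_footers: bool = False
    preserve_formatting: bool = True
    list_style: str = "dash"
    image_handling: str = "skip"
    table_alignment: str = "left"

    def __post_init__(self) -> None:
        self._check("list_style", self.list_style, LIST_STYLES)
        self._check("image_handling", self.image_handling, IMAGE_HANDLING)
        self._check("table_alignment", self.table_alignment, TABLE_ALIGNMENT)

    @staticmethod
    def _check(name: str, value: str, allowed: tuple[str, ...]) -> None:
        """Reject values outside the documented set of choices."""
        if value not in allowed:
            raise ValueError(f"{name} must be one of {list(allowed)}, got {value!r}")
```

Creating `DocxProcessorConfig()` with no arguments yields the documented defaults, and an unsupported value such as `list_style="bullet"` is rejected immediately.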

Audio Processor

  • Supports mp3, mp4, and ogg formats
  • Automatically transcribes audio content
  • No additional configuration required

Managing Processors

Creating Processors

When creating a processor, consider:

  1. Which catalog it will serve
  2. What file type it needs to handle
  3. Whether you need a sub-file type
  4. Any specific configuration needed (especially for HTML)

Modifying Processors

  • Processors can be modified after creation
  • Changes don't automatically apply to already processed documents
  • Documents must be manually reprocessed to apply new processor settings
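
Because changes are not applied retroactively, it can help to work out which documents were processed before the last configuration change. The sketch below is purely illustrative: the `ProcessedDocument` type and its timestamps are hypothetical and do not describe fields that Evie is known to expose.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProcessedDocument:
    name: str
    processed_at: datetime   # hypothetical: when the document was last processed


def needs_reprocessing(documents: list[ProcessedDocument],
                       processor_modified_at: datetime) -> list[str]:
    """Return names of documents processed before the processor was last changed."""
    return [doc.name for doc in documents if doc.processed_at < processor_modified_at]
```

The resulting list is a candidate set for manual reprocessing. Documents processed after the change already reflect the new settings.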

Processor Tuning

The tuning feature is available for implementation teams to:

  • Generate additional processing logs
  • Gain insights into exact processing behavior
  • Fine-tune processor behavior for specific contexts

Note: Tuning logs are only accessible to implementation teams, not end users.

Best Practices

  1. Plan Your Structure

    • Define clear sub-file types based on document characteristics
    • Consider document structure when configuring HTML processors
  2. HTML Processing

    • Start with broader included elements and narrow down
    • Test configuration with representative documents
    • Use excluded classes for fine-grained control
  3. General Tips

    • Keep file sizes under the 50MB limit
    • Document your processor configurations
    • Plan for periodic review of processor settings

Future Extensions

The processor framework is designed for extensibility:

  • Additional processor types planned
  • Metadata fields ready for future features
  • Configuration options may be expanded