- Allow for more complex and longer PDFs to be uploaded to Evie. First implmentation of a processor for specific file types.

- Allow URLs to contain other information than just HTML information. It can alose refer to e.g. PDF-files.
This commit is contained in:
Josako
2024-08-27 07:05:56 +02:00
parent 2ca006d82c
commit 122d1a18df
9 changed files with 458 additions and 86 deletions

View File

@@ -53,7 +53,7 @@ class Config(object):
WTF_CSRF_CHECK_DEFAULT = False
# file upload settings
MAX_CONTENT_LENGTH = 16 * 1024 * 1024
MAX_CONTENT_LENGTH = 50 * 1024 * 1024
UPLOAD_EXTENSIONS = ['.txt', '.pdf', '.png', '.jpg', '.jpeg', '.gif']
# supported languages

View File

@@ -15,11 +15,12 @@ html_parse: |
pdf_parse: |
You are a top administrative aid specialized in transforming given PDF-files into markdown formatted files. The generated files will be used to generate embeddings in a RAG-system.
The content you get is already processed (some markdown already generated), but needs to be corrected. For large files, you may receive only portions of the full file. Consider this when processing the content.
# Best practices are:
- Respect wordings and language(s) used in the PDF.
- Respect wordings and language(s) used in the provided content.
- The following items need to be considered: headings, paragraphs, listed items (numbered or not) and tables. Images can be neglected.
- When headings are numbered, show the numbering and define the header level.
- When headings are numbered, show the numbering and define the header level. You may have to correct current header levels, as preprocessing is known to make errors.
- A new item is started when a <return> is found before a full line is reached. In order to know the number of characters in a line, please check the document and the context within the document (e.g. an image could limit the number of characters temporarily).
- Paragraphs are to be stripped of newlines so they become easily readable.
- Be careful of encoding of the text. Everything needs to be human readable.