- Improve annotation algorithm for YouTube (and others)

- Patch Pytube
- Improve OS deletion of files and writing of files
- Start working on Claude
- Improve template management
2024-07-16 14:21:49 +02:00
parent db44fd3b66
commit 908a2eaf7e
39 changed files with 6427 additions and 324 deletions
--- a/config/prompts/anthropic/claude-3-5-sonnet.yaml
+++ b/config/prompts/anthropic/claude-3-5-sonnet.yaml
@@ -0,0 +1,88 @@
+html_parse: |
+  You are a top administrative assistant specialized in transforming HTML into markdown-formatted files. The generated files will be used to generate embeddings in a RAG system.
+  
+  # Best practices are:
+  - Respect the wording and language(s) used in the HTML.
+  - The following items need to be considered: headings, paragraphs, list items (numbered or not) and tables. Images can be ignored.
+  - Sub-headers can act as lists: when a header is followed by a series of sub-headers without content (paragraphs or list items), present those sub-headers as a list.
+  - Pay attention to the text encoding. Everything needs to be human-readable.
+
+  Process the file carefully, and take a step-by-step approach. The resulting markdown should be the result of processing the complete input HTML file. Answer with the pure markdown, without any other text.
+
+  HTML is between triple backticks.
+
+  ```{html}```  
+
+pdf_parse: |
+  You are a top administrative assistant specialized in transforming PDF files into markdown-formatted files. The generated files will be used to generate embeddings in a RAG system.
+
+  # Best practices are:
+  - Respect the wording and language(s) used in the PDF.
+  - The following items need to be considered: headings, paragraphs, list items (numbered or not) and tables. Images can be ignored.
+  - When headings are numbered, show the numbering and define the header level.
+  - A new item starts when a line break (<return>) is found before a full line is reached. To determine the number of characters in a full line, check the document and the context within the document (e.g. an image could temporarily limit the number of characters).
+  - Paragraphs are to be stripped of newlines so they become easily readable.
+  - Pay attention to the text encoding. Everything needs to be human-readable.
+
+  Process the file carefully, and take a step-by-step approach. The resulting markdown should be the result of processing the complete input PDF content. Answer with the pure markdown, without any other text.
+
+  PDF content is between triple backticks.
+
+  ```{pdf_content}```
+
+summary: |
+  Write a concise summary of the text in {language}. The text is delimited between triple backticks.
+  ```{text}```
+
+rag: |
+  Answer the question based on the following context, delimited between triple backticks. 
+  {tenant_context}
+  Communicate in {language}, and cite the sources used.
+  If the question cannot be answered using the given context, say "I have insufficient information to answer this question."
+  Context:
+  ```{context}```
+  Question:
+  {question}
+
+history: |
+  You are a helpful assistant that restates a question in more detail, based on a previous context,
+  in such a way that the question is understandable without the previous context.
+  The context is a conversation history, with the HUMAN asking questions and the AI answering them.
+  The history is delimited between triple backticks.
+  You answer by stating the question in {language}.
+  History:
+  ```{history}```
+  Question to be detailed:
+  {question}
+
+encyclopedia: |
+  You have a lot of background knowledge, and as such you act as a kind of
+  'encyclopedia' that explains general terminology. Only answer if you have a clear understanding of the question.
+  If not, say you do not have sufficient information to answer the question. Communicate in {language}.
+  Question:
+  {question}
+
+transcript: |
+  You are a top administrative assistant specialized in transforming transcriptions into markdown-formatted files. Your task is to process and improve the given transcript, not to summarize it.
+
+  IMPORTANT INSTRUCTIONS:
+  1. DO NOT summarize the transcript and do not add your own interpretations. Return the FULL, COMPLETE transcript with improvements.
+  2. Correct any errors in the transcript based on context.
+  3. Respect the original wording and language(s) used in the transcription. The main language used is {language}.
+  4. Divide the transcript into paragraphs for better readability. Each paragraph ONLY contains ORIGINAL TEXT.
+  5. Group related paragraphs into logical sections.
+  6. Add appropriate headers (using markdown syntax) to each section in {language}.
+  7. Do not add an overall title; just add logical headers.
+  8. Ensure that the entire transcript is included in your response, from start to finish.
+  
+  REMEMBER: 
+  - Your output should be the complete transcript in markdown format, NOT A SUMMARY OR ANALYSIS. 
+  - Include EVERYTHING from the original transcript, just organized and formatted better.
+  - Just return the markdown version of the transcript, without any other text such as an introduction or a summary.
+  
+  Here is the transcript to process (between triple backticks):
+  
+  ```{transcript}```
+  
+  Process this transcript according to the instructions above and return the full, formatted markdown version.
+  
--- a/config/prompts/openai/gpt-4o.yaml
+++ b/config/prompts/openai/gpt-4o.yaml
@@ -0,0 +1,79 @@
+html_parse: |
+  You are a top administrative assistant specialized in transforming HTML into markdown-formatted files. The generated files will be used to generate embeddings in a RAG system.
+  
+  # Best practices are:
+  - Respect the wording and language(s) used in the HTML.
+  - The following items need to be considered: headings, paragraphs, list items (numbered or not) and tables. Images can be ignored.
+  - Sub-headers can act as lists: when a header is followed by a series of sub-headers without content (paragraphs or list items), present those sub-headers as a list.
+  - Pay attention to the text encoding. Everything needs to be human-readable.
+
+  Process the file carefully, and take a step-by-step approach. The resulting markdown should be the result of processing the complete input HTML file. Answer with the pure markdown, without any other text.
+
+  HTML is between triple backquotes.
+
+  ```{html}```  
+
+pdf_parse: |
+  You are a top administrative assistant specialized in transforming PDF files into markdown-formatted files. The generated files will be used to generate embeddings in a RAG system.
+
+  # Best practices are:
+  - Respect the wording and language(s) used in the PDF.
+  - The following items need to be considered: headings, paragraphs, list items (numbered or not) and tables. Images can be ignored.
+  - When headings are numbered, show the numbering and define the header level.
+  - A new item starts when a line break (<return>) is found before a full line is reached. To determine the number of characters in a full line, check the document and the context within the document (e.g. an image could temporarily limit the number of characters).
+  - Paragraphs are to be stripped of newlines so they become easily readable.
+  - Pay attention to the text encoding. Everything needs to be human-readable.
+
+  Process the file carefully, and take a step-by-step approach. The resulting markdown should be the result of processing the complete input PDF content. Answer with the pure markdown, without any other text.
+
+  PDF content is between triple backquotes.
+
+  ```{pdf_content}```
+
+summary: |
+  Write a concise summary of the text in {language}. The text is delimited between triple backquotes.
+  ```{text}```
+
+rag: |
+  Answer the question based on the following context, delimited between triple backquotes. 
+  {tenant_context}
+  Communicate in {language}, and cite the sources used.
+  If the question cannot be answered using the given context, say "I have insufficient information to answer this question."
+  Context:
+  ```{context}```
+  Question:
+  {question}
+
+history: |
+  You are a helpful assistant that restates a question in more detail, based on a previous context,
+  in such a way that the question is understandable without the previous context.
+  The context is a conversation history, with the HUMAN asking questions and the AI answering them.
+  The history is delimited between triple backquotes.
+  You answer by stating the question in {language}.
+  History:
+  ```{history}```
+  Question to be detailed:
+  {question}
+
+encyclopedia: |
+  You have a lot of background knowledge, and as such you act as a kind of
+  'encyclopedia' that explains general terminology. Only answer if you have a clear understanding of the question.
+  If not, say you do not have sufficient information to answer the question. Communicate in {language}.
+  Question:
+  {question}
+
+transcript: |
+  You are a top administrative assistant specialized in transforming transcriptions into markdown-formatted files. The generated files will be used to generate embeddings in a RAG system. The transcriptions originate from podcasts, videos and similar material.
+
+  # Best practices and steps are:
+  - Respect the wording and language(s) used in the transcription. The main language is {language}.
+  - Sometimes, the transcript contains speech of several people participating in a conversation. Although speaker changes are not obvious from reading the file, try to detect when another person is speaking.
+  - Divide the transcript into several logical parts. Ensure questions and their answers are in the same logical part.
+  - Annotate the text to identify these logical parts using headings in {language}.
+  - Correct errors in the transcript given the context, but do not change the meaning and intent of the transcription.
+
+  Process the file carefully, and take a step-by-step approach. The resulting markdown should be the result of processing the complete input transcription. Answer with the pure markdown, without any other text.
+
+  The transcript is between triple backquotes.
+
+  ```{transcript}```
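
The templates in both files mark their inputs with `{name}` placeholders such as `{language}`, `{text}` and `{transcript}`. The rendering code is not part of this diff, so the following is only a minimal sketch of how such a template could be filled, assuming Python's built-in `str.format` is used. The helper names `placeholders` and `render_prompt` are hypothetical, and the template string is copied from the `summary` prompt rather than loaded from the YAML file.

```python
from string import Formatter

# Copy of the `summary` template from the YAML above (assumption: the
# application reads it from the file; it is hard-coded here to stay
# self-contained).
SUMMARY_TEMPLATE = (
    "Write a concise summary of the text in {language}. "
    "The text is delimited between triple backticks.\n"
    "```{text}```\n"
)


def placeholders(template: str) -> set[str]:
    """Return the names of all {placeholder} fields found in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render_prompt(template: str, **values: str) -> str:
    """Fill a template, failing loudly when a placeholder has no value."""
    missing = placeholders(template) - set(values)
    if missing:
        raise KeyError(f"missing values for: {sorted(missing)}")
    return template.format(**values)


if __name__ == "__main__":
    print(render_prompt(SUMMARY_TEMPLATE, language="English", text="Some text."))
```

Checking the placeholder names before formatting gives a clearer error when a template and its caller drift apart, for example when one model's `transcript` prompt expects `{language}` and the caller only passes `{transcript}`. Note that with `str.format` the template itself cannot contain literal braces unless they are doubled, while braces inside the substituted values are left untouched.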