- eveai_chat_client updated to retrieve static files from the correct (bunny.net) location when a STATIC_URL is defined.

- Defined explicit storage locations for CrewAI crew memory; the defaults failed in k8s.
- Redis connection for pub/sub in ExecutionProgressTracker adapted to support TLS-enabled connections.
Josako
2025-09-12 10:18:43 +02:00
parent a325fa5084
commit 42cb1de0fd
15 changed files with 306 additions and 50 deletions


@@ -612,6 +612,12 @@ kubectl -n eveai-staging get jobs
kubectl -n eveai-staging logs job/<created-job-name>
```
#### Creating a volume for eveai_chat_worker's CrewAI storage
```bash
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
```
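For reference, a minimal sketch of what such a PVC manifest might contain; the access mode, size, and storage class are assumptions, and the repo's pvc.yaml is authoritative. The PVC name matches the verification commands in the addendum below:
```yaml
# Hypothetical sketch of pvc.yaml (applied with -n eveai-staging).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eveai-chat-workers-logs   # name as used by the verification commands below
spec:
  accessModes:
    - ReadWriteOnce               # assumed: single-node read-write is sufficient
  resources:
    requests:
      storage: 10Gi               # assumed size; adjust to expected storage volume
```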
#### Application Services Deployment
Use the staging overlay to deploy apps with registry rewrite and imagePullSecrets:
```bash
@@ -861,3 +867,63 @@ curl https://evie-staging.askeveai.com/verify/
## EveAI Chat Workers: Persistent logs storage and Celery process behavior
This addendum describes how to enable persistent storage for CrewAI tuning runs under /app/logs for the eveai-chat-workers Deployment and clarifies Celery process behavior relevant to environment variables.
### Celery prefork behavior and env variables
- Pool: prefork (default). Each worker process (child) handles multiple tasks sequentially.
- Implication: any environment variable changed inside a child process persists for subsequent tasks handled by that same child, until it is changed again or the process is recycled.
- Our practice: set required env vars (e.g., CREWAI_STORAGE_DIR/CREWAI_STORAGE_PATH) immediately before initializing CrewAI and restore them immediately after. This prevents leakage to the next task in the same process.
- CELERY_MAX_TASKS_PER_CHILD: the number of tasks a child processes before being recycled. Suggested starting range for heavy LLM/RAG workloads: 200-500; 1000 is acceptable if memory growth is stable. Monitor RSS and adjust.
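As a hedged illustration, these knobs can be surfaced as container environment variables on the worker Deployment; the values below are starting points derived from the bullets above, not repo settings:
```yaml
# Hypothetical env excerpt for the eveai-chat-workers container.
env:
  - name: CELERY_CONCURRENCY
    value: "4"      # number of prefork child processes
  - name: CELERY_MAX_TASKS_PER_CHILD
    value: "500"    # recycle each child after 500 tasks; start in the 200-500 range
```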
### Create and mount a PersistentVolumeClaim for /app/logs
We persist tuning outputs under /app/logs by mounting a PVC in the worker pod.
Manifests added/updated (namespace: eveai-staging):
- scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
- scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml (volume mount added)
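As a sketch of the Deployment change (the volume and container names are assumptions; the repo manifest is authoritative):
```yaml
# Hypothetical excerpt from deployment.yaml: mount the logs PVC at /app/logs.
spec:
  template:
    spec:
      containers:
        - name: eveai-chat-workers
          volumeMounts:
            - name: logs
              mountPath: /app/logs
      volumes:
        - name: logs
          persistentVolumeClaim:
            claimName: eveai-chat-workers-logs
```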
Apply with kubectl (no Kustomize required):
```bash
# Create or update the PVC for logs
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
# Update the Deployment to mount the PVC at /app/logs
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml
```
Verify PVC is bound and the pod mounts the volume:
```bash
# Check PVC status
kubectl get pvc -n eveai-staging eveai-chat-workers-logs -o wide
# Inspect the pod to confirm the volume mount
kubectl get pods -n eveai-staging -l app=eveai-chat-workers -o name
kubectl describe pod -n eveai-staging <pod-name>
# (Optional) Exec into the pod to check permissions and path
kubectl exec -n eveai-staging -it <pod-name> -- sh -lc 'id; ls -ld /app/logs'
```
Permissions and securityContext notes:
- The container runs as a non-root user (appuser) per Dockerfile.base. Some storage classes mount volumes owned by root. If you encounter permission errors (EACCES) writing to /app/logs, use one of the options sketched below this list:
  - Option A: set a pod-level fsGroup so the mounted volume is group-writable by the container user.
  - Option B: use an initContainer to chown/chmod /app/logs on the mounted volume.
- Keep monitoring PVC usage and set alerts to avoid running out of space.
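Hedged sketches of both options; the UID/GID values are assumptions, so verify the appuser IDs in Dockerfile.base first:
```yaml
# Option A (hypothetical): pod-level fsGroup so the kubelet makes the mounted
# volume group-writable by the container user.
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000   # assumed GID of appuser
---
# Option B (hypothetical): initContainer that chowns /app/logs before the app starts.
spec:
  template:
    spec:
      initContainers:
        - name: fix-logs-permissions
          image: busybox:1.36
          command: ["sh", "-c", "chown -R 1000:1000 /app/logs"]   # assumed UID:GID of appuser
          volumeMounts:
            - name: logs
              mountPath: /app/logs
```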
Retention / cleanup recommendation:
- For a 14-day retention, create a CronJob that runs daily to remove files older than 14 days and then delete empty directories, mounting the same PVC at /app/logs. Example command:
```bash
# Delete files older than 14 days, then prune directories left empty
find /app/logs -type f -mtime +14 -print -delete
find /app/logs -type d -empty -mtime +14 -print -delete
```
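A hedged sketch of such a CronJob; the name, schedule, and image are assumptions, while the claimName matches the PVC verified above:
```yaml
# Hypothetical CronJob enforcing 14-day retention on the logs PVC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: eveai-chat-workers-logs-cleanup
  namespace: eveai-staging
spec:
  schedule: "30 2 * * *"          # assumed: daily at 02:30
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - >-
                  find /app/logs -type f -mtime +14 -print -delete;
                  find /app/logs -type d -empty -mtime +14 -print -delete
              volumeMounts:
                - name: logs
                  mountPath: /app/logs
          volumes:
            - name: logs
              persistentVolumeClaim:
                claimName: eveai-chat-workers-logs   # PVC from the steps above
```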
Operational checks after deployment:
1) Trigger a CrewAI tuning run; verify files appear under /app/logs and remain after pod restarts.
2) Trigger a non-tuning run; verify temporary directories are created and cleaned up automatically.
3) Monitor memory while varying CELERY_CONCURRENCY and CELERY_MAX_TASKS_PER_CHILD.