# EveAI Cluster Installation Guide (Updated for Modular Kustomize Setup)
## Prerequisites
### Required Tools
```bash
# Verify required tools are installed
kubectl version --client
kustomize version
helm version
# Configure kubectl for Scaleway cluster
scw k8s kubeconfig install <cluster-id>
kubectl cluster-info
```
### Scaleway Prerequisites
- Kubernetes cluster running
- Managed services configured (PostgreSQL, Redis, MinIO)
- Secrets stored in Scaleway Secret Manager:
- `eveai-app-keys`, `eveai-mistral`, `eveai-object-storage`, `eveai-tem`
- `eveai-openai`, `eveai-postgresql`, `eveai-redis`, `eveai-redis-certificate`
- Flexible IP address (LoadBalancer)
- First create a load balancer with a public IP
- Then delete the load balancer, but keep the flexible IP
- This external IP is the IP address that must be configured in ingress-values.yaml (see the sketch below)!
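
A minimal sketch of the relevant part of `ingress-values.yaml`, assuming the standard ingress-nginx chart values; the IP shown is a placeholder for your reserved flexible IP:

```yaml
# Hypothetical excerpt of scaleway/manifests/base/infrastructure/ingress-values.yaml
controller:
  service:
    type: LoadBalancer
    # Reserved Scaleway flexible IP goes here (placeholder value)
    loadBalancerIP: "51.159.x.x"
```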
## CDN Setup (Bunny.net - Optional)
### Configure Pull Zone
- Create Pull zone: evie-staging
- Origin: https://[LoadBalancer-IP] (note HTTPS!) -> only known later in the process
- Host header: evie-staging.askeveai.com
- Force SSL: Enabled
- In the pull zone's Caching - General settings, ensure to disable 'Strip Response Cookies'
- Define edge rules for:
- Redirecting the root
- Redirecting security URLs
### Update DNS (eurodns) for CDN
- Change A-record to CNAME pointing to CDN endpoint
- Or update A-record to CDN IP
## New Modular Deployment Process
### Phase 1: Infrastructure Foundation
Deploy core infrastructure components in the correct order:
```bash
# 1. Deploy namespaces
kubectl apply -f scaleway/manifests/base/infrastructure/00-namespaces.yaml
# 2. Add NGINX Ingress Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# 3. Deploy NGINX ingress controller via Helm
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--values scaleway/manifests/base/infrastructure/ingress-values.yaml
# 4. Wait for ingress controller to be ready
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=300s
# 5. Add cert-manager Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
# 6. Install cert-manager CRDs
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.3/cert-manager.crds.yaml
# 7. Deploy cert-manager via Helm
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--values scaleway/manifests/base/infrastructure/cert-manager-values.yaml
# 8. Wait for cert-manager to be ready
kubectl wait --namespace cert-manager \
--for=condition=ready pod \
--selector=app.kubernetes.io/name=cert-manager \
--timeout=300s
# 9. Deploy cluster issuers
kubectl apply -f scaleway/manifests/base/infrastructure/03-cluster-issuers.yaml
```
### Phase 2: Verify Infrastructure Components
```bash
# Verify ingress controller
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
# Verify cert-manager
kubectl get pods -n cert-manager
kubectl get clusterissuers
# Check LoadBalancer external IP
kubectl get svc -n ingress-nginx ingress-nginx-controller
```
### Phase 3: Monitoring Stack (Optional but Recommended)
#### Add Prometheus Community Helm Repository
```bash
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Verify chart availability
helm search repo prometheus-community/kube-prometheus-stack
```
#### Create Monitoring Values File
Create `scaleway/manifests/base/monitoring/prometheus-values.yaml`:
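A minimal sketch of such a values file, assuming the Grafana admin password (admin123) and the persistent volumes referenced later in this phase; adjust retention, storage class and sizes to your cluster:

```yaml
# Hypothetical minimal scaleway/manifests/base/monitoring/prometheus-values.yaml
grafana:
  adminPassword: admin123
  persistence:
    enabled: true
    size: 10Gi
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 5Gi
```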
#### Deploy Monitoring Stack
```bash
# Install complete monitoring stack via Helm
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--values scaleway/manifests/base/monitoring/prometheus-values.yaml
# Install pushgateway
helm install monitoring-pushgateway prometheus-community/prometheus-pushgateway \
-n monitoring --create-namespace \
--set serviceMonitor.enabled=true
# Monitor deployment progress
kubectl get pods -n monitoring -w
# Wait until all pods show STATUS: Running
```
#### Verify Monitoring Deployment
```bash
# Check Helm release
helm list -n monitoring
# Verify all components are running
kubectl get all -n monitoring
# Check persistent volumes are created
kubectl get pvc -n monitoring
# Check ServiceMonitor CRDs are available (for application monitoring)
kubectl get crd | grep monitoring.coreos.com
```
#### Enable cert-manager Monitoring Integration
```bash
# Enable Prometheus monitoring in cert-manager now that ServiceMonitor CRDs exist
helm upgrade cert-manager jetstack/cert-manager \
--namespace cert-manager \
--set prometheus.enabled=true \
--set prometheus.servicemonitor.enabled=true \
--reuse-values
```
#### Access Monitoring Services
##### Grafana Dashboard
```bash
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Access via browser: http://localhost:3000
# Username: admin
# Password: admin123 (from values file)
```
##### Prometheus UI
```bash
# Port forward to access Prometheus
kubectl port-forward -n monitoring svc/monitoring-prometheus 9090:9090 &
# Access via browser: http://localhost:9090
# Check targets: http://localhost:9090/targets
```
#### Cleanup Commands (if needed)
If you need to completely remove monitoring for a fresh start:
```bash
# Uninstall Helm release
helm uninstall monitoring -n monitoring
# Remove namespace
kubectl delete namespace monitoring
# Remove any remaining cluster-wide resources
kubectl get clusterroles | grep monitoring | awk '{print $1}' | xargs -r kubectl delete clusterrole
kubectl get clusterrolebindings | grep monitoring | awk '{print $1}' | xargs -r kubectl delete clusterrolebinding
```
#### What we installed
With monitoring successfully deployed:
- Grafana provides pre-configured Kubernetes dashboards
- Prometheus collects metrics from all cluster components
- ServiceMonitor CRDs are available for application-specific metrics
- AlertManager handles alert routing and notifications
### Phase 4: Secrets
#### Step 1: Install the External Secrets Operator
```bash
# Add Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
# Install External Secrets Operator
helm install external-secrets external-secrets/external-secrets \
--namespace external-secrets-system \
--create-namespace
# Verify installation
kubectl get pods -n external-secrets-system
# Check that the CRDs are installed
kubectl get crd | grep external-secrets
```
#### Step 2: Create Scaleway API credentials
You need Scaleway API credentials for the operator:
```bash
# Create secret with Scaleway API credentials
kubectl create secret generic scaleway-credentials \
--namespace eveai-staging \
--from-literal=access-key="YOUR_SCALEWAY_ACCESS_KEY" \
--from-literal=secret-key="YOUR_SCALEWAY_SECRET_KEY"
```
**Note:** You can obtain these credentials via:
- Scaleway Console → Project settings → API Keys
- Or via `scw iam api-key list` if you use the CLI
#### Step 3: Verify the SecretStore configuration
Verify the file `scaleway/manifests/base/secrets/clustersecretstore-scaleway.yaml`. The correct project ID must be entered there.
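A minimal sketch of what this file typically contains, assuming the store name `scaleway-cluster-secret-store` used elsewhere in this guide; region, project ID and the credentials secret reference must match your environment:

```yaml
# Hypothetical excerpt of clustersecretstore-scaleway.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: scaleway-cluster-secret-store
spec:
  provider:
    scaleway:
      region: fr-par
      projectId: "<your-project-id>"   # fill in the correct project ID
      accessKey:
        secretRef:
          name: scaleway-credentials
          namespace: eveai-staging
          key: access-key
      secretKey:
        secretRef:
          name: scaleway-credentials
          namespace: eveai-staging
          key: secret-key
```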
#### Step 4: Verify the ExternalSecret resource
Verify the file `scaleway/manifests/base/secrets/eveai-external-secrets.yaml`
**Important** (see the sketch below):
- The Scaleway provider requires the `key: name:secret-name` syntax
- SSL/TLS certificates cannot be fetched via `dataFrom/extract`
- Certificates must be added via the `data` section
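
A minimal sketch of the pattern used in `eveai-external-secrets.yaml`, assuming the store name above and one JSON secret plus the certificate secret as examples; JSON secrets are flattened via `dataFrom`/`extract`, while the certificate is pulled in as a single key via `data`:

```yaml
# Hypothetical excerpt of eveai-external-secrets.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: eveai-external-secrets
  namespace: eveai-staging
spec:
  refreshInterval: 5m            # secrets re-sync every 5 minutes
  secretStoreRef:
    kind: ClusterSecretStore
    name: scaleway-cluster-secret-store
  target:
    name: eveai-secrets
  dataFrom:
    - extract:
        key: name:eveai-postgresql   # Scaleway provider requires the name: prefix
  data:
    - secretKey: REDIS_CERT
      remoteRef:
        key: name:eveai-redis-certificate
```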
#### Step 5: Deploy the secrets
```bash
# Deploy SecretStore
kubectl apply -f scaleway/manifests/base/secrets/clustersecretstore-scaleway.yaml
# Deploy ExternalSecret
kubectl apply -f scaleway/manifests/base/secrets/eveai-external-secrets.yaml
```
#### Step 6: Verification
```bash
# Check ExternalSecret status
kubectl get externalsecrets -n eveai-staging
# Check whether the Kubernetes secret has been created
kubectl get secret eveai-secrets -n eveai-staging
# Check all keys in the secret
kubectl get secret eveai-secrets -n eveai-staging -o jsonpath='{.data}' | jq 'keys'
# Check a specific value (base64 decoded)
kubectl get secret eveai-secrets -n eveai-staging -o jsonpath='{.data.DB_HOST}' | base64 -d
# Check ExternalSecret events for troubleshooting
kubectl describe externalsecret eveai-external-secrets -n eveai-staging
```
#### Step 7: Use in a deployment
You can now use these secrets in the deployments of the application services that need them (TODO):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eveai-app
  namespace: eveai-staging
spec:
  selector:
    matchLabels:
      app: eveai-app
  template:
    metadata:
      labels:
        app: eveai-app
    spec:
      containers:
        - name: eveai-app
          envFrom:
            - secretRef:
                name: eveai-secrets  # All environment variables from a single secret
          # Your Python code simply uses environ.get('DB_HOST') etc.
```
#### Step 8: Using the Redis certificate in Python
For SSL Redis connections with the certificate:
```python
# Example in your config.py
import tempfile
import ssl
import redis
from os import environ


class StagingConfig:
    def __init__(self):
        self.REDIS_CERT_DATA = environ.get('REDIS_CERT')
        self.REDIS_BASE_URI = environ.get('REDIS_BASE_URI', 'redis://localhost:6379/0')

    def create_redis_connection(self):
        if self.REDIS_CERT_DATA:
            # Write the certificate to a temporary file
            with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.pem') as f:
                f.write(self.REDIS_CERT_DATA)
                cert_path = f.name
            # Redis connection with SSL certificate
            return redis.from_url(
                self.REDIS_BASE_URI,
                ssl_cert_reqs=ssl.CERT_REQUIRED,
                ssl_ca_certs=cert_path
            )
        else:
            return redis.from_url(self.REDIS_BASE_URI)

    # Used for the session Redis
    @property
    def SESSION_REDIS(self):
        return self.create_redis_connection()
```
#### Scaleway Secret Manager Requirements
For this setup, your secrets in Scaleway Secret Manager must be structured correctly:
**JSON secrets (eveai-postgresql, eveai-redis, etc.):**
```json
{
"DB_HOST": "your-postgres-host.rdb.fr-par.scw.cloud",
"DB_USER": "eveai_user",
"DB_PASS": "your-password",
"DB_NAME": "eveai_staging",
"DB_PORT": "5432"
}
```
**SSL/TLS certificate (eveai-redis-certificate):**
```
-----BEGIN CERTIFICATE-----
MIIDGTCCAgGg...z69LXyY=
-----END CERTIFICATE-----
```
#### Benefits of this setup
- **Automatic sync**: secrets are refreshed every 5 minutes
- **No code changes**: your `environ.get()` calls keep working
- **Secure**: credentials are not in the manifests, only in the cluster
- **Centralized**: all secrets live in Scaleway Secret Manager
- **Auditable**: the External Secrets Operator logs all actions
- **SSL support**: TLS certificates are handled correctly
#### File structure
```
scaleway/manifests/base/secrets/
├── scaleway-secret-store.yaml
└── eveai-external-secrets.yaml
```
### Phase 5: TLS and Network Setup
#### Deploy the HTTP ACME ingress
To issue the certificate, create an A-record in the DNS zone that points directly to the IP of the load balancer.
Do not create the CNAME to Bunny.net yet; otherwise bunny.net may interfere with the ACME process.
To issue the certificate we must use an HTTP ACME ingress; otherwise the certificate cannot be created (a sketch of such an ingress follows the apply command).
```bash
kubectl apply -f scaleway/manifests/base/networking/ingress-http-acme.yaml
```
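A sketch of what `ingress-http-acme.yaml` typically contains; the cluster issuer name, backend service and path are assumptions for illustration (check `kubectl get clusterissuers` for the real issuer name). The essential point is that HTTP must stay reachable so the HTTP-01 challenge can complete:

```yaml
# Hypothetical excerpt of ingress-http-acme.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: eveai-staging-acme
  namespace: eveai-staging
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod        # assumed issuer name
    nginx.ingress.kubernetes.io/ssl-redirect: "false"       # keep HTTP open for the ACME challenge
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - evie-staging.askeveai.com
      secretName: evie-staging-tls
  rules:
    - host: evie-staging.askeveai.com
      http:
        paths:
          - path: /verify                                   # illustrative backend
            pathType: Prefix
            backend:
              service:
                name: verification-service                  # hypothetical service name
                port:
                  number: 80
```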
Check whether the certificate has been created (READY must be true):
```bash
kubectl get certificate evie-staging-tls -n eveai-staging
# or with more detail
kubectl -n eveai-staging describe certificate evie-staging-tls
```
This can take a while, but as soon as the certificate has been created you can set up the HTTPS-only ingress:
#### Apply per-prefix headers (must exist before the Ingress that references them)
```bash
kubectl apply -f scaleway/manifests/base/networking/headers-configmaps.yaml
```
#### Apply ingresses
```bash
kubectl apply -f scaleway/manifests/base/networking/ingress-https.yaml # /verify only
kubectl apply -f scaleway/manifests/base/networking/ingress-admin.yaml # /admin → eveai-app-service
kubectl apply -f scaleway/manifests/base/networking/ingress-api.yaml # /api → eveai-api-service
kubectl apply -f scaleway/manifests/base/networking/ingress-chat-client.yaml # /chat-client → eveai-chat-client-service
# Alternative: via the overlay (provided kustomization.yaml has been updated)
kubectl apply -k scaleway/manifests/overlays/staging/
```
To use bunny.net:
- The CNAME record pointing to the Bunny.net Pull zone can now be created.
- In bunny.net, the pull zone must point to the load balancer IP via the HTTPS protocol.
### Phase 6: Verification Service
This service can also be installed as early as Phase 5 to verify that the full network stack (via Bunny, certificate, ...) works.
```bash
# Deploy verification service
kubectl apply -k scaleway/manifests/base/applications/verification/
```
### Phase 7: Complete Staging Deployment
```bash
# Deploy everything using the staging overlay
kubectl apply -k scaleway/manifests/overlays/staging/
# Verify complete deployment
kubectl get all -n eveai-staging
kubectl get ingress -n eveai-staging
kubectl get certificates -n eveai-staging
```
### Verification commands
Check ingresses and headers:
```bash
kubectl -n eveai-staging get ing
kubectl -n eveai-staging describe ing eveai-admin-ingress
kubectl -n eveai-staging describe ing eveai-api-ingress
kubectl -n eveai-staging describe ing eveai-chat-client-ingress
kubectl -n eveai-staging describe ing eveai-staging-ingress # contains /verify
kubectl -n eveai-staging get cm eveai-admin-headers eveai-api-headers eveai-chat-headers -o yaml
```
- Each prefix Ingress must show the annotations: use-regex: true, rewrite-target: /$2, proxy-set-headers: eveai-staging/eveai-<prefix>-headers.
- In the ConfigMaps, the key X-Forwarded-Prefix must have the correct value (/admin, /api, /chat-client); see the sketch below.
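
A minimal sketch of one of these ConfigMaps, based on the names used above; with the NGINX proxy-set-headers annotation, each key/value in the ConfigMap becomes an extra header sent to the backend:

```yaml
# Hypothetical excerpt of headers-configmaps.yaml (one of the three ConfigMaps)
apiVersion: v1
kind: ConfigMap
metadata:
  name: eveai-admin-headers
  namespace: eveai-staging
data:
  X-Forwarded-Prefix: /admin
```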
End-to-end tests:
- https://evie-staging.askeveai.com/admin/login → login page. In the app logs you see the PATH without /admin (due to the rewrite) but the URL with /admin.
- After login: 302 Location: /admin/user/tenant_overview.
- API: https://evie-staging.askeveai.com/api/… → backend receives the path without /api.
- Chat client: https://evie-staging.askeveai.com/chat-client/… → correct service.
- Verify: https://evie-staging.askeveai.com/verify → unchanged via ingress-https.yaml.
- Root: as long as the Bunny rule is not active, there is no automatic redirect on / (expected behavior).
### Phase 8: Install PgAdmin Tool
#### Create the secret eveai-pgadmin-admin in Scaleway Secret Manager (if it does not exist)
Two keys:
- `PGADMIN_DEFAULT_EMAIL`: email address of the admin
- `PGADMIN_DEFAULT_PASSWORD`: password for the admin
#### Deploy the secrets
```bash
kubectl apply -f scaleway/manifests/base/tools/pgadmin/externalsecrets.yaml
# Check
kubectl get externalsecret -n tools
kubectl get secret -n tools | grep pgadmin
```
#### Apply the Helm chart
```bash
helm repo add runix https://helm.runix.net
helm repo update
helm install pgadmin runix/pgadmin4 \
-n tools \
--create-namespace \
-f scaleway/manifests/base/tools/pgadmin/values.yaml
# Check status
kubectl get pods,svc -n tools
kubectl logs -n tools deploy/pgadmin-pgadmin4 || true
```
#### Port Forward, Local Access
```bash
# Find the service name (often "pgadmin")
kubectl -n tools get svc
# Forward local port 8080 to service port 80
kubectl -n tools port-forward svc/pgadmin-pgadmin4 8080:80
# Browser: http://localhost:8080
# Login with PGADMIN_DEFAULT_EMAIL / PGADMIN_DEFAULT_PASSWORD (from eveai-pgadmin-admin)
```
### Phase 9: RedisInsight Tool Deployment
#### Installation via kubectl (without Helm)
Use a simple manifest with Deployment + Service + PVC in the `tools` namespace. This avoids external chart repositories and extra authentication.
```bash
# Apply the manifest (creates the tools namespace if needed)
kubectl apply -f scaleway/manifests/base/tools/redisinsight/redisinsight.yaml
# Check resources
kubectl -n tools get pods,svc,pvc
```
#### (Optional) ExternalSecrets for convenience (not strictly needed)
If you want to mirror the Redis credentials and CA certificate into the `tools` namespace (handy for easily exporting the CA file and/or provisioning later):
```bash
kubectl apply -f scaleway/manifests/base/tools/redisinsight/externalsecrets.yaml
kubectl -n tools get externalsecret
kubectl -n tools get secret | grep redisinsight
```
Save the CA file locally for upload in the UI (only needed if you used the ExternalSecrets):
```bash
kubectl -n tools get secret redisinsight-ca -o jsonpath='{.data.REDIS_CERT}' | base64 -d > /tmp/redis-ca.pem
```
#### Port Forward, Local Access
```bash
# RedisInsight v2 listens on port 5540
kubectl -n tools port-forward svc/redisinsight 5540:5540
# Browser: http://localhost:5540
```
#### UI: Connect to Redis
- Host: `172.16.16.2`
- Port: `6379`
- Auth: username `luke`, password from the secret (eveai-redis or redisinsight-redis); see the snippet after this list
- TLS: enable TLS and upload the CA certificate (PEM)
- Certificate verification: because you connect via IP and there is no hostname in the certificate, strict verification may fail. In that case disable "Verify server certificate"/"Check server identity" in the UI. This is normal with private networking via IP.
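
If you applied the optional ExternalSecrets above, the password can be read from the mirrored secret; the secret name matches the list above, but the key name `REDIS_PASS` is an assumption, so adjust it to your ExternalSecret definition:

```bash
# Assumes the mirrored secret is named redisinsight-redis and has a REDIS_PASS key
kubectl -n tools get secret redisinsight-redis -o jsonpath='{.data.REDIS_PASS}' | base64 -d; echo
```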
#### Troubleshooting
- Check pods, service and PVC in `tools`:
```bash
kubectl -n tools get pods,svc,pvc
```
- NetworkPolicies: if active, allow egress from `tools` to `172.16.16.2:6379`.
- TLS issues via IP: disable verification, or use a DNS hostname that matches the certificate (if available).
- PVC not bound: specify a valid `storageClassName` in the manifest.
### Phase 10: Application Services Deployment
#### Create Scaleway Registry Secret
Create docker pull secret via External Secrets (once):
```bash
kubectl apply -f scaleway/manifests/base/secrets/scaleway-registry-secret.yaml
kubectl -n eveai-staging get secret scaleway-registry-cred -o yaml | grep "type: kubernetes.io/dockerconfigjson"
```
#### Ops Jobs Invocation (if required)
Run the DB ops scripts manually in order. Each manifest uses generateName; use kubectl create.
```bash
kubectl create -f scaleway/manifests/base/applications/ops/jobs/00-env-check-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=env-check --timeout=600s
kubectl create -f scaleway/manifests/base/applications/ops/jobs/02-db-bootstrap-ext-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-bootstrap-ext --timeout=1800s
kubectl create -f scaleway/manifests/base/applications/ops/jobs/03-db-migrate-public-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-migrate-public --timeout=1800s
kubectl create -f scaleway/manifests/base/applications/ops/jobs/04-db-migrate-tenant-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-migrate-tenant --timeout=3600s
kubectl create -f scaleway/manifests/base/applications/ops/jobs/05-seed-or-init-data-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-seed-or-init --timeout=1800s
kubectl create -f scaleway/manifests/base/applications/ops/jobs/06-verify-minimal-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-verify-minimal --timeout=900s
```
View logs (you can see the created job name as a result from the create command):
```bash
kubectl -n eveai-staging get jobs
kubectl -n eveai-staging logs job/<created-job-name>
```
#### Creating volume for eveai_chat_worker's crewai storage
```bash
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
```
#### Application Services Deployment
Use the staging overlay to deploy apps with registry rewrite and imagePullSecrets:
```bash
kubectl apply -k scaleway/manifests/overlays/staging/
```
##### Deploy backend workers
```bash
kubectl apply -k scaleway/manifests/base/applications/backend/
kubectl -n eveai-staging get deploy | egrep 'eveai-(workers|chat-workers|entitlements)'
# Optional: quick logs
kubectl -n eveai-staging logs deploy/eveai-workers --tail=100 || true
kubectl -n eveai-staging logs deploy/eveai-chat-workers --tail=100 || true
kubectl -n eveai-staging logs deploy/eveai-entitlements --tail=100 || true
```
##### Deploy frontend services
```bash
kubectl apply -k scaleway/manifests/base/applications/frontend/
kubectl -n eveai-staging get deploy,svc | egrep 'eveai-(app|api|chat-client)'
```
##### Verify Ingress routes (Ingress managed separately)
Ingress is intentionally not managed by the staging Kustomize overlay. Apply or update it manually using your existing manifest and handle it per your cluster-install.md guide:
```bash
kubectl apply -f scaleway/manifests/base/networking/ingress-https.yaml
kubectl -n eveai-staging describe ingress eveai-staging-ingress
```
Then verify the routes:
```bash
curl -k https://evie-staging.askeveai.com/verify/health
curl -k https://evie-staging.askeveai.com/admin/healthz/ready
curl -k https://evie-staging.askeveai.com/api/healthz/ready
curl -k https://evie-staging.askeveai.com/client/healthz/ready
```
#### Updating the staging deployment
- If you have pushed the images again with the same tag (e.g. :staging) and your staging pods use imagePullPolicy: Always (as in this guide), you only need to trigger a rollout so the pods restart and pull the latest image.
- Do this in the correct namespace (probably eveai-staging) with kubectl rollout restart.
##### Fastest way (all deployments at once)
```bash
# Staging namespace (adjust if you use a different one)
kubectl -n eveai-staging rollout restart deployment
# Optional: follow the status until everything is ready
kubectl -n eveai-staging rollout status deploy --all
# Check which image each pod is running
kubectl -n eveai-staging get pods -o=jsonpath='{range .items[*]}{@.metadata.name}{"\t"}{range .spec.containers[*]}{@.image}{" "}{end}{"\n"}{end}'
```
This restarts all Deployments in the namespace. Because imagePullPolicy: Always is set, Kubernetes will pull the latest image for the tag in use (e.g. :staging).
##### Restart specific services
If you only want to restart certain services:
```bash
kubectl -n eveai-staging rollout restart deployment/eveai-app
kubectl -n eveai-staging rollout restart deployment/eveai-api
kubectl -n eveai-staging rollout restart deployment/eveai-chat-client
kubectl -n eveai-staging rollout restart deployment/eveai-workers
kubectl -n eveai-staging rollout restart deployment/eveai-chat-workers
kubectl -n eveai-staging rollout restart deployment/eveai-entitlements
kubectl -n eveai-staging rollout status deployment/eveai-app
```
##### Alternative: (re)apply the manifests
This guide places the manifests in scaleway/manifests and describes the use of Kustomize overlays. You can also simply re-apply:
```bash
# Overlay that rewrites images to the Scaleway registry and adds imagePullSecrets
kubectl apply -k scaleway/manifests/overlays/staging/
# Backend and frontend (if you use the base separately)
kubectl apply -k scaleway/manifests/base/applications/backend/
kubectl apply -k scaleway/manifests/base/applications/frontend/
```
Note: apply alone does not always trigger a rollout if there is no substantive spec change. Combine it with a rollout restart as above if needed.
##### If you work with version tags (production-like)
- If you do not use a channel tag (:staging/:production) but a fixed, version-bound tag (e.g. :v1.2.3) and imagePullPolicy: IfNotPresent, then you must either:
- change the tag in your manifest/overlay to the new version and re-apply, or
- force a new ReplicaSet with a one-off set-image:
```bash
kubectl -n eveai-staging set image deploy/eveai-api eveai-api=rg.fr-par.scw.cloud/<namespace>/josakola/eveai-api:v1.2.4
kubectl -n eveai-staging rollout status deploy/eveai-api
```
##### Troubleshooting
- Check that the registry pull secret is present (as described in this guide):
```bash
kubectl apply -f scaleway/manifests/base/secrets/scaleway-registry-secret.yaml
kubectl -n eveai-staging get secret scaleway-registry-cred
```
- Inspect events/logs if pods do not come up:
```bash
kubectl get events -n eveai-staging --sort-by=.lastTimestamp
kubectl -n eveai-staging describe pod <pod-name>
kubectl -n eveai-staging logs deploy/eveai-api --tail=200
```
### Phase 11: Cockpit Setup
#### Standard Cockpit Setup
- Create a grafana user (Cockpit > Grafana Users > Add user)
- Open Grafana Dashboard (Cockpit > Open Dashboards)
- Quite a few dashboards are available:
- Kubernetes cluster overview (metrics)
- Kubernetes cluster logs (controlplane logs)
### Phase 12: Flower Setup
#### Overview
Flower is the Celery monitoring UI. We deploy Flower in the `monitoring` namespace via the bjw-s/app-template Helm chart. There is no Ingress; access is local only via `kubectl port-forward`. The connection to Redis uses TLS with your private CA; hostname verification is disabled because you connect via IP.
#### Add the Helm repository
```bash
helm repo add bjw-s https://bjw-s-labs.github.io/helm-charts
helm repo update
helm search repo bjw-s/app-template
```
#### Deploy (recommended: Flower only, via the Helm CLI)
Use targeted commands so that only Flower is managed by Helm and the rest of the monitoring stack is left untouched.
```bash
# 1) Create the ExternalSecrets and NetworkPolicy
kubectl apply -f scaleway/manifests/base/monitoring/flower/externalsecrets.yaml
kubectl apply -f scaleway/manifests/base/monitoring/flower/networkpolicy.yaml
# 2) Install Flower via Helm (this release only)
helm upgrade --install flower bjw-s/app-template \
-n monitoring --create-namespace \
-f scaleway/manifests/base/monitoring/flower/values.yaml
```
What this deploys:
- ExternalSecrets: `flower-redis` (REDIS_USER/PASS/URL/PORT) and `flower-ca` (REDIS_CERT) from `scaleway-cluster-secret-store`
- Flower via Helm (bjw-s/app-template):
  - Image: `mher/flower:2.0.1` (pinned)
  - Start command: `/usr/local/bin/celery --broker=$(BROKER) flower --address=0.0.0.0 --port=5555`
  - TLS to Redis with the CA mounted at `/etc/ssl/redis/ca.pem` and `ssl_check_hostname=false`
  - Hardened securityContext (non-root, read-only rootfs, capabilities dropped)
  - Probes and resource requests/limits
  - Service: ClusterIP `flower` on port 5555
- NetworkPolicy: ingress default-deny; egress only to Redis (172.16.16.2:6379/TCP) and CoreDNS (53 TCP/UDP)
#### Verify
```bash
# Helm release and resources
helm list -n monitoring
kubectl -n monitoring get externalsecret
kubectl -n monitoring get secret | grep flower
kubectl -n monitoring get deploy,po,svc | grep flower
kubectl -n monitoring logs deploy/flower --tail=200 || true
```
#### Access (port-forward)
```bash
kubectl -n monitoring port-forward svc/flower 5555:5555
# Browser: http://localhost:5555
```
#### Security & TLS
- No Ingress/external traffic; port-forward only.
- TLS to Redis with the CA mounted at `/etc/ssl/redis/ca.pem`.
- Because Redis is reached via IP, `ssl_check_hostname=false` is set.
- Strict egress NetworkPolicy: update the IP if your Redis IP changes.
#### Troubleshooting
```bash
# Secrets and ExternalSecrets
kubectl -n monitoring describe externalsecret flower-redis
kubectl -n monitoring describe externalsecret flower-ca
# Pods & logs
kubectl -n monitoring get pods -l app=flower -w
kubectl -n monitoring logs deploy/flower --tail=200
# NetworkPolicy
kubectl -n monitoring describe networkpolicy flower-policy
```
#### Alternative: Kustomize rendering (caution!)
You can also render Flower via Kustomize together with the monitoring chart:
```bash
kubectl kustomize --enable-helm scaleway/manifests/base/monitoring | kubectl apply -f -
```
Caution: this renders and applies all resources in the monitoring Kustomization, including the kube-prometheus-stack chart. Only use this if you deliberately want to update the full monitoring stack declaratively.
#### Migration & Cleanup
If you previously used the standalone Deployment/Service:
```bash
kubectl -n monitoring delete deploy flower --ignore-not-found
kubectl -n monitoring delete svc flower --ignore-not-found
```
## Verification and Testing
### Check Infrastructure Status
```bash
# Verify ingress controller
kubectl get pods -n ingress-nginx
kubectl describe service ingress-nginx-controller -n ingress-nginx
# Verify cert-manager
kubectl get pods -n cert-manager
kubectl get clusterissuers
# Check certificate status (may take a few minutes to issue)
kubectl describe certificate evie-staging-tls -n eveai-staging
```
### Test Services
```bash
# Get external IP from LoadBalancer
kubectl get svc -n ingress-nginx ingress-nginx-controller
# Test HTTPS access (replace with your domain)
curl -k https://evie-staging.askeveai.com/verify/health
curl -k https://evie-staging.askeveai.com/verify/info
# Test monitoring (if deployed)
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Access Grafana at http://localhost:3000 (admin/admin123)
```
## DNS Configuration
### Update DNS Records
- Create A-record pointing to LoadBalancer external IP
- Or set up CNAME if using CDN
### Test Domain Access
```bash
# Test domain resolution
nslookup evie-staging.askeveai.com
# Test HTTPS access via domain
curl https://evie-staging.askeveai.com/verify/
```
## EveAI Chat Workers: Persistent logs storage and Celery process behavior
This addendum describes how to enable persistent storage for CrewAI tuning runs under /app/logs for the eveai-chat-workers Deployment and clarifies Celery process behavior relevant to environment variables.
### Celery prefork behavior and env variables
- Pool: prefork (default). Each worker process (child) handles multiple tasks sequentially.
- Implication: any environment variable changed inside a child process persists for subsequent tasks handled by that same child, until it is changed again or the process is recycled.
- Our practice: set required env vars (e.g., CREWAI_STORAGE_DIR/CREWAI_STORAGE_PATH) immediately before initializing CrewAI and restore them immediately after. This prevents leakage to the next task in the same process (see the sketch after this list).
- CELERY_MAX_TASKS_PER_CHILD: the number of tasks a child will process before being recycled. Suggested starting range for heavy LLM/RAG workloads: 200-500; 1000 is acceptable if memory growth is stable. Monitor RSS and adjust.
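
A minimal sketch of the set-and-restore pattern described above; everything except the CREWAI_STORAGE_DIR variable name is illustrative:

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(**overrides):
    """Temporarily set environment variables, restoring the previous values afterwards."""
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update({key: str(value) for key, value in overrides.items()})
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


# Usage inside a Celery task, just before CrewAI initialization (illustrative):
# with scoped_env(CREWAI_STORAGE_DIR="/app/logs/run-123"):
#     crew = build_crew()   # hypothetical helper that initializes CrewAI
#     crew.kickoff()
```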
### Create and mount a PersistentVolumeClaim for /app/logs
We persist tuning outputs under /app/logs by mounting a PVC in the worker pod.
Manifests added/updated (namespace: eveai-staging):
- scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
- scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml (volume mount added)
Apply with kubectl (no Kustomize required):
```bash
# Create or update the PVC for logs
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/pvc.yaml
# Update the Deployment to mount the PVC at /app/logs
kubectl apply -n eveai-staging -f scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml
```
Verify PVC is bound and the pod mounts the volume:
```bash
# Check PVC status
kubectl get pvc -n eveai-staging eveai-chat-workers-logs -o wide
# Inspect the pod to confirm the volume mount
kubectl get pods -n eveai-staging -l app=eveai-chat-workers -o name
kubectl describe pod -n eveai-staging <pod-name>
# (Optional) Exec into the pod to check permissions and path
kubectl exec -n eveai-staging -it <pod-name> -- sh -lc 'id; ls -ld /app/logs'
```
Permissions and securityContext notes:
- The container runs as a non-root user (appuser) per Dockerfile.base. Some storage classes mount volumes owned by root. If you encounter permission issues (EACCES) writing to /app/logs:
- Option A: set a pod-level fsGroup so the mounted volume is group-writable by the container user (see the sketch after this list).
- Option B: use an initContainer to chown/chmod /app/logs on the mounted volume.
- Keep monitoring PVC usage and set alerts to avoid running out of space.
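
A sketch of Option A as an excerpt of the eveai-chat-workers Deployment; the fsGroup value is an assumption and must match (or be writable by) the appuser group in the image:

```yaml
# Hypothetical excerpt of eveai-chat-workers/deployment.yaml
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000                      # assumed GID of appuser; adjust to the image
        fsGroupChangePolicy: OnRootMismatch
```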
Retention / cleanup recommendation:
- For a 14-day retention, create a CronJob that runs daily to remove files older than 14 days and then delete empty directories, mounting the same PVC at /app/logs. Example command (a CronJob sketch follows below):
```bash
find /app/logs -type f -mtime +14 -print -delete; find /app/logs -type d -empty -mtime +14 -print -delete
```
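A sketch of such a CronJob, assuming the PVC name eveai-chat-workers-logs used earlier in this guide and a generic busybox image; schedule and image are illustrative:

```yaml
# Hypothetical logs-cleanup CronJob (14-day retention)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: eveai-chat-workers-logs-cleanup
  namespace: eveai-staging
spec:
  schedule: "30 3 * * *"          # daily at 03:30
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: busybox:1.36
              command:
                - sh
                - -c
                - find /app/logs -type f -mtime +14 -print -delete; find /app/logs -type d -empty -mtime +14 -print -delete
              volumeMounts:
                - name: logs
                  mountPath: /app/logs
          volumes:
            - name: logs
              persistentVolumeClaim:
                claimName: eveai-chat-workers-logs
```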
Operational checks after deployment:
1) Trigger a CrewAI tuning run; verify files appear under /app/logs and remain after pod restarts.
2) Trigger a non-tuning run; verify temporary directories are created and cleaned up automatically.
3) Monitor memory while varying CELERY_CONCURRENCY and CELERY_MAX_TASKS_PER_CHILD.