Josako af8b5f54cd - Definition and Improvements to job-system

- Definition of k8s pods for application services

2025-09-04 11:49:19 +02:00

Phase 8: Application Services (Staging)

This guide describes how to deploy EveAI application services to the Scaleway Kubernetes cluster, building on Phases 1–7 in cluster-install.md.

Prerequisites

Ingress-NGINX running with external IP
cert-manager installed and Certificate evie-staging-tls is READY (via HTTP ACME first, then HTTPS-only)
External Secrets Operator installed; Kubernetes Secret eveai-secrets exists in namespace eveai-staging
Verification service deployed and reachable via /verify
Optional: Monitoring stack running, Pushgateway deployed or reachable; PUSH_GATEWAY_HOST/PORT available to apps (via eveai-secrets)

What we deploy (structure)

Frontend (web) services
- eveai-app → exposed at /admin
- eveai-api → exposed at /api
- eveai-chat-client → exposed at /client
Backend worker services (internal)
- eveai-workers (queue: embeddings)
- eveai-chat-workers (queue: llm_interactions)
- eveai-entitlements (queue: entitlements)
Ops Jobs (manual DB ops)
- 00-env-check
- 02-db-bootstrap-ext
- 03-db-migrate-public
- 04-db-migrate-tenant
- 05-seed-or-init-data
- 06-verify-minimal

Manifests are under:

scaleway/manifests/base/applications/frontend/
scaleway/manifests/base/applications/backend/
scaleway/manifests/base/applications/ops/jobs/
Aggregate kustomization (apps only): scaleway/manifests/base/applications/kustomization.yaml

Note:

The staging Kustomize overlay deploys only frontend and backend apps.
Ingress remains managed manually via scaleway/manifests/base/networking/ingress-https.yaml and your cluster-install.md guide.
Ops Jobs are not part of the overlay and should be executed manually with kubectl create -f.

Step 1: Validate secrets

kubectl get secret eveai-secrets -n eveai-staging
kubectl get secret eveai-secrets -n eveai-staging -o jsonpath='{.data}' | jq 'keys'

Confirm the presence of DB_*, REDIS_*, OPENAI_API_KEY, MISTRAL_API_KEY, JWT_SECRET_KEY, API_ENCRYPTION_KEY, MINIO_*, PUSH_GATEWAY_HOST, and PUSH_GATEWAY_PORT.
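
The key check above can be scripted. The following is a minimal sketch, not part of the repository: check_required_keys is a hypothetical helper that reads key names on stdin and treats any required name ending in "_" (such as DB_ or REDIS_) as a prefix.

```shell
# Hypothetical helper: prints each required name that is absent from the
# key list on stdin. A name ending in "_" is matched as a prefix
# (DB_ matches DB_HOST); any other name must match exactly.
check_required_keys() {
  keys=$(cat)
  missing=0
  for want in "$@"; do
    case "$want" in
      *_) pattern="^${want}" ;;
      *)  pattern="^${want}\$" ;;
    esac
    if ! printf '%s\n' "$keys" | grep -Eq "$pattern"; then
      echo "missing: $want"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes access to the cluster):
# kubectl get secret eveai-secrets -n eveai-staging \
#   -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' |
#   check_required_keys DB_ REDIS_ OPENAI_API_KEY MISTRAL_API_KEY \
#     JWT_SECRET_KEY API_ENCRYPTION_KEY MINIO_ PUSH_GATEWAY_HOST PUSH_GATEWAY_PORT
```

The function returns non-zero when anything is missing, so it can gate the later steps in a script.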

Step 2: Deploy Ops Jobs (manual pre-deploy)

Run the DB ops Jobs manually, in order. Each manifest uses generateName, which kubectl apply does not support, so use kubectl create.

Notes for images:

Ops Jobs now reference the private Scaleway registry directly and set imagePullPolicy: Always.
Ensure the docker pull secret exists (scaleway-registry-cred) — see the Private registry section.
After pushing a new :staging image, delete any previous Job (if present) and create a new one to force a fresh Pod pull.

kubectl create -f scaleway/manifests/base/applications/ops/jobs/00-env-check-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=env-check --timeout=600s

kubectl create -f scaleway/manifests/base/applications/ops/jobs/02-db-bootstrap-ext-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-bootstrap-ext --timeout=1800s

kubectl create -f scaleway/manifests/base/applications/ops/jobs/03-db-migrate-public-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-migrate-public --timeout=1800s

kubectl create -f scaleway/manifests/base/applications/ops/jobs/04-db-migrate-tenant-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-migrate-tenant --timeout=3600s

kubectl create -f scaleway/manifests/base/applications/ops/jobs/05-seed-or-init-data-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-seed-or-init --timeout=1800s

kubectl create -f scaleway/manifests/base/applications/ops/jobs/06-verify-minimal-job.yaml
kubectl wait --for=condition=complete job -n eveai-staging -l job-type=db-verify-minimal --timeout=900s

View logs:

kubectl -n eveai-staging get jobs
kubectl -n eveai-staging logs job/<created-job-name>
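
The create-and-wait pairs above can be wrapped in one function. This is a sketch under stated assumptions: run_ops_job is a hypothetical helper, and the manifests are assumed to set their own namespace (as the plain kubectl create commands above imply). It waits on the exact Job that was just created rather than on a label, so an older Job with the same job-type label cannot satisfy the wait. Note that kubectl wait --for=condition=complete does not return early when a Job fails; it runs until the timeout.

```shell
NAMESPACE="${NAMESPACE:-eveai-staging}"

# Hypothetical helper: create one Ops Job from a manifest and wait for it.
# $1 = manifest path, $2 = timeout (e.g. 600s)
run_ops_job() {
  manifest=$1
  timeout=$2
  # -o name prints the generated resource, e.g. job.batch/<generated-name>
  created=$(kubectl create -f "$manifest" -o name) || return 1
  job="job/${created##*/}"
  echo "created $job"
  if ! kubectl -n "$NAMESPACE" wait --for=condition=complete "$job" --timeout="$timeout"; then
    echo "job did not complete: $job" >&2
    kubectl -n "$NAMESPACE" logs "$job" --tail=50 >&2 || true
    return 1
  fi
}

# Usage:
# run_ops_job scaleway/manifests/base/applications/ops/jobs/00-env-check-job.yaml 600s
```

Chaining the calls with && stops the sequence at the first Job that does not complete.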

Runtime environment for Ops Jobs

Each Ops Job sets the same non-secret runtime variables required by the shared bootstrap (start.sh/run.py):

FLASK_APP=/app/scripts/run.py
COMPONENT_NAME=eveai_ops
PYTHONUNBUFFERED=1
LOGLEVEL=debug (for staging)
ROLE=web
PORT=8080
WORKERS=1
WORKER_CLASS=gevent
WORKER_CONN=100
MAX_REQUESTS=1000
MAX_REQUESTS_JITTER=100

Secrets (DB_*, REDIS_*, etc.) still come from envFrom: secretRef: eveai-secrets.

Tip: After pushing a new :staging image, delete any previous Job with the same label to force a fresh Pod and pull:

kubectl -n eveai-staging delete job -l component=ops,job-type=db-migrate-public || true
kubectl create -f scaleway/manifests/base/applications/ops/jobs/03-db-migrate-public-job.yaml

Step 3: Deploy backend workers

kubectl apply -k scaleway/manifests/base/applications/backend/

kubectl -n eveai-staging get deploy | grep -E 'eveai-(workers|chat-workers|entitlements)'
# Optional: quick logs
kubectl -n eveai-staging logs deploy/eveai-workers --tail=100 || true
kubectl -n eveai-staging logs deploy/eveai-chat-workers --tail=100 || true
kubectl -n eveai-staging logs deploy/eveai-entitlements --tail=100 || true
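
Listing the Deployments shows that they exist, not that their Pods became ready. As a sketch, the hypothetical helper wait_for_rollouts below blocks on kubectl rollout status for each Deployment; the 300s timeout is an assumption, not a value from the manifests.

```shell
# Hypothetical helper: wait until every named Deployment has rolled out.
# $1 = namespace, remaining args = Deployment names
wait_for_rollouts() {
  ns=$1
  shift
  failed=0
  for d in "$@"; do
    if ! kubectl -n "$ns" rollout status "deployment/$d" --timeout=300s; then
      echo "rollout failed: $d" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
# wait_for_rollouts eveai-staging eveai-workers eveai-chat-workers eveai-entitlements
```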

Step 4: Deploy frontend services

kubectl apply -k scaleway/manifests/base/applications/frontend/

kubectl -n eveai-staging get deploy,svc | grep -E 'eveai-(app|api|chat-client)'

Step 5: Verify Ingress routes (Ingress managed separately)

Ingress is intentionally not managed by the staging Kustomize overlay. Apply or update it manually from the existing manifest, following the cluster-install.md guide:

kubectl apply -f scaleway/manifests/base/networking/ingress-https.yaml
kubectl -n eveai-staging describe ingress eveai-staging-ingress

Then verify the routes:

curl -k https://evie-staging.askeveai.com/verify/health
curl -k https://evie-staging.askeveai.com/admin/healthz/ready
curl -k https://evie-staging.askeveai.com/api/healthz/ready
curl -k https://evie-staging.askeveai.com/client/healthz/ready
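
To get a pass/fail result instead of reading four response bodies, the same requests can be looped. A minimal sketch, assuming that a healthy route answers with HTTP 200; check_routes is a hypothetical helper, and -k is kept only because the commands above use it.

```shell
# Hypothetical helper: request each path and print "<status> <path>".
# Returns non-zero if any route does not answer with HTTP 200.
# $1 = base URL, remaining args = paths
check_routes() {
  base=$1
  shift
  rc=0
  for path in "$@"; do
    code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 "$base$path") || code=000
    echo "$code $path"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}

# Usage:
# check_routes https://evie-staging.askeveai.com \
#   /verify/health /admin/healthz/ready /api/healthz/ready /client/healthz/ready
```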

Resources and probes (staging defaults)

Web (app, api, chat-client):
- requests: 150m CPU, 256Mi RAM; limits: 500m CPU, 512Mi RAM; replicas: 1
- readiness/liveness: GET /healthz/ready
Workers:
- eveai-workers: req 200m CPU/512Mi RAM, lim 1 CPU/1Gi RAM
- eveai-chat-workers: req 500m CPU/1Gi RAM, lim 2 CPU/3Gi RAM
- eveai-entitlements: req 100m CPU/256Mi RAM, lim 500m CPU/512Mi RAM
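
For node sizing it helps to add these defaults up. The arithmetic below is a sketch that assumes one replica per worker Deployment (the source states replicas: 1 only for the web services) and converts CPU to millicores and memory to Mi.

```shell
# Three web services (app, api, chat-client), one replica each.
web_count=3

# Requests: web + eveai-workers + eveai-chat-workers + eveai-entitlements
cpu_req=$(( web_count * 150 + 200 + 500 + 100 ))    # millicores
mem_req=$(( web_count * 256 + 512 + 1024 + 256 ))   # Mi

# Limits, same order (1 CPU = 1000m, 1Gi = 1024Mi)
cpu_lim=$(( web_count * 500 + 1000 + 2000 + 500 ))  # millicores
mem_lim=$(( web_count * 512 + 1024 + 3072 + 512 ))  # Mi

echo "requests: ${cpu_req}m CPU, ${mem_req}Mi RAM"
echo "limits:   ${cpu_lim}m CPU, ${mem_lim}Mi RAM"
```

Under these assumptions the scheduler needs 1250m CPU and 2560Mi of allocatable capacity for the application Pods, and their combined limits come to 5000m CPU and 6144Mi.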

Pushgateway usage

Ensure PUSH_GATEWAY_HOST and PUSH_GATEWAY_PORT are provided (e.g., pushgateway.monitoring.svc.cluster.local:9091), typically via eveai-secrets or a ConfigMap.
Apps will continue to push business metrics; Prometheus scrapes the Pushgateway.
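
A quick way to confirm that the host and port are reachable is to push a throwaway metric by hand. This sketch uses the Pushgateway's standard /metrics/job/<job> endpoint; the metric name eveai_smoke_test and the job name are hypothetical, and the command must run somewhere that can resolve the service DNS name (for example, inside a Pod in the cluster).

```shell
# Build the push URL for a given host, port, and job name.
pushgateway_url() {
  printf 'http://%s:%s/metrics/job/%s\n' "$1" "$2" "$3"
}

# Hypothetical smoke test: push one sample using the configured host/port.
# $1 = job name to group the metric under
push_test_metric() {
  url=$(pushgateway_url "$PUSH_GATEWAY_HOST" "$PUSH_GATEWAY_PORT" "$1")
  printf 'eveai_smoke_test 1\n' | curl -sS --fail --data-binary @- "$url"
}

# Usage:
# PUSH_GATEWAY_HOST=pushgateway.monitoring.svc.cluster.local \
# PUSH_GATEWAY_PORT=9091 push_test_metric eveai_smoke
```

A test group created this way can be removed again with an HTTP DELETE on the same URL.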

Image tags strategy (staging/production channels)

The push script now creates and pushes two tags per service:
- A versioned tag: :vX.Y.Z (e.g., :v1.2.3)
- An environment channel tag based on ENVIRONMENT: :staging or :production
Recommendation for staging manifests:
- Refer to the channel tag (e.g., rg.fr-par.scw.cloud/eveai-staging/...:staging) and set imagePullPolicy: Always so new pushes are picked up without manifest changes.
Production can later use immutable version tags or digests via a production overlay.
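
The two-tag scheme can be illustrated as follows. This is not the actual push script, which is not shown here; image_tags is a hypothetical function, and the repository path in the usage line uses a placeholder service name.

```shell
# Hypothetical illustration of the tagging scheme: for one image repository,
# print the versioned tag and the environment channel tag.
# $1 = image repository, $2 = version without the leading "v", $3 = ENVIRONMENT
image_tags() {
  repo=$1
  version=$2
  environment=$3
  printf '%s:v%s\n' "$repo" "$version"
  printf '%s:%s\n' "$repo" "$environment"
}

# Usage (example-service is a placeholder name):
# image_tags rg.fr-par.scw.cloud/eveai-staging/example-service 1.2.3 staging
```

The versioned tag identifies a specific build, while the channel tag is reassigned on every push, which is why staging pairs it with imagePullPolicy: Always.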

Bunny.net WAF (TODO)

Configure Pull Zone for evie-staging.askeveai.com
Set Origin to the LoadBalancer IP with HTTPS and Host header evie-staging.askeveai.com
Define rate limits primarily on /api, looser on /client; enable bot filtering
Only switch DNS (CNAME) to Bunny after TLS issuance completed directly against LoadBalancer

Troubleshooting

kubectl get all -n eveai-staging
kubectl get events -n eveai-staging --sort-by=.lastTimestamp
kubectl describe ingress eveai-staging-ingress -n eveai-staging
kubectl logs -n eveai-staging deploy/eveai-api --tail=200

Rollback / Cleanup

# Remove frontend/backend (keeps verification and other base resources)
kubectl delete -k scaleway/manifests/base/applications/frontend/
kubectl delete -k scaleway/manifests/base/applications/backend/

# Finished Jobs are retained until their ttlSecondsAfterFinished expires; to delete them immediately:
kubectl -n eveai-staging delete jobs --all

Private registry (Scaleway)

Create docker pull secret via External Secrets (once):

kubectl apply -f scaleway/manifests/base/secrets/scaleway-registry-secret.yaml
kubectl -n eveai-staging get secret scaleway-registry-cred -o yaml | grep "type: kubernetes.io/dockerconfigjson"

Use the staging overlay to deploy apps with registry rewrite and imagePullSecrets:

kubectl apply -k scaleway/manifests/overlays/staging/

Notes:

Base manifests keep generic images (josakola/...). The overlay rewrites them to rg.fr-par.scw.cloud/eveai-staging/josakola/...:staging and adds imagePullSecrets to all Pods.
Staging uses imagePullPolicy: Always, so new pushes to :staging are pulled automatically.
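
After applying the overlay, it is worth confirming that the pull secret actually reached the Pod templates. A minimal sketch: uses_pull_secret is a hypothetical helper that reads imagePullSecrets from a Deployment's Pod template via jsonpath.

```shell
# Hypothetical helper: succeed if the Deployment's Pod template lists the
# given imagePullSecret.
# $1 = namespace, $2 = Deployment name, $3 = secret name
uses_pull_secret() {
  ns=$1
  deploy=$2
  secret=$3
  names=$(kubectl -n "$ns" get deployment "$deploy" \
    -o jsonpath='{.spec.template.spec.imagePullSecrets[*].name}') || return 1
  # jsonpath prints the names separated by spaces
  for n in $names; do
    [ "$n" = "$secret" ] && return 0
  done
  return 1
}

# Usage:
# uses_pull_secret eveai-staging eveai-api scaleway-registry-cred && echo ok
```

The rewritten image reference can be inspected the same way with the jsonpath expression {.spec.template.spec.containers[*].image}.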
