- Change manifests for Prometheus installation
- Change instructions for deploying Prometheus stack and Pushgateway
- Add grouping keys to Pushgateway pushes so that metrics from different pods / processes do not overwrite each other
- Bugfix to ensure CSS and JS files are retrieved correctly in eveai_app
79
documentation/PUSHGATEWAY_GROUPING.md
Normal file
@@ -0,0 +1,79 @@
# Pushgateway Grouping Keys (instance, namespace, process)

Goal: prevent metrics pushed by different Pods or worker processes from overwriting each other, while keeping Prometheus/Grafana queries simple.

Summary of decisions
- WORKER_ID source = OS process ID (PID)
- Always include namespace in grouping labels
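
What this changes
- Every push to Prometheus Pushgateway now includes a grouping_key with:
  - instance = POD_NAME (fallback to HOSTNAME, then "dev")
  - namespace = POD_NAMESPACE (fallback to ENVIRONMENT, then "dev")
  - process = WORKER_ID (fallback to current PID)
- Prometheus will expose these as exported_instance, exported_namespace, and exported_process on the scraped series. Prometheus adds the exported_ prefix when a pushed label collides with a target label and honor_labels is disabled for the scrape; a label with no collision keeps its pushed name, so confirm the actual label names (for example process versus exported_process) with the verification steps in this document.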
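The fallback chain above can be sketched roughly as follows. This is a minimal illustration, not the code in common/utils/business_event.py: the helper names build_grouping_key and push_metrics are hypothetical, and only push_to_gateway from the prometheus_client library is a real call.

```python
import os


def build_grouping_key(env=None, pid=None):
    """Return the grouping labels, applying each documented fallback in order."""
    env = os.environ if env is None else env
    pid = os.getpid() if pid is None else pid
    return {
        # Pod name from the Downward API, else the host name, else "dev".
        "instance": env.get("POD_NAME") or env.get("HOSTNAME") or "dev",
        # Pod namespace, else the ENVIRONMENT variable, else "dev".
        "namespace": env.get("POD_NAMESPACE") or env.get("ENVIRONMENT") or "dev",
        # Explicit worker id, else the current PID; label values must be strings.
        "process": str(env.get("WORKER_ID") or pid),
    }


def push_metrics(gateway, job, registry):
    """Hypothetical wrapper: push a registry under the per-process grouping key."""
    # Imported lazily so build_grouping_key stays usable without prometheus_client.
    from prometheus_client import push_to_gateway

    push_to_gateway(gateway, job=job, registry=registry,
                    grouping_key=build_grouping_key())
```

With no Kubernetes variables set (for example under Podman), build_grouping_key({}, pid=42) yields instance "dev", namespace "dev", and process "42", so two workers on the same host still land in separate groups.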

Code changes (already implemented)
- common/utils/business_event.py
  - push_to_gateway(..., grouping_key={instance, namespace, process})
  - Safe fallbacks ensure dev/test (Podman) keeps working with no K8s-specific env vars.

Kubernetes manifests (already implemented)
- All Deployments that push metrics set env vars via Downward API:
  - POD_NAME from metadata.name
  - POD_NAMESPACE from metadata.namespace
- Files updated:
  - scaleway/manifests/base/applications/frontend/eveai-app/deployment.yaml
  - scaleway/manifests/base/applications/frontend/eveai-api/deployment.yaml
  - scaleway/manifests/base/applications/frontend/eveai-chat-client/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-workers/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-entitlements/deployment.yaml

No changes needed to secrets
- PUSH_GATEWAY_HOST/PORT remain provided via eveai-secrets; code composes PUSH_GATEWAY_URL internally.

How to verify
1) Pushgateway contains per-pod/process groups
   - Port-forward Pushgateway (namespace monitoring):
     - kubectl -n monitoring port-forward svc/monitoring-pushgateway-prometheus-pushgateway 9091:9091
   - Inspect (the API wraps the groups in a top-level data array):
     - curl -s http://127.0.0.1:9091/api/v1/metrics | jq '.data[].labels'
   - You should see labels including job (your service), instance (pod), namespace, process (pid).
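The same check can be scripted. The sketch below assumes the response shape documented for the Pushgateway query API (a top-level data list whose entries each carry a labels object). groups_missing_labels and fetch_groups are hypothetical helper names, and the URL is the port-forwarded address used above.

```python
import json
from urllib.request import urlopen

REQUIRED = ("instance", "namespace", "process")


def groups_missing_labels(payload, required=REQUIRED):
    """Return (labels, missing) pairs for groups that lack a required grouping label."""
    problems = []
    for group in payload.get("data", []):
        labels = group.get("labels", {})
        missing = [name for name in required if not labels.get(name)]
        if missing:
            problems.append((labels, missing))
    return problems


def fetch_groups(base_url="http://127.0.0.1:9091"):
    """Fetch all metric groups from a port-forwarded Pushgateway."""
    with urlopen(f"{base_url}/api/v1/metrics", timeout=10) as resp:
        return json.load(resp)
```

Calling groups_missing_labels(fetch_groups()) while the port-forward is active should return an empty list once every pusher goes through the shared wrapper; any entry in the result names a group that can still be overwritten.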
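
2) Prometheus shows the labels as exported_*
   - Port-forward Prometheus (namespace monitoring):
     - kubectl -n monitoring port-forward svc/monitoring-prometheus 9090:9090
   - Queries (label_values is a Grafana variable function, so run these as Grafana variable queries against this data source; it is not valid PromQL in the Prometheus UI):
     - label_values(eveai_llm_calls_total, exported_instance)
     - label_values(eveai_llm_calls_total, exported_namespace)
     - label_values(eveai_llm_calls_total, exported_process)
   - In the Prometheus UI, an equivalent PromQL check is count by (exported_instance, exported_namespace, exported_process) (eveai_llm_calls_total).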

PromQL query patterns
- PromQL accepts either by or without on a single aggregation, not both; the patterns below use by.
- Hide per-process detail by aggregating over the remaining labels:
  - sum by (exported_job, exported_instance, exported_namespace) (rate(eveai_llm_calls_total[5m]))
- Service-level totals (hide instance and process):
  - sum by (exported_job, exported_namespace) (rate(eveai_llm_calls_total[5m]))
- Histogram example (p95 per service):
  - histogram_quantile(0.95, sum by (le, exported_job, exported_namespace) (rate(eveai_llm_duration_seconds_bucket[5m])))
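To run such a query outside the UI, the Prometheus HTTP API can be called directly. The sketch below is one possible way to do that against the port-forwarded address from the verification steps. The function names are hypothetical, the query is a service-level total written with a single by clause, and the label names assume the exported_ prefix described earlier.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def service_rate_query(metric, window="5m"):
    """Build a per-service rate query that aggregates away instance and process."""
    return f"sum by (exported_job, exported_namespace) (rate({metric}[{window}]))"


def query_url(promql, base_url="http://127.0.0.1:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def run_query(promql, base_url="http://127.0.0.1:9090"):
    """Execute the query and return the list of result series."""
    with urlopen(query_url(promql, base_url), timeout=10) as resp:
        return json.load(resp)["data"]["result"]
```

For example, run_query(service_rate_query("eveai_llm_calls_total")) returns one series per service and namespace, regardless of how many pods or worker processes pushed.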

Dev/Test (Podman) behavior
- No Kubernetes Downward API: POD_NAME/POD_NAMESPACE are not set.
- Fallbacks used by the code:
  - instance = HOSTNAME if available, else "dev"
  - namespace = ENVIRONMENT if available, else "dev"
  - process = current PID
- This guarantees no crashes and still avoids process-level overwrites.

Operational notes
- Cardinality: adding process creates more series (one per worker). This is required to avoid data loss when multiple workers push concurrently. Dashboards should aggregate away exported_process unless you need per-worker detail.
- Batch jobs (future): use the same grouping and consider delete_from_gateway on successful completion to remove stale groups for that job/instance/process.
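A possible shape for that batch-job cleanup is sketched below. cleanup_group and run_batch are hypothetical names; delete_from_gateway is the prometheus_client helper and must be called with the same job and grouping key that were used for the push. The delete parameter exists only so the sketch can be exercised without a running Pushgateway.

```python
def cleanup_group(gateway, job, grouping_key, delete=None):
    """Delete one job/instance/namespace/process group from the Pushgateway."""
    if delete is None:
        # Imported lazily; delete_from_gateway comes from prometheus_client.
        from prometheus_client import delete_from_gateway as delete
    delete(gateway, job=job, grouping_key=grouping_key)


def run_batch(work, gateway, job, grouping_key, delete=None):
    """Run a batch callable and remove its group only if it finishes without error."""
    result = work()  # An exception here skips the cleanup, leaving the group in place.
    cleanup_group(gateway, job, grouping_key, delete=delete)
    return result
```

Deleting a group immediately after the final push can remove it before Prometheus has scraped it, so a real implementation would probably need to wait at least one scrape interval first.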

Troubleshooting
- If you still see overwriting:
  - Confirm that instance, namespace, and process all appear in Pushgateway JSON labels for each group.
  - Ensure that all pods set POD_NAME and POD_NAMESPACE (kubectl -n eveai-staging exec <pod> -- env | egrep "POD_NAME|POD_NAMESPACE").
  - Verify that your app processes run push_to_gateway through the shared business_event wrapper.

Change log reference
- Implemented on 2025-09-26 by adding grouping_key in business_event push and env vars in Deployments.
@@ -119,7 +119,7 @@ helm search repo prometheus-community/kube-prometheus-stack

 #### Create Monitoring Values File

-Create `scaleway/manifests/base/monitoring/prometheus-values.yaml`:
+Create `scaleway/manifests/base/monitoring/values-monitoring.yaml`:

 #### Deploy Monitoring Stack

@@ -133,7 +133,8 @@ helm install monitoring prometheus-community/kube-prometheus-stack \
 # Install pushgateway
 helm install monitoring-pushgateway prometheus-community/prometheus-pushgateway \
   -n monitoring --create-namespace \
-  --set serviceMonitor.enabled=true
+  --set serviceMonitor.enabled=true \
+  --set serviceMonitor.additionalLabels.release=monitoring

 # Monitor deployment progress
 kubectl get pods -n monitoring -w