# Pushgateway Grouping Keys (instance, namespace, process)

Goal: prevent metrics pushed by different Pods or worker processes from overwriting each other, while keeping Prometheus/Grafana queries simple.

## Summary of decisions

- WORKER_ID source = OS process ID (PID)
- Always include namespace in grouping labels

## What this changes

- Every push to Prometheus Pushgateway now includes a grouping_key with:
  - instance = POD_NAME (fallback to HOSTNAME, then "dev")
  - namespace = POD_NAMESPACE (fallback to ENVIRONMENT, then "dev")
  - process = WORKER_ID (fallback to current PID)
- Prometheus will expose these as exported_instance, exported_namespace, and exported_process on the scraped series. Prometheus applies the exported_ prefix only to pushed labels that collide with a label of the scrape target while honor_labels is false; a label that does not collide keeps its plain name, so check whether process shows up as exported_process or as process in your setup.

## Code changes (already implemented)

- common/utils/business_event.py
  - push_to_gateway(..., grouping_key={instance, namespace, process})
  - Safe fallbacks ensure dev/test (Podman) keeps working with no K8s-specific env vars.

## Kubernetes manifests (already implemented)

- All Deployments that push metrics set env vars via Downward API:
  - POD_NAME from metadata.name
  - POD_NAMESPACE from metadata.namespace
- Files updated:
  - scaleway/manifests/base/applications/frontend/eveai-app/deployment.yaml
  - scaleway/manifests/base/applications/frontend/eveai-api/deployment.yaml
  - scaleway/manifests/base/applications/frontend/eveai-chat-client/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-workers/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml
  - scaleway/manifests/base/applications/backend/eveai-entitlements/deployment.yaml

## No changes needed to secrets

- PUSH_GATEWAY_HOST/PORT remain provided via eveai-secrets; code composes PUSH_GATEWAY_URL internally.
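The grouping key, its fallbacks, and the gateway address described above can be sketched as follows. This is a minimal illustration, not the code in common/utils/business_event.py: the helper names `build_grouping_key`, `build_gateway_url`, and `push_metrics` are hypothetical, and the `host:port` form of the gateway address is an assumption about how PUSH_GATEWAY_URL is composed. Only `push_to_gateway` and its `grouping_key` argument come from the `prometheus_client` library.

```python
import os
from typing import Mapping, Optional


def build_grouping_key(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve instance/namespace/process using the documented fallback order."""
    env = os.environ if env is None else env
    return {
        # POD_NAME (Downward API) -> HOSTNAME -> "dev"
        "instance": env.get("POD_NAME") or env.get("HOSTNAME") or "dev",
        # POD_NAMESPACE (Downward API) -> ENVIRONMENT -> "dev"
        "namespace": env.get("POD_NAMESPACE") or env.get("ENVIRONMENT") or "dev",
        # WORKER_ID -> PID of the current process
        "process": env.get("WORKER_ID") or str(os.getpid()),
    }


def build_gateway_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Compose the gateway address from host and port (assumed host:port form)."""
    env = os.environ if env is None else env
    return f"{env['PUSH_GATEWAY_HOST']}:{env['PUSH_GATEWAY_PORT']}"


def push_metrics(job: str, registry, env: Optional[Mapping[str, str]] = None) -> None:
    """Push a registry under the per-pod, per-process grouping key."""
    # Imported lazily so the helpers above work without prometheus_client installed.
    from prometheus_client import push_to_gateway

    push_to_gateway(
        build_gateway_url(env),
        job=job,
        registry=registry,
        grouping_key=build_grouping_key(env),
    )
```

Because each distinct grouping key is its own group in Pushgateway, two workers in the same Pod that differ only in `process` no longer replace each other's metrics when they push.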
## How to verify

1) Pushgateway contains per-pod/process groups
   - Port-forward Pushgateway (namespace monitoring):
     - `kubectl -n monitoring port-forward svc/monitoring-pushgateway-prometheus-pushgateway 9091:9091`
   - Inspect (the API wraps the groups in a top-level data array):
     - `curl -s http://127.0.0.1:9091/api/v1/metrics | jq '.data[].labels'`
   - You should see labels including job (your service), instance (pod), namespace, process (pid).
2) Prometheus shows the labels as `exported_*`
   - Port-forward Prometheus (namespace monitoring):
     - `kubectl -n monitoring port-forward svc/monitoring-prometheus 9090:9090`
   - Queries (label_values is a Grafana variable query function, so run these as Grafana dashboard variables rather than in the Prometheus expression browser):
     - `label_values(eveai_llm_calls_total, exported_instance)`
     - `label_values(eveai_llm_calls_total, exported_namespace)`
     - `label_values(eveai_llm_calls_total, exported_process)`

## PromQL query patterns

An aggregation accepts either `by` or `without`, not both. The patterns below use `by`, which keeps only the listed labels and therefore already drops exported_process.

- Hide per-process detail by aggregating away exported_process:
  - `sum by (exported_job, exported_instance, exported_namespace) (rate(eveai_llm_calls_total[5m]))`
- Service-level totals (hide instance and process):
  - `sum by (exported_job, exported_namespace) (rate(eveai_llm_calls_total[5m]))`
- Histogram example (p95 per service):
  - `histogram_quantile(0.95, sum by (le, exported_job, exported_namespace) (rate(eveai_llm_duration_seconds_bucket[5m])))`

## Dev/Test (Podman) behavior

- No Kubernetes Downward API: POD_NAME/POD_NAMESPACE are not set.
- Fallbacks used by the code:
  - instance = HOSTNAME if available, else "dev"
  - namespace = ENVIRONMENT if available, else "dev"
  - process = current PID
- This guarantees no crashes and still avoids process-level overwrites.

## Operational notes

- Cardinality: adding process creates more series (one per worker). This is required to avoid data loss when multiple workers push concurrently. Dashboards should aggregate away exported_process unless you need per-worker detail.
- Batch jobs (future): use the same grouping and consider delete_from_gateway on successful completion to remove stale groups for that job/instance/process.

## Troubleshooting

- If you still see overwriting:
  - Confirm that instance, namespace, and process all appear in Pushgateway JSON labels for each group.
  - Ensure that all pods set POD_NAME and POD_NAMESPACE (`kubectl -n eveai-staging exec <pod> -- env | grep -E "POD_NAME|POD_NAMESPACE"`, with `<pod>` replaced by the name of the pod to check).
  - Verify that your app processes run push_to_gateway through the shared business_event wrapper.

## Change log reference

- Implemented on 2025-09-26 by adding grouping_key in business_event push and env vars in Deployments.
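The first troubleshooting check, confirming that every Pushgateway group carries instance, namespace, and process, can be automated with a small script. The sketch below assumes the `/api/v1/metrics` response wraps the groups in a top-level `data` array, each with a `labels` object, and that the port-forward from the verification steps is active on 127.0.0.1:9091. The function names `groups_missing_labels` and `fetch_groups` are hypothetical.

```python
import json
from typing import Iterable

REQUIRED_LABELS = ("instance", "namespace", "process")


def groups_missing_labels(payload: dict, required: Iterable[str] = REQUIRED_LABELS) -> list:
    """Return (labels, missing) pairs for every group lacking a required label."""
    required = tuple(required)
    problems = []
    for group in payload.get("data", []):
        labels = group.get("labels", {})
        # An absent or empty label value counts as missing.
        missing = [name for name in required if not labels.get(name)]
        if missing:
            problems.append((labels, missing))
    return problems


def fetch_groups(base_url: str = "http://127.0.0.1:9091") -> dict:
    """Download the current groups from a port-forwarded Pushgateway (assumed URL)."""
    from urllib.request import urlopen

    with urlopen(f"{base_url}/api/v1/metrics", timeout=5) as resp:
        return json.load(resp)
```

Calling `groups_missing_labels(fetch_groups())` returns an empty list when every group is fully labelled. Any entry it does return points at a process that pushes without the shared wrapper or a pod without the Downward API env vars.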