eveAI/documentation/PUSHGATEWAY_GROUPING.md
Josako fa452e4934 - Change manifests for Prometheus installation
- Change instructions for deploying Prometheus stack and Pushgateway
- Additional grouping to pushgateway to avoid overwriting of metrics in different pods / processes
- Bugfix to ensure good retrieval of CSS and JS files in eveai_app
2025-09-30 14:56:08 +02:00

Pushgateway Grouping Keys (instance, namespace, process)

Goal: prevent metrics pushed by different Pods or worker processes from overwriting each other, while keeping Prometheus/Grafana queries simple.
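
To make the overwrite problem concrete, here is a toy model of how Pushgateway stores pushes. It is not the real server: it only assumes that a group is identified by the job name plus its grouping labels, and that a replacing push (as `push_to_gateway` performs) discards what the group held before. The job name `eveai_workers` and the pod and process values are hypothetical.

```python
# Toy model of Pushgateway group semantics (not the real server).
# A group is identified by the job name plus its grouping labels, and a
# replacing push overwrites everything previously stored in that group.

def group_id(job, grouping_key):
    """Return a hashable identity for a group: job plus sorted labels."""
    return (job, tuple(sorted(grouping_key.items())))


def push(store, job, grouping_key, metrics):
    """Replace the whole group, as a replacing push to Pushgateway would."""
    store[group_id(job, grouping_key)] = dict(metrics)


# Two workers in the same pod push without a process label: one shared group,
# so the second push wipes out the first.
coarse = {}
push(coarse, "eveai_workers", {"instance": "pod-a"}, {"eveai_llm_calls_total": 3})
push(coarse, "eveai_workers", {"instance": "pod-a"}, {"eveai_llm_calls_total": 5})

# The same pushes with a process label: two groups, nothing is lost.
fine = {}
push(fine, "eveai_workers", {"instance": "pod-a", "process": "41"},
     {"eveai_llm_calls_total": 3})
push(fine, "eveai_workers", {"instance": "pod-a", "process": "42"},
     {"eveai_llm_calls_total": 5})

print(len(coarse), len(fine))
```

In the first store only the last worker's value survives; in the second both values are kept and can be summed at query time.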

Summary of decisions

  • WORKER_ID source = OS process ID (PID)
  • Always include namespace in grouping labels

What this changes

  • Every push to Prometheus Pushgateway now includes a grouping_key with:
    • instance = POD_NAME (fallback to HOSTNAME, then "dev")
    • namespace = POD_NAMESPACE (fallback to ENVIRONMENT, then "dev")
    • process = WORKER_ID (fallback to current PID)
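  • Prometheus exposes these on the scraped series with an exported_ prefix (exported_instance, exported_namespace, exported_process) whenever the pushed label collides with a label Prometheus attaches to the scrape target and honor_labels is false for the Pushgateway scrape job. A label that does not collide keeps its original name; if that is the case for process in your setup, substitute process for exported_process in the queries in this document.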
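
The fallback chain above can be sketched as a small helper. This is a minimal illustration, not the code in common/utils/business_event.py: the function name `build_grouping_key` is hypothetical, and treating an empty env var the same as an unset one is an assumption.

```python
import os


def build_grouping_key(env=None):
    """Build the grouping key with the fallbacks described above.

    Hypothetical helper: an empty value is treated like a missing one.
    """
    env = os.environ if env is None else env
    return {
        "instance": env.get("POD_NAME") or env.get("HOSTNAME") or "dev",
        "namespace": env.get("POD_NAMESPACE") or env.get("ENVIRONMENT") or "dev",
        # Label values must be strings, so the PID is converted explicitly.
        "process": env.get("WORKER_ID") or str(os.getpid()),
    }
```

Passing a plain dict instead of `os.environ` makes the fallbacks easy to check, for example `build_grouping_key({"HOSTNAME": "laptop"})` yields instance `laptop` and namespace `dev`.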

Code changes (already implemented)

  • common/utils/business_event.py
    • push_to_gateway(..., grouping_key={instance, namespace, process})
    • Safe fallbacks ensure dev/test (Podman) keeps working with no K8s-specific env vars.
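
As a rough sketch of what such a wrapper might look like: the function below forwards a grouping key to `push_to_gateway` from the `prometheus_client` package and refuses to push if one of the three labels is missing. The name `push_metrics` and the `pusher` parameter are hypothetical and exist only for illustration; the actual wrapper in business_event.py may be shaped differently.

```python
REQUIRED_LABELS = {"instance", "namespace", "process"}


def push_metrics(gateway_url, job, registry, grouping_key, pusher=None):
    """Push a registry to Pushgateway under an explicit grouping key.

    `pusher` is a hypothetical hook so the call can be exercised without
    a live gateway; by default the real prometheus_client function is used.
    """
    if pusher is None:
        # Imported lazily so this sketch loads without prometheus_client.
        from prometheus_client import push_to_gateway as pusher
    missing = REQUIRED_LABELS - set(grouping_key)
    if missing:
        # Failing loudly is a design choice of this sketch: a push without
        # the full key would silently share a group with other workers.
        raise ValueError(f"grouping_key is missing: {sorted(missing)}")
    pusher(gateway_url, job=job, registry=registry, grouping_key=grouping_key)
```

Routing every push through one function like this keeps the grouping labels consistent across services, which is what the troubleshooting steps later in this document rely on.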

Kubernetes manifests (already implemented)

  • All Deployments that push metrics set env vars via Downward API:
    • POD_NAME from metadata.name
    • POD_NAMESPACE from metadata.namespace
  • Files updated:
    • scaleway/manifests/base/applications/frontend/eveai-app/deployment.yaml
    • scaleway/manifests/base/applications/frontend/eveai-api/deployment.yaml
    • scaleway/manifests/base/applications/frontend/eveai-chat-client/deployment.yaml
    • scaleway/manifests/base/applications/backend/eveai-workers/deployment.yaml
    • scaleway/manifests/base/applications/backend/eveai-chat-workers/deployment.yaml
    • scaleway/manifests/base/applications/backend/eveai-entitlements/deployment.yaml

No changes needed to secrets

  • PUSH_GATEWAY_HOST/PORT remain provided via eveai-secrets; code composes PUSH_GATEWAY_URL internally.

How to verify

  1. Pushgateway contains per-pod/process groups

    • Port-forward Pushgateway (namespace monitoring):
      • kubectl -n monitoring port-forward svc/monitoring-pushgateway-prometheus-pushgateway 9091:9091
    • Inspect:
  2. Prometheus shows the labels as exported_*

    • Port-forward Prometheus (namespace monitoring):
      • kubectl -n monitoring port-forward svc/monitoring-prometheus 9090:9090
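    • Queries (label_values is a Grafana variable function, not PromQL, so run these as Grafana variable queries):
      • label_values(eveai_llm_calls_total, exported_instance)
      • label_values(eveai_llm_calls_total, exported_namespace)
      • label_values(eveai_llm_calls_total, exported_process)
    • Equivalent check in the Prometheus UI, one label at a time:
      • count by (exported_instance) (eveai_llm_calls_total)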
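
The same label check can also be scripted against the Prometheus HTTP API, which offers a label-values endpoint that accepts a `match[]` series selector. The sketch below assumes the port-forward from step 2 is active on localhost:9090; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def label_values_url(base_url, label, metric):
    """URL of the Prometheus label-values endpoint, restricted to one metric."""
    query = urlencode({"match[]": metric})
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


def fetch_label_values(base_url, label, metric, timeout=10):
    """Return the values Prometheus has stored for `label` on `metric`."""
    url = label_values_url(base_url, label, metric)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["data"]
```

With the port-forward running, `fetch_label_values("http://localhost:9090", "exported_instance", "eveai_llm_calls_total")` should list one value per pod that has pushed.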

PromQL query patterns

  • Hide per-process detail by aggregating away exported_process (PromQL accepts either by or without on an aggregation, not both):
    • sum by (exported_job, exported_instance, exported_namespace) (rate(eveai_llm_calls_total[5m]))
  • Service-level totals (hide instance and process):
    • sum by (exported_job, exported_namespace) (rate(eveai_llm_calls_total[5m]))
  • Histogram example (p95 per service):
    • histogram_quantile(0.95, sum by (le, exported_job, exported_namespace) (rate(eveai_llm_duration_seconds_bucket[5m])))
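
For readers less familiar with PromQL aggregation, the effect of grouping on a subset of labels can be illustrated in plain Python. This is only a worked example of the arithmetic, not how Prometheus evaluates queries; the sample values and pod names are made up.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum sample values, grouping only on the labels listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Per-worker rates as they might look after rate(...): made-up numbers.
samples = [
    ({"exported_instance": "pod-a", "exported_process": "41"}, 0.5),
    ({"exported_instance": "pod-a", "exported_process": "42"}, 1.5),
    ({"exported_instance": "pod-b", "exported_process": "41"}, 2.0),
]

# Keeping only exported_instance folds the two pod-a workers together.
per_pod = sum_by(samples, ["exported_instance"])
```

Here the two workers of pod-a collapse into a single series with value 2.0, which is what the per-pod query pattern does to the per-process series.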

Dev/Test (Podman) behavior

  • No Kubernetes Downward API: POD_NAME/POD_NAMESPACE are not set.
  • Fallbacks used by the code:
    • instance = HOSTNAME if available, else "dev"
    • namespace = ENVIRONMENT if available, else "dev"
    • process = current PID
  • With these fallbacks the push works without any Kubernetes-specific env vars and still avoids process-level overwrites.

Operational notes

  • Cardinality: adding process creates more series (one per worker). This is required to avoid data loss when multiple workers push concurrently. Dashboards should aggregate away exported_process unless you need per-worker detail.
  • Batch jobs (future): use the same grouping and consider delete_from_gateway on successful completion to remove stale groups for that job/instance/process.
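
A possible shape for the batch-job cleanup, using only the standard library: Pushgateway removes a group when it receives an HTTP DELETE on that group's path, which is roughly what `delete_from_gateway` in `prometheus_client` does. The helper names and the job name in the usage note are hypothetical, and this sketch deliberately rejects label values that would need Pushgateway's base64 path form.

```python
from urllib.parse import quote
from urllib.request import Request


def group_path(job, grouping_key):
    """Build the Pushgateway path that identifies one group."""
    parts = ["metrics", "job", quote(job, safe="")]
    for name, value in sorted(grouping_key.items()):
        if value == "" or "/" in value:
            # Such values need the base64 path form, not handled here.
            raise ValueError(f"unsupported value for label {name!r}")
        parts += [name, quote(value, safe="")]
    return "/" + "/".join(parts)


def delete_group_request(gateway_url, job, grouping_key):
    """Build (but do not send) the DELETE request that removes one group."""
    url = gateway_url.rstrip("/") + group_path(job, grouping_key)
    return Request(url, method="DELETE")
```

A batch job would build the request with the same job name and grouping key it pushed with, for example `delete_group_request("http://localhost:9091", "nightly_batch", key)`, and send it with `urllib.request.urlopen` once it has finished successfully.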

Troubleshooting

  • If you still see overwriting:
    • Confirm that instance, namespace, and process all appear in Pushgateway JSON labels for each group.
    • Ensure that all pods set POD_NAME and POD_NAMESPACE (kubectl -n eveai-staging exec <pod-name> -- env | grep -E "POD_NAME|POD_NAMESPACE").
    • Verify that your app processes run push_to_gateway through the shared business_event wrapper.
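
The first check can be automated against Pushgateway's JSON API (`/api/v1/metrics`), where each group is returned with a `labels` object. The sketch below assumes the Pushgateway port-forward on localhost:9091 and flags groups that lack any of the three grouping labels; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

REQUIRED = ("instance", "namespace", "process")


def groups_missing_labels(payload, required=REQUIRED):
    """Return the label sets of groups that lack a required grouping label."""
    incomplete = []
    for group in payload.get("data", []):
        labels = group.get("labels", {})
        if any(name not in labels for name in required):
            incomplete.append(labels)
    return incomplete


def fetch_groups(base_url="http://localhost:9091", timeout=10):
    """Download the current groups from Pushgateway's JSON API."""
    with urlopen(f"{base_url}/api/v1/metrics", timeout=timeout) as resp:
        return json.load(resp)
```

Running `groups_missing_labels(fetch_groups())` with the port-forward active returns an empty list when every group carries all three labels; any entry it does return points at a process that pushes outside the shared wrapper.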

Change log reference

  • Implemented on 2025-09-26 by adding grouping_key in business_event push and env vars in Deployments.