Josako 84a9334c80 - Functional control plan

2025-08-18 11:44:23 +02:00


Containerd CRI Plugin Troubleshooting Guide

Date: 18 August 2025
Author: EveAI Development Team
Version: 1.0

Overview

This document describes the fix for a critical problem with the containerd Container Runtime Interface (CRI) plugin in the EveAI Kubernetes development cluster. The problem prevented Kind clusters from starting successfully and left the Kubernetes nodes non-functional.

Problem Description

Symptoms

The EveAI development cluster showed the following problems:

Kind cluster creation failed with complex kubeadmConfigPatches
Control-plane nodes remained in NotReady status
The container runtime reported Unknown status
The kubelet could not communicate with the container runtime
Ingress pods could not be scheduled
The cluster was completely non-functional

Error Messages

Primary Error - Containerd CRI Plugin

failed to create CRI service: failed to create cni conf monitor for default: 
failed to create fsnotify watcher: too many open files

Kubelet Communication Errors

rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService

Node Status Problems

NAME                              STATUS     ROLES           AGE   VERSION
eveai-dev-cluster-control-plane   NotReady   control-plane   5m    v1.33.1

Root Cause Analysis

Main Cause

The problem had two main components:

Complex Kind Configuration: The original kind-dev-cluster.yaml contained complex kubeadmConfigPatches and containerdConfigPatches that disrupted cluster initialization.
File Descriptor Limits: The containerd service could not create an fsnotify watcher for CNI configuration monitoring because it hit "too many open files" limits inside the Kind container environment. With fsnotify, this error commonly indicates exhausted inotify limits (fs.inotify.max_user_instances, fs.inotify.max_user_watches) rather than the per-process file descriptor limit alone.
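
To see how close a machine is to these limits, the current inotify settings can be compared with the values the setup script writes later in this guide (1024 instances, 524288 watches). The following is a minimal sketch; limit_status and report_inotify_limits are hypothetical helper names and are not part of setup-dev-cluster.sh.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, thresholds mirror the setup script.

# Print "ok" when the current value meets the required minimum, else "low".
limit_status() {
    if [ "$1" -ge "$2" ]; then
        echo "ok"
    else
        echo "low"
    fi
}

# Report both inotify limits as seen through /proc (Linux only).
report_inotify_limits() {
    local entry name required current
    for entry in max_user_instances:1024 max_user_watches:524288; do
        name="${entry%%:*}"
        required="${entry#*:}"
        current="$(cat "/proc/sys/fs/inotify/${name}" 2>/dev/null)"
        if [ -z "$current" ]; then
            echo "fs.inotify.${name}: unknown"
        else
            echo "fs.inotify.${name}: ${current} ($(limit_status "$current" "$required"))"
        fi
    done
}

report_inotify_limits
```

Running the same function inside the node (for example through podman exec) shows the values the node itself sees, which may differ from the host.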

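Technical Details

Kind Configuration Problems

The original configuration contained the following. Note that nodeRegistration is placed under ClusterConfiguration here, whereas kubeadm defines it on InitConfiguration and JoinConfiguration: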

kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    etcd:
      local:
        dataDir: /tmp/lib/etcd
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
        authorization-mode: "Webhook"
        feature-gates: "EphemeralContainers=true"

Containerd CRI Plugin Failure

The containerd service did start, but the CRI plugin failed while loading:

Service Status: active (running)
CRI Plugin: failed to load
Consequence: The kubelet could not communicate with the container runtime
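
Because the service itself reports active (running), the failure only shows up in the logs. A small classifier over log lines can make the two error messages quoted above easier to spot. This is a sketch under assumptions: classify_cri_log and its labels are hypothetical, and only the two messages from this guide are recognized.

```shell
#!/usr/bin/env bash
# Sketch only: function name and labels are hypothetical.

# Map a single log line to a short diagnosis label.
classify_cri_log() {
    case "$1" in
        *"failed to create fsnotify watcher: too many open files"*)
            echo "inotify-or-fd-limit" ;;
        *"unknown service runtime.v1.RuntimeService"*)
            echo "cri-plugin-not-loaded" ;;
        *)
            echo "unclassified" ;;
    esac
}

# Classify every line on stdin and print a count per label.
summarize_cri_logs() {
    local line
    while IFS= read -r line; do
        classify_cri_log "$line"
    done | sort | uniq -c
}
```

A possible use is to pipe the containerd journal of the node into it: podman exec eveai-dev-cluster-control-plane journalctl -u containerd --no-pager -n 50 | summarize_cri_logs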

Solution Implementation

Step 1: Simplifying the Kind Configuration

Problem: Complex kubeadmConfigPatches caused initialization problems.

Solution: The configuration was reduced to a minimal, working setup:

# Before: complex configuration
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    etcd:
      local:
        dataDir: /tmp/lib/etcd
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
        authorization-mode: "Webhook"
        feature-gates: "EphemeralContainers=true"

# After: simplified configuration
kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"

Step 2: Disabling the Containerd ConfigPatches

Problem: Registry configuration patches caused containerd startup problems.

Solution: Temporarily disabled for stability:

# Temporarily disabled for testing
# containerdConfigPatches:
# - |-
#   [plugins."io.containerd.grpc.v1.cri".registry]
#     config_path = "/etc/containerd/certs.d"

Step 3: Setup Script Improvements

A. Container Limits Configuration Function

Added to setup-dev-cluster.sh:

# Configure container resource limits to prevent CRI issues
configure_container_limits() {
    print_status "Configuring container resource limits..."
    
    # Configure file descriptor and inotify limits to prevent CRI plugin failures
    podman exec "${CLUSTER_NAME}-control-plane" sh -c '
        echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.conf
        echo "fs.inotify.max_user_watches = 524288" >> /etc/sysctl.conf
        echo "fs.file-max = 2097152" >> /etc/sysctl.conf
        sysctl -p
    '
    
    # Restart containerd to apply new limits
    print_status "Restarting containerd with new limits..."
    podman exec "${CLUSTER_NAME}-control-plane" systemctl restart containerd
    
    # Wait for containerd to stabilize
    sleep 10
    
    # Restart kubelet to ensure proper CRI communication
    podman exec "${CLUSTER_NAME}-control-plane" systemctl restart kubelet
    
    print_success "Container limits configured and services restarted"
}
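
The function above appends to /etc/sysctl.conf on every run, so calling it again against the same node would duplicate the entries. One way to avoid that is to append a key only when it is not yet present. This is a sketch, assuming a helper named set_sysctl_once (hypothetical, not part of setup-dev-cluster.sh) that operates on any sysctl-style file.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Succeed if the file already contains a line assigning the exact key.
sysctl_key_present() {
    awk -F= -v key="$2" '
        { k = $1; gsub(/[ \t]/, "", k); if (k == key) found = 1 }
        END { exit !found }
    ' "$1" 2>/dev/null
}

# Append "key = value" only when the key is not set in the file yet.
set_sysctl_once() {
    local file="$1" key="$2" value="$3"
    if sysctl_key_present "$file" "$key"; then
        return 0
    fi
    echo "${key} = ${value}" >> "$file"
}
```

Inside the node this could be called as set_sysctl_once /etc/sysctl.conf fs.inotify.max_user_instances 1024 before running sysctl -p.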

B. CRI Status Verification Function

# Verify CRI status and functionality
verify_cri_status() {
    print_status "Verifying CRI status..."
    
    # Wait for services to stabilize
    sleep 15
    
    # Test CRI connectivity
    if podman exec "${CLUSTER_NAME}-control-plane" crictl version &>/dev/null; then
        print_success "CRI is functional"
        
        # Show CRI version info
        print_status "CRI version information:"
        podman exec "${CLUSTER_NAME}-control-plane" crictl version
    else
        print_error "CRI is not responding - checking containerd logs"
        podman exec "${CLUSTER_NAME}-control-plane" journalctl -u containerd --no-pager -n 20
        
        print_error "Checking kubelet logs"
        podman exec "${CLUSTER_NAME}-control-plane" journalctl -u kubelet --no-pager -n 10
        
        return 1
    fi
    
    # Verify node readiness
    print_status "Waiting for node to become Ready..."
    local max_attempts=30
    local attempt=0
    
    while [ $attempt -lt $max_attempts ]; do
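        # Match the STATUS column exactly; a plain grep for "Ready" also matches "NotReady"
        if kubectl get nodes --no-headers | awk '$2 == "Ready" { ok = 1 } END { exit !ok }'; then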
            print_success "Node is Ready"
            return 0
        fi
        
        attempt=$((attempt + 1))
        print_status "Attempt $attempt/$max_attempts - waiting for node readiness..."
        sleep 10
    done
    
    print_error "Node failed to become Ready within timeout"
    kubectl get nodes -o wide
    return 1
}
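
When checking node readiness from kubectl output, the STATUS column has to be compared exactly, because the string "NotReady" contains "Ready". A stricter check that requires every node to be Ready can be sketched as follows; all_nodes_ready is a hypothetical helper that reads the output of kubectl get nodes --no-headers on stdin.

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical.

# Succeed only if stdin lists at least one node and every node has
# STATUS exactly "Ready" (second column of `kubectl get nodes --no-headers`).
all_nodes_ready() {
    awk '
        NF >= 2 { total++; if ($2 == "Ready") ready++ }
        END { exit !(total > 0 && ready == total) }
    '
}
```

Usage: kubectl get nodes --no-headers | all_nodes_ready. As an alternative to a hand-written loop, kubectl wait --for=condition=Ready node --all --timeout=300s blocks until the nodes report Ready or the timeout expires.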

C. Main Execution Update

# Main execution
main() {
    # ... existing code ...
    
    check_prerequisites
    create_host_directories
    create_cluster
    configure_container_limits    # ← newly added
    verify_cri_status            # ← newly added
    install_ingress_controller
    apply_manifests
    verify_cluster
    
    # ... rest of function ...
}

Results

✅ Successful Fixes

Cluster Creation: Kind clusters are now created successfully
Node Status: Control-plane nodes reach Ready status
CRI Functionality: The container runtime communicates correctly with the kubelet
Basic Kubernetes Operations: Deployments, services, and pods work correctly

⚠️ Remaining Limitations

Ingress Controller Problem: The NGINX Ingress controller still hits "too many open files" errors because of file descriptor limits that cannot be changed inside the Kind container environment.

Error message:

too many open files

Cause: This is a limitation of the Kind/Podman setup, in which kernel parameters cannot be changed from inside containers.
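
To confirm which open-files limit a process actually runs with, the limits file under /proc can be read. The sketch below extracts the soft limit from text in the /proc/<pid>/limits format; soft_nofile_limit is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical.

# Print the soft "Max open files" limit from /proc/<pid>/limits-style
# text on stdin. Columns: name (three words), soft limit, hard limit, unit.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}
```

For the current shell on Linux: soft_nofile_limit < /proc/self/limits. For containerd inside the node, assuming pidof is available in the node image, the file /proc/$(pidof containerd)/limits can be read through podman exec and piped into the same function.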

Troubleshooting Commands

Diagnose Commands

# Check containerd status
ssh minty "podman exec eveai-dev-cluster-control-plane systemctl status containerd"

# Follow the containerd logs
ssh minty "podman exec eveai-dev-cluster-control-plane journalctl -u containerd -f"

# Test CRI connectivity
ssh minty "podman exec eveai-dev-cluster-control-plane crictl version"

# Check file descriptor usage
ssh minty "podman exec eveai-dev-cluster-control-plane sh -c 'lsof | wc -l'"

# Check node status
kubectl get nodes -o wide

# Check the kubelet logs
ssh minty "podman exec eveai-dev-cluster-control-plane journalctl -u kubelet --no-pager -n 20"
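
Note that lsof | wc -l counts every row lsof prints, which can include entries that are not file descriptors (such as memory-mapped files and working directories), so it is a rough indicator at best. A per-process count can be taken from /proc instead. This is a Linux-only sketch; count_fds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical; requires a Linux /proc filesystem.

# Print the number of open file descriptors of the given PID.
# Prints 0 when the process does not exist or is not readable.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

echo "open descriptors of this shell: $(count_fds $$)"
```

Comparing this number with the soft limit of the same process shows how much headroom it has left.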

Cluster Management

# Delete the cluster (with the Podman provider)
KIND_EXPERIMENTAL_PROVIDER=podman kind delete cluster --name eveai-dev-cluster

# Create a new cluster
cd /path/to/k8s/dev && ./setup-dev-cluster.sh

# Check cluster status
kubectl get all -n eveai-dev

Preventive Measures

1. Configuration Validation

Minimal Kind Configuration: Use only the kubeadmConfigPatches that are necessary
Incremental Extension: Add complex configuration gradually
Testing: Test each configuration change in isolation

2. Monitoring

Health Checks: Implement extensive CRI status checks
Logging: Monitor the containerd and kubelet logs for early warnings
Automatic Recovery: Implement automatic restart procedures

3. Documentation

Configuration History: Document all configuration changes
Troubleshooting Procedures: Keep the troubleshooting guides up to date
Known Issues: Track known limitations and workarounds
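
A building block for such health checks and recovery procedures is a generic retry helper, which can also replace fixed sleep calls with polling. The following is a minimal sketch; retry is a hypothetical helper, not an existing function in setup-dev-cluster.sh.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and argument order are assumptions.

# Run a command until it succeeds.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 on the first success, 1 if all attempts fail.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, retry 30 10 podman exec "${CLUSTER_NAME}-control-plane" crictl version polls the CRI for up to about five minutes and returns as soon as it responds.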

Recommendations for Production

1. Infrastructure Alternatives

For production environments where Ingress controllers are essential:

Full VM Setup: Use real virtual machines where kernel parameters can be configured
Bare-metal Kubernetes: Deploy on physical hardware for full control
Managed Kubernetes: Consider cloud-managed solutions (EKS, GKE, AKS)

2. Host-level Configuration

# On the host machine (minty)
sudo mkdir -p /etc/systemd/system/user@.service.d/
sudo tee /etc/systemd/system/user@.service.d/limits.conf << EOF
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EOF
sudo systemctl daemon-reload
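
After changing the drop-in, it is worth verifying that a session actually picks up the higher limit. The check below is a sketch; nofile_limit_ok is a hypothetical helper that compares the soft open-files limit of the current shell with a required minimum. It likely needs to be run from a fresh login session, since a user manager that is already running may keep its old limits.

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical.

# Succeed if the soft open-files limit of this shell is at least $1.
nofile_limit_ok() {
    local current
    current="$(ulimit -n)"
    [ "$current" = "unlimited" ] || [ "$current" -ge "$1" ]
}

if nofile_limit_ok 1048576; then
    echo "open-files limit is at least 1048576"
else
    echo "open-files limit is $(ulimit -n), below 1048576"
fi
```

The value 1048576 mirrors the LimitNOFILE setting above; a lower threshold may be sufficient in practice.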

3. Alternative Ingress Controllers

Test other ingress controllers that may have lower file descriptor requirements:

Traefik
HAProxy Ingress
Istio Gateway

Conclusion

The containerd CRI plugin failure was resolved by:

Simplifying the Kind cluster configuration
Implementing the container resource limits configuration
Adding extensive CRI status verification
Improving error handling and diagnostics

The cluster is now fully functional for basic Kubernetes operations. The remaining Ingress controller limitation is a known limitation of the Kind/Podman environment and requires alternative solutions for production use.

Appendices

A. Changed Files

k8s/dev/setup-dev-cluster.sh - Added functions and improved workflow
k8s/dev/kind-dev-cluster.yaml - Simplified configuration
k8s/dev/kind-minimal.yaml - New minimal test configuration

B. Time Spent on the Fix (Estimate)

Problem Identification: 2-3 hours
Root Cause Analysis: 1-2 hours
Solution Implementation: 2-3 hours
Testing and Verification: 1-2 hours
Documentation: 1 hour
Total: 7-11 hours

C. Lessons Learned

Avoid Complexity: Start with minimal configurations and extend gradually
Systematic Diagnosis: Use structured troubleshooting approaches
Environment Limitations: Understand the limitations of containerized Kubernetes (Kind)
Monitoring Is Essential: Implement extensive health checks and logging
Documentation Is Crucial: Document all changes and procedures for future use
