Monitoring Stack Resource Analysis

Date: October 23, 2025
System: kimchi homelab server
Status: Planning/analysis only — monitoring stack has not been deployed yet

Current System Status

System Specifications:

  • CPU: 4 cores
  • Memory: 7.6 GB total
  • Root Disk: 69 GB NVMe (/dev/nvme0n1p2)
  • Data Storage: 3.6 TB bcache (/mnt/bcache)

Current Usage:

  • Load: 0.52 (13% CPU on 4 cores)
  • Memory: 3.3 GB / 7.6 GB used (43%)
  • Available: 3.8 GB
  • Disk: 47 GB / 69 GB used (72% on root)
  • Running pods: 29 total

Top Memory Consumers:

  • K3s server: 687 MB (8.6%)
  • Jellyfin: 458 MB (5.7%)
  • MariaDB (Nextcloud): 330 MB (4.1%)
  • Home Assistant: 306 MB (3.8%)
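The figures above can be re-checked at any time with standard tools; a minimal sketch (the mount points and process names are this host's, adjust for yours):

```shell
# Snapshot of current usage on a Linux host
uptime                                    # load average
free -h                                   # memory used / available
df -h /                                   # root partition usage
ps -eo rss,comm --sort=-rss | head -n 5   # top memory consumers by RSS
```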

Prometheus + Grafana Resource Impact

For a minimal monitoring stack in this homelab setup:

Expected Resource Usage:

| Component | Memory | CPU | Notes |
|---|---|---|---|
| Prometheus | 400-600 MB | 200-400m (5-10%) | Main metrics database |
| Grafana | 150-250 MB | 100-200m (2-5%) | Visualization UI |
| Node Exporter | 20-50 MB | 50-100m (1-2%) | Per-node metrics |
| kube-state-metrics | 50-100 MB | 50-100m (1-2%) | K8s cluster metrics |
| AlertManager (optional) | 50-100 MB | 50m (<1%) | Alert routing |
| Total (minimal) | ~700-1100 MB | ~450-800m (11-20%) | |

Impact on System:

CPU Load Increase:

  • Current: 13% (0.52 load average)
  • After monitoring: 24-33% (0.96-1.32 load average)
  • Estimated increase: +11-20% (well within headroom)

Memory Impact:

  • Current: 3.3 GB used / 3.8 GB available
  • After monitoring: 4.0-4.4 GB used / 2.7-3.1 GB available
  • Estimated increase: +700-1100 MB (manageable, but less buffer)

Disk Impact:

  • Prometheus data: 2-5 GB for 15-day retention with ~30 pods
  • Root partition: Already at 72% (47 GB used of 69 GB)
  • Recommendation: Store Prometheus data on /mnt/bcache instead of root
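Note that k3s's bundled local-path provisioner writes to /var/lib/rancher/k3s/storage on the root partition by default, so it has to be pointed at the bcache mount first. One way is editing the config.json key of the local-path-config ConfigMap in kube-system; a sketch (the target directory is an assumption for this host):

```json
{
  "nodePathMap": [
    {
      "node": "DEFAULT_PATH_FOR_NON_LISTED_NODES",
      "paths": ["/mnt/bcache/k3s-storage"]
    }
  ]
}
```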

Minimal kube-prometheus-stack Setup

Helm chart: prometheus-community/kube-prometheus-stack

values.yaml (optimized for homelab):

# Prometheus configuration
prometheus:
  prometheusSpec:
    retention: 15d  # 15 days of metrics
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
      limits:
        memory: 1Gi
        cpu: 500m
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path  # local-path must be reconfigured to target /mnt/bcache
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

# Grafana configuration
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 100m
    limits:
      memory: 256Mi
      cpu: 200m
  persistence:
    enabled: true
    storageClassName: local-path
    size: 1Gi

# Node exporter (per-node metrics)
prometheus-node-exporter:
  resources:
    requests:
      memory: 30Mi
      cpu: 50m
    limits:
      memory: 50Mi
      cpu: 100m

# Kube-state-metrics (cluster metrics)
kube-state-metrics:
  resources:
    requests:
      memory: 64Mi
      cpu: 50m
    limits:
      memory: 128Mi
      cpu: 100m

# AlertManager (optional - disable if not needed)
alertmanager:
  enabled: false  # Can enable later if needed

Installation Commands

# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create monitoring namespace
kubectl create namespace monitoring

# Install with custom values
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f values.yaml

# Check installation
kubectl get pods -n monitoring
kubectl get svc -n monitoring

# Access Grafana (port-forward)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Default Grafana credentials
# Username: admin
# Password: prom-operator (check with: kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)

What You'll Get

Features:

  • Real-time CPU/memory/disk metrics for all pods and nodes
  • Historical data for 15 days
  • Pre-built dashboards for Kubernetes cluster overview
  • Pod resource usage tracking
  • Node health monitoring
  • Ability to troubleshoot performance issues
  • Optional alert notifications

Useful Dashboards:

  • Kubernetes Cluster Overview (ID: 315)
  • Kubernetes Pods Resource Usage (ID: 6336)
  • Node Exporter Full (ID: 1860)
  • K8s Cluster RAM and CPU Utilization (ID: 16734)
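With the Helm chart, these dashboards can be provisioned declaratively via the Grafana subchart instead of imported by hand; a values.yaml sketch (the provider name and folder path are assumptions):

```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860        # pulled from grafana.com by dashboard ID
        datasource: Prometheus
```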

Alternatives to Consider

If Resources Are Tight:

  1. Metrics Server Only

    • Resource usage: ~50 MB memory, minimal CPU
    • Provides: kubectl top nodes and kubectl top pods commands
    • No historical data, no dashboards
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
    
  2. Netdata

    • Resource usage: ~100-200 MB total
    • Lighter weight, simpler setup
    • Good for single-node clusters
    • Built-in web UI
  3. Prometheus + Remote Write

    • Run Prometheus locally but send metrics to external Grafana Cloud
    • Free tier available (10k series, 14-day retention)
    • Saves local resources
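For option 3, the kube-prometheus-stack values would gain a remoteWrite entry; a sketch, assuming you store Grafana Cloud credentials in a Secret (the URL and Secret name are placeholders):

```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://prometheus-prod-01-example.grafana.net/api/prom/push  # placeholder endpoint
        basicAuth:
          username:
            name: grafana-cloud-credentials  # Secret you create in the monitoring namespace
            key: username
          password:
            name: grafana-cloud-credentials
            key: password
```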

Monitoring Best Practices

Resource Tuning:

  • Start with conservative limits and increase if needed
  • Monitor Prometheus memory usage; it grows with the number of active time series
  • Use metric relabeling to drop unnecessary metrics
  • Adjust retention period based on actual needs
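Dropping unneeded metrics is done per scrape target via metricRelabelings on a ServiceMonitor; a sketch (the target name and metric are only examples):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical scrape target
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      metricRelabelings:
        # Drop a high-cardinality histogram before it reaches the TSDB
        - sourceLabels: [__name__]
          regex: apiserver_request_duration_seconds_bucket
          action: drop
```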

Storage Considerations:

  • Prometheus needs fast I/O - bcache is ideal
  • Plan for ~300-500 MB per day of metrics with 30 pods
  • Enable persistent volumes to survive pod restarts

Query Optimization:

  • Use recording rules for frequently-used queries
  • Avoid long time ranges in dashboards
  • Use downsampling for historical data
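The recording rules mentioned above precompute an expression on a schedule so dashboards query the stored result instead of re-evaluating it; with the operator this is a PrometheusRule resource (the names and label are assumptions matching a default install):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # so the operator's rule selector picks it up
spec:
  groups:
    - name: pod-usage.rules
      interval: 1m
      rules:
        - record: namespace_pod:container_cpu_usage:sum_rate5m
          expr: sum(rate(container_cpu_usage_seconds_total{namespace!=""}[5m])) by (pod, namespace)
```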

Prometheus Metrics Retention Calculation

Formula: Storage ≈ Series Count × Samples per Day × Retention (days) × Bytes per Sample

For this cluster:

  • ~30 pods × ~1000 metrics per pod = 30k time series
  • Sample every 15s = 5760 samples/day per series
  • Compressed: ~1-2 bytes per sample
  • 15-day retention: ~2.5-5 GB
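The arithmetic above as a quick script (all inputs are the assumed figures from this analysis, not measurements):

```python
# Rough Prometheus TSDB storage estimate (assumed inputs from the analysis above)
series = 30 * 1000                # ~30 pods x ~1000 series each
samples_per_day = 86400 // 15     # one sample per series every 15s -> 5760
retention_days = 15
GiB = 1024 ** 3

def storage_gib(bytes_per_sample):
    """Total storage for the retention window at a given compression ratio."""
    return series * samples_per_day * retention_days * bytes_per_sample / GiB

low, high = storage_gib(1), storage_gib(2)   # typical 1-2 bytes/sample after compression
print(f"~{low:.1f}-{high:.1f} GiB for {retention_days}-day retention")
# prints "~2.4-4.8 GiB for 15-day retention"
```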

Useful Prometheus Queries

CPU Usage:

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace!=""}[5m])) by (pod, namespace)

# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage:

# Memory usage by pod
sum(container_memory_working_set_bytes{namespace!=""}) by (pod, namespace)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Disk Usage:

# Disk usage by mountpoint
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Bcache hit rate (node_exporter's bcache collector prefixes these with node_)
rate(node_bcache_cache_hits_total[5m]) / (rate(node_bcache_cache_hits_total[5m]) + rate(node_bcache_cache_misses_total[5m]))

Bottom Line

Verdict: Yes, you can run Prometheus + Grafana with current resources.

Impact Summary:

  • CPU load: 13% → 24-33% ✓ Acceptable
  • Memory: 43% → 53-58% ✓ Acceptable (but less buffer)
  • Disk: Need to use /mnt/bcache ⚠️ Root partition too full

Critical Requirement:

  • Ensure Prometheus stores data on /mnt/bcache using local-path storage class
  • Do NOT store on root partition (already at 72%)

Next Steps:

  1. Create values.yaml with resource limits above
  2. Install kube-prometheus-stack via Helm
  3. Monitor actual resource usage for 1 week
  4. Tune retention period and limits as needed
  5. Set up ingress for Grafana access (optional)
