GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job holding the GPU, so you know who's paying for that idle H100.
FakeAI: Rapid Development and Testing for AI Infrastructure
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring