DevOps Engineer (42498)
I'm looking for a Senior DevOps Engineer who will lead the development and optimization of AI/ML solutions for our clients. You’ll work closely with cross-functional teams to design and implement PoCs, optimize GPU clusters, and troubleshoot AI workflows. Strong experience with NVIDIA GPUs, Slurm, and containerized AI workflows is essential, along with proficiency in Python and Bash scripting.
🚀 Project
- consulting customers on all technical aspects related to GPU infrastructure, AI/ML model training, and platform usage
- leading onboarding and training, mentoring customer specialists on optimal usage of their GPU clusters and AI environments
- designing and implementing PoCs, including environment setup, data processing pipelines, and deployment workflows
- conducting requirement engineering, translating business needs into technical specifications
- assisting customers with performance optimization, troubleshooting, fine‑tuning, and validation of delivered solutions
- acting as the key technical point of contact, coordinating cross‑functional teams across infrastructure, networking, automation, security, and AI services
- proposing and developing automation concepts to improve services, processes, and operating models
- ensuring best practices in reliability, scalability, responsible AI, and security are applied across the customer lifecycle
- supporting monitoring, observability, and capacity planning for AI workloads and GPU utilization
🎯 Skills
- Master’s degree in Information Technology, Computer Engineering, Applied AI, or related field
- strong knowledge of NVIDIA GPU-accelerated platforms (DGX, B200, RTX Pro Servers)
- experience running and training self-hosted LLMs, including model fine-tuning and inference optimization
- hands-on experience with Slurm, Run:AI, or other GPU workload schedulers
- advanced Linux administration skills
- solid understanding of Kubernetes and containerized AI workflows
- proficiency in scripting (Python, Bash) for automation, data manipulation, and tooling
- experience with Infrastructure as Code (Ansible, Terraform, Helm)
- knowledge of Software-Defined Networking (SDN) and high-performance network architectures
- experience with monitoring and visualization tools (Prometheus, Grafana, Alertmanager)
- experience working with Data Engineering/Transformation/Migration tools and pipelines
- understanding of LLM architectures, embeddings, and vector databases
- familiarity with RAG pipelines, model evaluation, and prompt engineering
- knowledge of responsible AI practices (security, governance, compliance)
- experience with AI/ML frameworks: PyTorch, TensorFlow, Hugging Face, Triton Inference Server
- English (C1) is required; German is an advantage
- strong customer-facing communication skills, both technical and non-technical
- experience with requirement engineering (basic)
- experience with software testing, quality assurance, and validation (intermediate)
- analytical mindset, problem-solving skills, structured approach to troubleshooting
- ability to work independently as well as coordinate with cross-functional teams