Distributed Systems Engineer (41404)
Join this dynamic team as a Distributed Systems Engineer to design and maintain cutting-edge, scalable distributed systems that support AI workloads in cloud environments. As a key player, you'll optimize performance, reduce latency, and ensure fault tolerance across globally distributed infrastructure. Drawing on 7+ years of experience building production-grade infrastructure, you'll troubleshoot complex issues, work with cross-functional teams, and improve system observability for high availability. If you're passionate about performance, low-latency systems, and open source, don't hesitate to reach out.
🚀 Project
- design and maintain scalable, reliable distributed systems to support large-scale AI workloads in cloud environments
- optimize performance and fault tolerance across geographically distributed systems to ensure high availability and minimal latency
- troubleshoot and resolve complex system issues, including performance bottlenecks, failures, and scalability problems
- collaborate with cross-functional teams to integrate AI workloads and ensure seamless interaction with the underlying infrastructure
- monitor and enhance system observability (through logging, tracing, and metrics) to ensure the health and performance of the platform
🎯 Skills
- 7+ years building distributed systems in production
- experience operating large-scale infrastructure (100k+ RPS, multi-region)
- strong Linux internals knowledge (kernel, cgroups, scheduling, memory)
- experience with VMs / hypervisors (Firecracker, KVM, QEMU)
- strong systems programming skills in Go (or Rust / C / C++)
- production experience with orchestration (Kubernetes, Nomad, or custom)
- performance-driven mindset (latency, p99, profiling, hot paths)
- solid networking fundamentals (L4/L7, namespaces, iptables/nftables)
- comfortable working in open source
💡 Nice to have
- eBPF, userfaultfd (UFFD), copy-on-write (COW), lazy loading
- Firecracker internals or contributions
- GPU passthrough / PCIe virtualization
- infrastructure for AI / ML workloads
- large-scale observability (kernel metrics, tracing)
- contributions to low-level open-source projects