
AI Infrastructure Operations Engineer


Cerebras Systems

Sunnyvale CA or Toronto Canada


Posted 15 April 2026

full-time

About the Role

<div class="content-intro">Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. <a href="https://openai.com/index/cerebras-partnership/">OpenAI recently announced a multi-year partnership with Cerebras</a> to deploy 750 megawatts of scale, transforming key workloads with ultra-high-speed inference. Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.</div> <h4>About The Role</h4> We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. These clusters give you the opportunity to work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems.
The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and a strong commitment to customer success. <h4>Responsibilities</h4> <ul> <li>Manage and operate multiple advanced AI compute infrastructure clusters. </li> <li>Monitor and oversee cluster health, proactively identifying and resolving potential issues. </li> <li>Maximize compute capacity through optimization and efficient resource allocation. </li> <li>Deploy, configure, and debug container-based services using Docker. </li> <li>Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed. </li> <li>Handle engineering escalations and collaborate with other teams to resolve complex technical challenges. </li> <li>Contribute to the development and improvement of our monitoring and support processes. </li> <li>Stay up to date with the latest advancements in AI compute infrastructure and related technologies. </li> </ul> <h4>Skills And Requirements</h4> <ul> <li>6-8 years of relevant experience managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing. </li> <li>Strong proficiency in Python scripting for automation and system administration. </li> <li>Deep understanding of Linux-based compute systems and command-line tools. </li> <li>Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and Slurm. </li> <li>Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner. </li> <li>Experience with monitoring and alerting systems. </li> <li>Proven track record of owning challenges and driving them to completion. </li> <li>Excellent communication and collaboration skills. </li> <li>Ability to work effectively in a fast-paced environment. </li> <li>Willingness to participate in a 24/7 on-call rotation. 
</li> </ul> <h4>Preferred Skills And Requirements</h4> <ul> <li>Experience operating large-scale GPU clusters.</li> <li>Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.</li> <li>Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).</li> </ul>
