Responsibilities

- Build in-house tooling to support post-training of models
- Work across the stack: Kubernetes, storage, networking
- Leverage PyTorch distributed tensor computation and GPU kernels
- Train a wide spectrum of model architectures at scale
- Collaborate with researchers to define specs and requirements
- Address systems-level ML infrastructure and tooling challenges

Qualifications

- Deep understanding of modern ML techniques for training transformers
- Advanced experience with PyTorch, TensorFlow, or JAX
- Knowledge of transformer training parallelism: data, tensor, and pipeline (a minimal sketch appears at the end of this posting)
- Ability to profile and optimize distributed GPU programs
- Familiarity with HPC and distributed platforms: Slurm, Ray, Kubernetes, Dask
- Familiarity with cluster networking: InfiniBand, RoCE, GPUDirect

Benefits

- Competitive compensation, including meaningful equity
- 100% medical, dental, and vision coverage for employees and dependents
- Generous PTO, including Winter Break
- Paid parental leave
- Company-facilitated 401(k)
- Exposure to a variety of ML startups and learning opportunities
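Purely illustrative, and not part of the role description: a minimal sketch of the data-parallel flavor of the "transformer training parallelism" named above, using PyTorch DistributedDataParallel. The model, batch shapes, and hyperparameters are placeholder assumptions, not anything specified by this posting.

    # Hypothetical sketch of a data-parallel training step; launch with
    # torchrun --nproc_per_node=<gpus> train_sketch.py (script name assumed).
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL is the usual GPU backend.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; a real workload would build a transformer here.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(10):
            x = torch.randn(32, 1024, device=local_rank)  # stand-in batch
            loss = model(x).pow(2).mean()                 # stand-in loss
            opt.zero_grad()
            loss.backward()   # DDP all-reduces gradients across ranks here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Tensor and pipeline parallelism shard the model itself (across attention/MLP blocks or across layer stages) rather than replicating it per rank, but a single-file sketch of those would go well beyond the scope of this posting.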