Model Evaluation Engineer

New

Skills

Benchmarking Pipelines, Cloud Infrastructure, Customer Feedback Analysis, Evaluation Datasets, Experimental Design, Model Evaluation, Python Programming, SQL, Statistical Analysis, Voice Agent Technology

As a Research Engineer focusing on Evaluations, you will oversee model evaluation and ensure that models meet targets for accuracy, latency, and feature-specific metrics. The role involves building and maintaining benchmarking pipelines, designing experiments, and working with other teams to translate customer feedback into actionable evaluation criteria.

Key Responsibilities
  • Own end-to-end and integration-level model evaluation across multiple metrics.
  • Build and maintain competitive benchmarking pipelines.
  • Design and run systematic experiments to measure the impact of model changes.
  • Onboard, curate, and maintain evaluation datasets.
  • Create evaluation subsets for stress-testing capabilities and edge cases.
  • Define evaluation metrics for real-world performance.
  • Translate qualitative customer feedback into quantifiable evaluation criteria.
  • Work with customer-facing teams to understand pain points and convert them into research priorities.
  • Maintain clean evaluation pipelines and clear documentation.
  • Identify evaluation gaps proactively and propose solutions.

Required Skills & Qualifications
  • Strong understanding of ML fundamentals and the ability to debug models without retraining.
  • Proficient in Python for writing evaluation scripts and managing data pipelines.
  • Comfortable with SQL and cloud infrastructure.
  • Intuition for good evaluation metrics and statistical rigor.
  • Familiarity with the voice agent stack, including VAD, ASR, LLM, and TTS systems.
  • Tinkerer mentality with a preference for shipping and iterating quickly.
  • Excellent communication skills for summarizing findings and translating technical results.
  • Ownership mindset with a proactive approach to filling evaluation gaps.
  • Ability to work 3-4 hours overlapping with the US Eastern time zone.

Job Type: Remote

Salary: Not Disclosed

Experience: Entry

Duration: Months

Similar Jobs

Model Evaluation Engineer

Posted 6 days ago

Conduct comprehensive model evaluations.

Establish and maintain benchmarking pipelines.

Benchmarking Pipelines Cloud Infrastructure Data Pipelines Evaluation Metrics

Model Evaluation Engineer

New

Conduct comprehensive model evaluations.

Develop and maintain benchmarking pipelines.

Benchmarking Pipelines Cloud Infrastructure Data Pipelines Documentation

Model Evaluation Engineer

New

Lead end-to-end model evaluation processes.

Develop and maintain benchmarking pipelines.

Benchmarking Pipelines Cloud Infrastructure Dataset Curation Documentation

Model Evaluation Engineer

New

Conduct comprehensive model evaluations.

Build and maintain benchmarking pipelines.

Benchmarking Pipelines Cloud Infrastructure Customer Feedback Analysis Data Pipeline Management

Strategic Partner Development

Posted 9 days ago

Architect alliances with hardware partners.

Identify decision-makers within partner organizations.

Cloud Infrastructure Cross-Functional Leadership Market Analysis Mentoring and Coaching

AI-Enabled DevOps Engineer

New

Implement and maintain cloud infrastructure with IaC.

Improve CI/CD pipelines for applications and ML workloads.

Bash CI/CD Pipelines Cloud Infrastructure DevOps

Model Evaluation Engineer

Posted 7 days ago

Evaluate models across accuracy and latency.

Build benchmarking pipelines for competitive analysis.

Automatic Speech Recognition (ASR) Cloud Infrastructure Data Pipelines Large Language Models (LLM)

Junior Technical Program Manager

Posted 7 days ago

Support delivery of data center programs.

Manage timelines and project scope.

AI Infrastructure Cloud Infrastructure Cross-functional Coordination Data Center Infrastructure

Strategic Sourcing Manager

Posted 6 days ago

Partner with engineering leaders for sourcing plans.

Lead sourcing across infrastructure and AI technology.

AI Technologies Cloud Infrastructure Data Analysis Developer Platforms

Engineering Program Manager

Posted 6 days ago

Unify technology strategy and enhance decision-making.

Oversee cross-functional initiatives from start to finish.

CI/CD Pipelines Cloud Infrastructure Cross-Functional Leadership Data Analysis

Senior ML Engineer

Posted 3 days ago

Develop and maintain ML platform infrastructure.

Provide shared components for deployment and API design.

Algorithms API Design Cloud Infrastructure Collaboration Tools

Senior DevOps Engineer

Posted 3 days ago

Build automation tools for resource delivery.

Collaborate with engineering teams for quality product delivery.

Automation Tools Cloud Infrastructure Containerization DevOps

Director of Strategic Alliances

Posted 3 days ago

Lead strategic partnerships with key industry players.

Develop go-to-market strategies for AI and GPU deployments.

AI/ML Workloads Cloud Infrastructure Data Centers GPU Technologies

Privacy Engineer Role

New

Ensure user privacy across data handling.

Develop tools for privacy enhancement.

Cloud Infrastructure Code Review Data Mapping Go

Security & Infrastructure Lead

New

Lead security and infrastructure strategy.

Manage and develop security teams.

AWS CI/CD Cloud Infrastructure Container Orchestration

Model Evaluation Engineer

New

Lead end-to-end model evaluation.

Build competitive benchmarking pipelines.

Benchmarking Cloud Infrastructure Data Pipelines Documentation

Starlink Aviation Account Lead

Posted 9 days ago

Serve as the primary contact for Aviation accounts.

Manage onboarding and account tasks post-signature.

Aviation Industry Knowledge Consulting Contract Management Cross-Functional Coordination

Remote Product Manager

Posted 7 days ago

Hiring for a remote Product Manager position.

Position is full-time and has no geographical restrictions.

Agile Methodologies Communication Skills Cross-functional Collaboration Customer Feedback Analysis

Junior Data Scientist

New

Support clients with data science methodologies.

Collaborate with Data Science teams.

Consumer Behavior Analysis Data Analysis Database Systems Experimental Design

Senior Data Scientist Role

New

Design optimization solutions for pricing and resource allocation.

Deploy and maintain machine learning models in production.

Causal Inference Collaboration Tools Data Analysis Experimental Design

Senior Bioprocess Engineer

New

Serve as primary technical contact for clients.

Advise on experimental design and process decisions.

Bioprocess Engineering Bioreactor Operations Cell Physiology Experimental Design

People Intelligence Intern

New

Leverage statistical analysis for insights on talent themes.

Support development of data-driven models for talent decisions.

Behavioral Data Analysis Data Cleaning Data Structuring Experimental Design

Product Manager - Community

New

Enhance Twitch's growth levers through tactical improvements.

Build a strategic plan for notifications platform evolution.

Agile Methodologies Collaboration Communication Skills Data Analysis

AI Research Manager

Posted 18 days ago

Lead research direction for advanced AI systems

Guide the design of cutting-edge RAG systems

Data Analysis Deep Learning Documentation Leadership

Generalist - Language AI Evaluation

Posted 18 days ago

Evaluate LLM-generated responses

Conduct fact-checking on model responses

AI Analytical Thinking Content Writing Data Annotation

Remote Chemistry AI Tutor

Posted 18 days ago

Connect chemistry experts to AI projects

Improve AI model reasoning in chemistry

Critical Thinking Data Annotation Model Evaluation Remote Collaboration

Remote Mathematics AI Tutor

Posted 18 days ago

Support AI model development with expert mathematics input

Evaluate and refine AI-generated mathematical responses

Data Annotation Mathematics Model Evaluation Prompt Engineering

Remote Electrical AI Tutor

Posted 18 days ago

Collaborate remotely on AI projects

Enhance generative AI with domain expertise

Analytical Thinking Data Annotation English Proficiency Generative AI

Civil Engineering AI Tutor

Posted 18 days ago

Enhance AI with civil engineering expertise

Generate and evaluate AI prompts

Analytical Skills Critical Thinking Generative AI Model Evaluation

Generalist - AI Language Model

Posted 18 days ago

Improve conversational AI systems

Assess model-generated responses

AI Development Analytical Skills Communication Skills Machine Learning

ML Research Engineer

Posted 18 days ago

Architect and maintain evaluation suites

Build scalable pipelines for model training

Data Engineering Model Evaluation Python PyTorch

AI/ML Product Builder

Posted 18 days ago

Define AI/ML agents for reliability

Prototype agent behaviours

AI/ML CoPilot LLMs Model Evaluation

Data Scientist/AI Trainer

Posted 18 days ago

Develop and maintain Python code for data analysis, model evaluation, and AI workflow automation.

Design and refine prompts for LLMs to optimize conversational performance.

Conversational AI Data Analysis Data Science Machine Learning

Senior Product Manager - Intelligence Catalog

Posted 18 days ago

Lead and own the Intelligence Catalog and taxonomy

Drive improvements in noise reduction and precision/recall metrics

AI/ML Communication Skills Data Science Enterprise SaaS

Ubuntu Sales Engineer (Entry-Level)

Posted 18 days ago

Drive adoption of Ubuntu Pro in enterprise settings

Understand and address customer requirements

AWS Azure Cloud Computing Containers

Automation Lead

Posted 18 days ago

Lead team towards high-impact solutions, Work collaboratively with scientific teams, Stay updated

cutting-edge tools, Develop novel assays, Efficiently allocate team

Genomics Python Programming

Agentic AI Developer Research

Posted 18 days ago

Understand user experiences with agentic AI systems

Gather insights from developers and practitioners in the field

Android API Data Science Deep Learning

Cryptographic Client Server System

Posted 18 days ago

Implement public-key cryptography for client security.

Facilitate device addition and revocation for user accounts.

Cryptography Cybersecurity Management Data Encryption Python Programming

Cryptography Client-Server Assignment

Posted 18 days ago

Implement public-key cryptography for secure client-server communication.

Enable clients to manage device access through per-device keys.

Cybersecurity Management Data Encryption Data Security Python Programming