AI Evaluation and QA Jobs
Quality operators who design evals, inspect AI outputs, find failure patterns, and define what good behavior means for agents or AI workflows.
counts when
[x] builds rubrics
[x] finds hallucinations
[x] tests agent behavior
reject when
[-] generic QA
[-] data labeling only
[-] model training only
source signals
# groundedness
# benchmark scenarios
# failure analysis
editorial filter
This work asks whether an AI system is good enough for real use. The job should involve eval rubrics, golden datasets, groundedness checks, hallucination review, workflow QA, or failure analysis.
Generic QA is not enough, and data labeling alone is not enough. A match should require judgment about model outputs: whether they are useful to users, what business risk they carry, and where plausible-sounding answers still fail.
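To make "judgment about model outputs" concrete, here is a minimal sketch of one source signal named above, a groundedness check. It is a toy heuristic only: it flags answer sentences with low token overlap against the source context, where real eval pipelines would use NLI models or LLM judges. All names, thresholds, and the example strings are illustrative, not from any listed employer's stack.

```python
# Toy groundedness check: flag answer sentences whose token overlap
# with the source context falls below a threshold. Illustrative only.
import re

def sentences(text: str) -> list[str]:
    # Naive sentence split on terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness_report(answer: str, context: str, threshold: float = 0.5):
    """Return (mean_overlap, flagged): flagged lists sentences whose
    token overlap with the context is below the threshold."""
    ctx = tokens(context)
    scores, flagged = [], []
    for sent in sentences(answer):
        toks = tokens(sent)
        overlap = len(toks & ctx) / len(toks) if toks else 0.0
        scores.append(overlap)
        if overlap < threshold:
            flagged.append(sent)
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean, flagged

# Hypothetical example: the second sentence is ungrounded in the context.
context = ("The refund window is 30 days from delivery. "
           "Refunds go to the original payment method.")
answer = ("Refunds are available for 30 days from delivery. "
          "We also offer free lifetime upgrades.")
score, flagged = groundedness_report(answer, context)
```

Running this flags "We also offer free lifetime upgrades." as ungrounded; the point is that a plausible-sounding sentence can have zero support in the source, which is exactly the failure this job category exists to catch.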
› Contract eval/annotation role for AI output quality — you judge whether AI-generated content meets accuracy, cultural, and brand standards, with flexible hours and no coding required.
comp not disclosed · 1d ago

AI Operations Specialist | Housing (New Grads 2025-2026) · EliseAI · New York City
› Entry-level AI ops seat watching dashboards and logs to catch AI failures before they escalate — you flag and coordinate, engineers do the fixing.
comp not disclosed · 1d ago

AI Conversation Designer · Pearl · Remote, US
› Conversation design role writing prompts, optimizing chatbot flows, and QA-testing LLM experiences for customer journeys.
comp not disclosed · 4d ago

Applied AI Evaluation Scientist · Jump · Remote (U.S.)
› Evaluation role owning RAG and agent quality frameworks, with research-grade Python instead of production infrastructure.
$180–270k · 4d ago

AI Quality Operator · Neon Health · San Francisco, CA (USA)
› Healthcare QA role reviewing AI agent calls, catching errors, labeling issues, and improving real workflows.
$594k · 4d ago

Operations Specialist, AI Enablement · Bumble Inc. · Austin, TX / London / Remote
› QA role reviewing AI support conversations for accuracy, policy fit, tone, and recurring failure patterns.
comp not disclosed · 4d ago

Adversarial Prompt Expert · Reinforce Labs · Remote
› Red-team role finding jailbreaks, ranking model failures, and documenting attack paths so safety teams can patch them.
comp not disclosed · 4d ago

Prompt Engineer · Cantina · Los Angeles; San Francisco
› Expert prompt role owning AI character behavior, personality systems, and evaluation frameworks for social generative AI experiences.
$150–180k · 6d ago

Prompt Engineer - AI Innovation Team - US · SitusAMC · US - Remote
› Prompt-focused AI role owning use-case translation, agent behavior oversight, and quality testing for commercial real estate workflows.
$50k · 6d ago

AI Content Reviewer (Video) · Crossing Hurdles · Remote
› High-signal role for evaluating the next generation of AI video models.
$25–34k · 1w ago

Senior AI Evaluation Specialist — IP Guardrails and Agentic Workflows · Adobe · New York, NY
› Senior evaluation role at Adobe covering IP guardrails and agentic workflow quality.
$155–281k · 1w ago

AI Agent Architect, Customer Experience · Airtable · Remote - US
› Agent architect role at Airtable centered on workflow design for customer experience.
$196–278k · 1w ago

AI Operations Specialist · Bretton AI · San Francisco, CA
› Operational role on the evaluation and quality side of AI agent deployments.
$90–105k · 1w ago