MLOps Engineer (SRE) at PEOPLE

Remote (Mexico, Canada, or United States) | Senior | Full time | SysAdmin / DevOps / QA

Gross salary $3800 - 5000 USD/month

5 applications

Responds within 4 to 6 days

Last reviewed today

Applications must be submitted in English.

We are looking for an MLOps Engineer with a focus on Site Reliability Engineering (SRE) to join a critical artificial intelligence project. The project aims to ensure the reliability, traceability, and availability of machine learning models in production, with low latency and active monitoring of both business and ML metrics.

Applications are accepted only at getonbrd.com.

Main Responsibilities

Design and operate observability solutions for ML models in production (monitoring, alerts, traceability).
Develop dashboards and metrics to evaluate model performance, cost, and stability.
Implement tools for structured logging, drift monitoring, data quality, and inference error tracking.
Automate scaling, fault recovery, and self-healing of inference services.
Establish SLAs/SLIs/SLOs for ML pipelines and intelligent services.
Collaborate with data science and product teams to detect and mitigate incidents related to models in production.
Set up rollback policies and blue/green deployments for model versions.
Apply SRE practices such as chaos engineering, stress testing, staging tests, and continuous integration.

Profile Requirements

Minimum of 4 years of experience as an SRE, DevOps, or Platform Engineer in machine learning projects.
Knowledge of model monitoring frameworks such as Evidently, Arize AI, WhyLabs, or similar.
Proficiency with tools like Prometheus, Grafana, ELK/EFK, OpenTelemetry, or Datadog.
Experience with orchestrators such as Airflow or Kubeflow, or with experiment tracking tools (MLflow, Weights & Biases).
Strong knowledge of Kubernetes, Docker, Helm, and infrastructure-as-code tools (Terraform, Pulumi).
Experience with CI/CD for ML pipelines (testing, validation, rollback).
Ability to automate processes, monitor systems in real time, and respond to critical incidents.
Strong collaborative skills to work closely with data scientists and product teams.
Attention to detail, resilience in high-pressure environments, and a mindset focused on continuous improvement.

Nice-to-have (non-mandatory)

Experience operating models on Alibaba Cloud and configuring observability in that environment.
Familiarity with strategies such as canary deployment, shadow testing, and controlled experimentation.
Knowledge of explainable AI frameworks and model auditing.
Previous experience in high-transaction environments such as banking, accounting, payroll, or logistics.

Benefits

Work modality: Remote.

Project duration: 1 year, with the possibility of extension.

GETONBRD Job ID: 55813

Computer: PEOPLE provides a computer for your work.

Remote work policy

Locally remote only

The position is 100% remote, but candidates must reside in Mexico, Canada, or the United States to apply.

About PEOPLE

People Co. is a company specializing in talent search and selection. It offers agile, personalized recruitment solutions tailored to each organization's needs.