Role Overview
The Lead – Product Operations is responsible for ensuring that DRPL’s AI-driven video understanding and video modification systems run smoothly across both VoD and Live environments. This is a hands-on, highly technical role that blends dataset preparation, AI model training support, software testing, production monitoring, and DevOps practices. The role requires someone who can bridge the gap between engineering and operations and keep our AI pipelines and infrastructure healthy day to day.
You will own operational excellence for production systems, ensuring data readiness, pipeline reliability, and proactive issue resolution across both cloud and on-premise (edge) deployments.
Key Responsibilities
AI Data & Training Operations
- Prepare, curate, and validate datasets for AI model training.
- Coordinate with AI engineers to ensure model training pipelines run efficiently.
- Validate AI model outputs against quality benchmarks before deployment.
Production Pipeline Monitoring & Maintenance
- Ensure end-to-end health of AI pipelines (data ingestion, processing, inference, output delivery).
- Monitor system uptime and performance using observability tools.
- Configure alerts and escalation workflows for critical failures or performance degradation.
- Maintain deployment documentation, runbooks, and standard operating procedures.
DevOps & Deployment
- Deploy and maintain AI and software components on cloud platforms as well as on bare-metal/edge systems.
- Install and configure operating systems, dependencies, and runtime environments.
- Automate deployment and monitoring where possible.
Software & Model Testing
- Conduct integration and functional testing for new releases of AI models and platform features.
- Work with QA engineers to create and maintain test plans for AI workflows.
Cross-functional Collaboration
- Work closely with Engineering Leads and Architects to implement operational improvements.
- Coordinate with Product Managers to align operational tasks with business priorities.
Required Skills & Experience
- Strong technical background with hands-on experience in Linux systems administration, shell scripting, and automation tools.
- Knowledge of DevOps tools (Docker, Kubernetes, CI/CD systems, and monitoring tools such as Prometheus and Grafana).
- Understanding of AI/ML workflows including dataset preparation, training pipelines, and inference systems.
- Experience with cloud services (AWS, GCP, Azure) and bare-metal/edge deployments.
- Familiarity with software testing practices and QA methodologies.
- Ability to troubleshoot software, model, and system issues across the stack.
- Strong analytical and problem-solving skills; proactive in identifying and resolving operational issues.
- Comfortable working with both structured and unstructured data.
Nice-to-Have Skills
- Exposure to video processing or AI-based media workflows.
- Experience with GPU-accelerated systems and drivers.
- Familiarity with database systems and message queues (PostgreSQL, MongoDB, Kafka, Redis, etc.).
- Python scripting for automation and data manipulation.