PubMatic SRE
JD
Site Reliability Engineer
Location:
Pune, MH, IN
Department: Data Center
Position Description
The AdServer and RTB Production Infrastructure is pivotal to ensuring our software applications' reliability, availability, and overall excellence. As an SRE, you will be responsible for the AdServer and RTB Production Infrastructure. Your essential duties include ensuring the seamless operation and optimal performance of large-scale distributed software applications. Your role revolves around maintaining a robust and high-performing environment, contributing to the reliability of our services, and innovating solutions to guarantee 24/7 availability. By applying your technical expertise and dedication, you help maintain a seamless experience for our users while upholding the highest standards of operational excellence. Your specific responsibilities include:
Responsibilities:
Operational Support
Be a primary point of contact for operational support of multiple large-scale distributed software applications in the Ad Server environment.
Monitor availability of applications, promptly detect anomalies, analyze the impact, debug the problems in production, and follow up for the resolution by working closely with the engineering team.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Diligently work with the engineering team to expedite the resolution of incidents and ensure a swift return to normal operations.
Be innovative in building dashboards, adding metrics, writing automation scripts to reduce operation toil, and streamlining processes to enhance system reliability and stability.
Design and construct software and systems to effectively manage the Ad Serving platform, its underlying infrastructure, and applications.
On Call Availability and Support
Work in shifts to provide continuous on-call support for the production systems and resolve issues on your own by using predefined handbooks.
Show a sense of urgency for high-priority issues and arrange war rooms to resolve the problems.
Provide timely updates for high-priority issues and do handovers when a problem needs to be worked on 24/7.
Conduct post-incident reviews to identify root causes, recommend preventive measures, and contribute to a culture of learning and improvement.
Requirements:
Bachelor's degree in computer science or related disciplines
Total 3+ years' experience in software development
Ability to program in languages like C or C++ and scripting languages like Shell or Python
Good to have prior experience in technical engineering
A proactive approach to identify the problems, performance bottlenecks, and areas of improvement
Must know networking, database (MySQL), and Linux system concepts, as well as debugging and analyzing core dumps
Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.
Familiarity with container tools like Docker, dashboards like Grafana, and incident management systems like Zenduty
Excellent communication and collaboration skills, with the ability to work effectively across teams.
A self-motivated, positive mindset when examining incidents
What to do
Since you're applying for an SRE role at PubMatic that requires some coding (C, C++, Shell, or Python), but you're not proficient in programming, here’s what you can do:
1️⃣ Focus on Your Strengths (SRE, Infra, and Debugging)
This role emphasizes operational support, monitoring, debugging, and incident response.
Highlight your experience with Kubernetes, AWS, CI/CD, Terraform, Grafana, and Linux debugging—all of which are crucial for an SRE.
Be ready to discuss real-world incidents, system optimizations, and how you handled high-priority issues.
2️⃣ Learn Just Enough Coding for SRE Interviews
While you don’t need to be a full-time developer, knowing basic scripting will boost your confidence. Since the job mentions Shell or Python, focus on:
Log parsing & automation (grep, awk, sed, jq in shell).
Writing simple automation scripts in Python (e.g., a script to check service health or API response).
Debugging crashes & logs (gdb, strace, journalctl, core dump analysis).
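As a rough idea of what "a script to check service health or API response" could look like, here is a minimal sketch using only the Python standard library. The function name `check_health` and the endpoint URL are made up for illustration; a real service will have its own health URL and its own definition of "healthy".

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 3.0) -> tuple[bool, str]:
    """Return (healthy, detail) for a single HTTP GET against url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen raises HTTPError for 4xx/5xx, so getting here means
            # the final response was not an error status.
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {exc}"


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real health URL.
    ok, detail = check_health("http://127.0.0.1:8080/healthz")
    print("OK" if ok else "FAIL", detail)
```

In an interview, it helps to explain why `HTTPError` is caught before `URLError` (it is a subclass, so the order matters) and why a timeout is always set.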
3️⃣ Prepare for SRE-Specific Coding Questions
Most SRE interviews don’t focus on LeetCode-style coding. Instead, expect:
✔ Writing a basic script (e.g., check if a service is running, log parser).
✔ Debugging a broken script.
✔ Writing a one-liner command to get data (awk, grep, jq).
Example Questions:
Write a shell script to check if a service is running.
Parse logs and extract failed API requests.
Use Python to check if a server is reachable.
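Two of these questions can be sketched in a few lines of Python. This is one possible answer, not the expected one: the log layout (`<ip> "<METHOD> <path> HTTP/x" <status> <bytes>`), the sample lines, and the function names are all assumptions chosen for illustration, and "failed" is taken here to mean a 5xx status unless you pass a lower threshold.

```python
import re
import socket

# Assumed (hypothetical) log layout: '<ip> "<METHOD> <path> HTTP/x" <status> <bytes>'
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def failed_requests(lines, min_status=500):
    """Yield (method, path, status) for lines whose status is >= min_status."""
    for line in lines:
        match = LINE_RE.search(line)
        if match and int(match["status"]) >= min_status:
            yield match["method"], match["path"], int(match["status"])


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Made-up sample lines in the assumed layout.
sample = [
    '10.0.0.1 "GET /api/bid HTTP/1.1" 200 512',
    '10.0.0.2 "POST /api/bid HTTP/1.1" 503 0',
    '10.0.0.3 "GET /api/sync HTTP/1.1" 404 0',
]
print(list(failed_requests(sample)))  # [('POST', '/api/bid', 503)]
```

Note that `is_reachable` only proves a TCP port accepts connections; it says nothing about whether the application behind it is healthy, which is a distinction worth stating out loud in an interview.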
4️⃣ Prepare for System Debugging & Incident Handling
Since monitoring and debugging are key, practice explaining:
✅ How to troubleshoot high CPU/memory issues in Linux.
✅ How to debug a slow MySQL query.
✅ What happens when a Pod crashes in Kubernetes?
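For the CPU/memory question, a first triage step is often to check how much memory is actually available and how the load average compares to the core count. The sketch below is Linux-only and assumes a kernel that exposes `MemAvailable` in `/proc/meminfo`; the function names are illustrative, and in practice you would reach for `top`, `free`, or `vmstat` first.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # kB for most fields
    return fields


def memory_used_percent(fields):
    """Percent of RAM not available to new workloads (based on MemAvailable)."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        load1, _, _ = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"memory used: {memory_used_percent(info):.1f}%")
        print(f"load average (1m): {load1:.2f} on {cores} cores")
    except FileNotFoundError:
        print("no /proc/meminfo; this sketch is Linux-only")
```

A 1-minute load average well above the core count suggests CPU or I/O saturation, while a low `MemAvailable` points toward memory pressure; either result tells you which tool to open next.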
5️⃣ Use Your SRE Knowledge to Your Advantage
If they ask about programming, redirect to your strengths.
Example Response: "While I’m not a full-time developer, I am proficient in writing automation scripts for infrastructure monitoring, incident handling, and system debugging. I use Python/Shell for tasks like log analysis, alerting, and service health checks."
🔥 Final Game Plan
1️⃣ Brush up on Shell/Python for automation (no need for deep coding knowledge).
2️⃣ Prepare real-world incident handling stories.
3️⃣ Practice debugging and system performance troubleshooting.
4️⃣ Use monitoring tools (Grafana, ELK, Prometheus) to analyze data.
Would you like a mock interview session to practice coding/debugging questions? 🚀