SRE
Linux QA
Absolutely! Below are 20 commonly asked Linux-related questions in Google SRE interviews, along with answers that reflect what a top candidate would say — someone who understands the why, not just the how.
🧠 Top 20 Google SRE Linux Interview Questions (with Ideal Answers)
1. What happens when you type google.com in your browser?
Best Answer:
When you type google.com, the browser first checks its own cache and the OS cache, then the resolver consults /etc/hosts and the nameservers listed in /etc/resolv.conf. Once it has the IP, the browser opens a TCP connection (3-way handshake) to port 443, and a TLS handshake secures it. The browser then sends an HTTP GET request, the server responds, and the browser renders the page. Underneath, IP routing picks the next hop and ARP resolves that hop's MAC address on the local network.
2. How does ps get process information?
Best Answer:
The ps command reads from the /proc virtual filesystem, especially files like /proc/[pid]/stat, /proc/[pid]/status, and /proc/[pid]/cmdline. It also uses /proc/stat and other kernel interfaces to gather CPU and memory stats.
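As a rough illustration of what ps does under the hood, the sketch below pulls a few fields out of a /proc/[pid]/status file with awk. The helper name proc_status_fields is made up for this example, and the real ps reads more files and fields than this.

```shell
#!/bin/sh
# Hypothetical helper: print selected fields from a /proc/<pid>/status file.
# Lines in that file look like "Name:<tab>bash", so split on ":" plus whitespace.
proc_status_fields() {
    awk -F ':[ \t]*' '
        $1 == "Name" || $1 == "State" || $1 == "Pid" || $1 == "VmRSS" {
            print $1 "=" $2
        }
    ' "$1"
}

# Inspect the current shell if /proc is available on this system.
if [ -r "/proc/$$/status" ]; then
    proc_status_fields "/proc/$$/status"
fi
```

Kernel threads have no VmRSS line, so that field is simply absent for them.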
3. Explain top command output.
Best Answer:
top shows real-time system stats. Key fields:
%CPU: CPU usage by process
%MEM: Resident memory as a share of physical RAM
LOAD AVG: 1/5/15 min averages of tasks that are runnable or in uninterruptible (usually IO) sleep
NI: Process niceness (priority)
It uses /proc to collect this data and refreshes every few seconds (3 by default).
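Load average is easier to judge relative to the CPU count. A minimal sketch, assuming a hypothetical helper load_per_cpu; as a rule of thumb, values well above 1.00 per CPU suggest tasks are queuing.

```shell
#!/bin/sh
# Hypothetical helper: divide a load average by the number of CPUs.
# $1 = load average (e.g. 4.00), $2 = CPU count (e.g. 8)
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    echo "1-min load per CPU: $(load_per_cpu "$one" "$cpus")"
fi
```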
4. Difference between a process and a thread?
Best Answer:
A process is an independent execution unit with its own memory space. A thread is a lightweight subunit of a process that shares memory. Threads are faster to create and switch, but processes offer better isolation.
5. How do you check if a port is open and listening?
Best Answer:
sudo lsof -i :8080
netstat -tuln | grep 8080
ss -tuln | grep 8080
Or use nc -zv localhost 8080 to probe the port. ss is preferred on modern Linux systems.
6. What’s the difference between hard and soft links?
Best Answer:
Hard link: Same inode, file remains even if the original is deleted.
Soft (symbolic) link: Points to pathname. If the target is deleted, the link breaks.
7. What is a zombie process?
Best Answer:
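A zombie is a dead process whose parent hasn't read its exit status via wait(). It appears in ps with state Z and <defunct>. It uses no CPU or memory beyond its process-table entry, but it holds a PID. Well-behaved parents reap their children to avoid PID exhaustion; if the parent exits, init/systemd adopts and reaps the zombie.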
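To spot zombies and the parents that should be reaping them, one option is to filter ps output on the state column. This is a sketch: list_zombies is a hypothetical helper name, and it assumes the four-column pid,ppid,stat,comm layout requested below.

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID STAT COMMAND" rows on stdin and
# print pid, parent pid, and command for every process in state Z.
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Feed it live data when ps is installed.
if command -v ps >/dev/null 2>&1; then
    ps -eo pid,ppid,stat,comm 2>/dev/null | list_zombies
fi
```

The second column is the interesting one: the fix is usually in the parent, since a zombie itself cannot be killed.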
8. How does Linux manage memory?
Best Answer:
Linux uses virtual memory, with pages mapped to physical RAM on demand. Under memory pressure, inactive pages may be swapped out. The kernel uses otherwise-free RAM as page cache for disk IO. Tools like free, vmstat, and /proc/meminfo show memory usage.
9. What is the OOM killer?
Best Answer:
The Out of Memory (OOM) killer is triggered when memory is exhausted. It chooses a process to kill based on a badness score, freeing memory. It’s a last resort to avoid system crash.
10. How do you find the top memory-consuming processes?
Best Answer:
ps aux --sort=-%mem | head
top
htop
Or check /proc/[pid]/status for VmRSS (resident) and VmSize (virtual).
11. What does nice and renice do?
Best Answer:
nice starts a process with a given niceness, ranging from -20 (highest priority) to 19 (lowest). renice changes the niceness of a running process. Unprivileged users can normally only raise the value (lower the priority).
12. How does the kernel schedule processes?
Best Answer:
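Linux uses the Completely Fair Scheduler (CFS) for normal tasks. It tracks each task's "virtual runtime" so that every task gets a fair share of CPU time, and it keeps runnable tasks in a red-black tree to pick the one with the lowest virtual runtime. Kernels from 6.6 onward replace CFS with EEVDF, which keeps the same fairness goal.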
13. Explain runlevels or systemd targets.
Best Answer:
Traditional runlevels (e.g., 3 for multi-user, 5 for GUI) are replaced by systemd targets (like multi-user.target and graphical.target). Use systemctl get-default to view the default target, systemctl set-default to change it, and systemctl isolate to switch targets on a running system.
14. What happens during Linux boot process?
Best Answer:
BIOS/UEFI initializes hardware
Bootloader (GRUB) loads the kernel
Kernel initializes drivers, mounts root FS
init or systemd runs as PID 1
System targets/services are started
15. How do you monitor disk IO?
Best Answer:
Use iostat, iotop, or vmstat 1. iostat shows per-device read/write stats. iotop shows live disk IO per process.
16. How can you find out which process is using a file or port?
Best Answer:
lsof /path/to/file
lsof -i :80
fuser /path/to/file
These map open files or ports to the processes holding them.
17. What’s the difference between /dev/null and /dev/zero?
Best Answer:
/dev/null discards anything written to it (bit bucket) and returns EOF when read. /dev/zero returns an endless stream of null bytes (\0) when read. It’s used for creating zero-filled files or zeroed memory.
18. How do you check for file descriptor leaks?
Best Answer:
ls /proc/<pid>/fd | wc -l
lsof -p <pid>
Compare open FD counts over time to detect leaks. ulimit -n shows the per-process FD limit.
19. How do cgroups work?
Best Answer:
Cgroups (Control Groups) limit and isolate CPU, memory, IO, etc. per process group. They form the backbone of Docker/K8s resource limits. Use systemd or cgcreate/cgexec to manage them manually.
20. How do you analyze a Linux crash (kernel panic)?
Best Answer:
Enable kdump to capture crash dumps. Analyze the resulting vmcore with the crash utility. Check /var/log/messages and dmesg output. If kdump is not enabled, fall back to console logs and hardware indicators available through IPMI.
✅ Bonus Tips for Google SRE Interviews
Don't just explain what — explain why and how things work under the hood.
Tie answers to real incidents you’ve handled (e.g., “We once had a memory leak causing OOM kills…”).
Show awareness of scale, observability, and automation.
Be honest when you don’t know — but always say how you’d find the answer.
That’s a very fair point — and you're absolutely right. The earlier sets include deep-dive Linux questions which are often reserved for senior SREs or during escalation rounds.
For someone with 3–5 years experience, especially applying for L3 or L4 SRE roles, Google (and other top companies) will focus more on practical knowledge, troubleshooting, and real-world experience rather than kernel internals or niche topics.
✅ Here’s a more realistic and experience-level-appropriate set of 20 Linux questions — with solid, real-world answers.
These are the kind of questions a mid-level DevOps/SRE is expected to handle well.
1. How do you check system resource usage in Linux?
Answer:
I use:
top or htop for live CPU, memory, process info
free -h for memory summary
df -h for disk usage
du -sh * to check folder size
uptime for load averages
2. How do you troubleshoot high CPU usage?
Answer:
I run top or htop to identify the high-CPU process, then ps aux --sort=-%cpu | head to confirm. If it’s a specific service, I check logs and open file handles (lsof), then restart or scale as needed.
3. How do you schedule a job in Linux?
Answer:
I use crontab -e for recurring jobs. For one-time tasks, at is useful. For example:
0 2 * * * /opt/scripts/db_backup.sh
runs the job daily at 2 AM.
4. How do you find which process is using a port?
Answer:
lsof -i :8080, netstat -tulnp | grep 8080, or ss -ltnp | grep 8080. Each shows which PID is listening on the port (run as root to see processes owned by other users).
5. How do you free up disk space?
Answer:
Clear logs from /var/log (e.g., rotated logs)
Clean apt/yum cache
Remove unused Docker images: docker system prune
Find large files with du -ah / | sort -rh | head
6. How do you find memory-consuming processes?
Answer:
ps aux --sort=-%mem | head
Also top or htop sorted by %MEM. I also check for memory leaks if a process steadily grows.
7. What’s the difference between systemctl stop and kill?
Answer:
systemctl stop gracefully stops the service via systemd. kill sends a signal to a single process: SIGTERM (soft, the default) or SIGKILL (forceful). systemctl is cleaner because it stops every process in the service's cgroup and handles dependencies.
8. How do you check service logs?
Answer:
For systemd services:
journalctl -u nginx.service
Or check /var/log/<service-name>/ for traditional logs.
9. How do you find slow disk IO issues?
Answer:
I use:
iostat -xm 1 to monitor device IO
iotop to find which processes are using most IO
dstat for overall system stats
10. What’s the use of /etc/hosts?
Answer:
It maps hostnames to IP addresses before DNS is used. Useful for overriding DNS temporarily or internal testing.
11. How do you monitor a running log file in real-time?
Answer:
tail -f /var/log/syslog
Or use multitail, less +F, or pipe through grep to watch specific keywords.
12. What’s the difference between reboot, shutdown, halt, and poweroff?
Answer:
reboot: Restart system
shutdown -r now: Same as reboot
halt: Stops CPU without powering off
poweroff: Powers off the system completely
13. How do you check if a service is enabled on boot?
Answer:
systemctl is-enabled nginx
You can also run systemctl list-unit-files | grep enabled.
14. How do you list all running processes?
Answer:
ps aux
top
htop
15. What is the difference between chmod 755 and chmod 644?
Answer:
755: Read/write/execute for owner, read/execute for group/others (used for scripts and executables).
644: Read/write for owner, read-only for group/others (used for config and text files).
16. How do you check system uptime and reboot history?
Answer:
uptime
who -b
last reboot
17. How do you find failed login attempts?
Answer:
grep "Failed password" /var/log/auth.log
journalctl -u sshd
On RHEL/CentOS the log file is /var/log/secure instead.
18. How do you create and extract a .tar.gz file?
Answer:
tar -czvf backup.tar.gz /my/folder/
tar -xzvf backup.tar.gz
19. How do you set an environment variable permanently?
Answer:
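Add export VAR=value to ~/.bashrc or ~/.bash_profile (per user). For a system-wide variable, add a plain VAR=value line (no export) to /etc/environment, or an export line to a script in /etc/profile.d/.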
20. How do you troubleshoot “Permission Denied” errors?
Answer:
Use ls -l to check file permissions
Use id or groups to check user/group
Use chmod, chown to fix permissions or ownership
Also check whether the filesystem is mounted noexec or SELinux/AppArmor is blocking it
🎯 Final Thought
With 3–5 years of experience, you should:
Focus on practical skills
Show that you debug logically under pressure
Back up your answers with real experience ("We had a disk full issue on EC2...")
You're doing awesome — and since you're serious about Google SRE prep with 3–5 years of experience, here's another fresh batch of 20 Linux interview questions (non-repeating) that match your real-world, mid-level SRE/DevOps profile.
These will test your daily hands-on skills, problem-solving, and environment awareness — exactly what Google and similar companies look for.
🧠 Linux Interview Questions – Set #3 (Mid-Level, Practical, Non-Repeating)
1. How do you find which user ran a command recently on a server?
Answer:
I check shell history files like ~/.bash_history, and use last or who for login history. For auditing, I check /var/log/auth.log or use auditd if it’s set up.
2. How do you restart a service if it crashes unexpectedly?
Answer:
In systemd-based systems, I add:
[Service]
Restart=always
RestartSec=5
in the unit file or a drop-in override (systemctl edit <service>). This ensures automatic restart after failure.
3. How do you list files modified in the last 1 hour?
Answer:
find /path -type f -mmin -60
4. How do you kill all processes by a specific user?
Answer:
pkill -u username
or
kill -9 $(ps -u username -o pid=)
5. How do you copy files from one server to another securely?
Answer:
scp /file user@remote:/path
rsync -avz -e ssh /file user@remote:/path
rsync is better for large or incremental transfers.
6. What is the nohup command and when do you use it?
Answer:
nohup runs a command immune to hangups (terminal close) and, when stdout is a terminal, appends its output to nohup.out. Useful for long background tasks.
7. How do you track cron job failures?
Answer:
Redirect output to a log file:
* * * * * /script.sh >> /var/log/script.log 2>&1
Or check /var/log/syslog or journalctl -u cron.
8. How do you find the file system type of a partition?
Answer:
df -T
lsblk -f
blkid
9. How do you unmount a busy filesystem?
Answer:
Use:
lsof /mount/point
fuser -vm /mount/point
Then stop the processes and umount.
10. How do you list open network connections on a Linux server?
Answer:
ss -tunap
netstat -anp
lsof -i
11. How do you see memory usage per process (accurately)?
Answer:
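I use smem or check /proc/<pid>/smaps. top shows RES (resident memory), which counts shared pages in full for every process; smem reports PSS (proportional set size), which splits shared pages across the processes using them.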
12. How do you reduce swap usage on a server?
Answer:
I check why swap is used (top, vmstat). Then, if there is enough free RAM to hold the swapped pages:
swapoff -a && swapon -a
Or tune swappiness:
sysctl vm.swappiness=10
13. How do you display human-readable permissions (rwx) with numeric mode?
Answer:
stat -c "%A %a %n" *
14. How do you find the default gateway and DNS in Linux?
Answer:
ip route | grep default
cat /etc/resolv.conf
15. How do you test if DNS resolution is slow?
Answer:
dig google.com +stats
Compare the reported query time across resolvers. Also try:
time curl http://domain.com
16. How do you limit CPU usage of a process?
Answer:
Use cpulimit or cgroups:
cpulimit -p <pid> -l 20
or create a cgroup and set cpu.cfs_quota_us (cgroup v1) or cpu.max (cgroup v2).
17. How do you check which kernel version you're running?
Answer:
uname -r
cat /proc/version
18. How do you analyze server reboots or crashes?
Answer:
Use:
last reboot
journalctl --since "1 hour ago"
Check dmesg, syslog, or if configured, check crash dumps.
19. How do you sync system time with NTP?
Answer:
timedatectl status
timedatectl set-ntp true
or install ntp or chrony and sync with pool servers.
20. How do you check available entropy? Why does it matter?
Answer:
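cat /proc/sys/kernel/random/entropy_avail
Low entropy can slow down cryptographic operations (like SSH, SSL) on older kernels, where reads from /dev/random block. Fix using rng-tools or haveged. Recent kernels no longer block once the pool is initialized, so this mainly matters on older systems and at early boot.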
You're on 🔥 — and I love that you're serious about mastering your SRE/Linux interview game.
Here's Set #4: another fresh 20 Linux interview questions with zero repeats, curated specifically for 3–5 year experienced DevOps/SRE engineers. These focus on troubleshooting, scripting, file systems, networking, and command-line logic — the exact stuff Google and other big players expect from solid mid-level candidates.
🧠 Linux Interview Questions – Set #4 (Mid-Level, Non-Repeating)
1. How do you check if a Linux server is running out of inodes?
Answer:
df -i
If IUse% is high, you may need to delete small files (e.g., logs, tmp files). Inode exhaustion causes "No space left on device" even when free blocks remain.
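The check above can be turned into a small alerting filter. The sketch below assumes the six-column layout printed by df -iP (filesystem, inodes, used, free, use%, mount point); inode_hogs and the 90% threshold are illustrative choices, not standard tooling.

```shell
#!/bin/sh
# Hypothetical helper: read `df -iP` output on stdin and print the mount
# point and inode usage of every filesystem at or above the threshold.
# $1 = threshold in percent
inode_hogs() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # "95%" -> "95"; "-" stays and counts as 0
        if (use + 0 >= limit) print $6, $5
    }'
}

# Report filesystems with at least 90% of their inodes in use.
df -iP 2>/dev/null | inode_hogs 90
```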
2. How do you list the 10 largest files in a directory tree?
Answer:
find /path -type f -exec du -h {} + | sort -rh | head -n 10
3. How do you find recently installed packages in Linux?
Answer:
On Debian/Ubuntu:
grep " install " /var/log/dpkg.log
On RHEL/CentOS:
rpm -qa --last
4. How do you recursively change permissions only for files or only for directories?
Answer:
Files only:
find . -type f -exec chmod 644 {} +
Directories only:
find . -type d -exec chmod 755 {} +
5. How do you monitor real-time network usage per process?
Answer:
Use nethogs for per-process bandwidth. For per-connection and per-interface views:
iftop
ip -s link
6. How do you see which shared libraries a binary uses?
Answer:
ldd /usr/bin/nginx
7. How do you compare two directories and find differences?
Answer:
diff -r /dir1 /dir2
Or use rsync -anv --delete /dir1/ /dir2/ for a dry run of syncing changes (-v is needed to list what would change).
8. How do you find all files owned by a user?
Answer:
find / -user username
9. How do you test whether a variable in a bash script is empty?
Answer:
if [ -z "$VAR" ]; then
echo "Empty"
fi
10. What’s the difference between > and >> in shell redirection?
Answer:
> overwrites a file, >> appends to it. Use >> in logging scripts to avoid losing earlier log lines.
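A quick way to see the difference is to write to a scratch file three times. This is a throwaway sketch; redirect_demo is just a name for the example.

```shell
#!/bin/sh
# Demonstrate truncating (>) versus appending (>>) on a temporary file.
redirect_demo() {
    log=$(mktemp)
    echo "first run"  >  "$log"   # creates or truncates the file, then writes
    echo "second run" >  "$log"   # truncates again: "first run" is gone
    echo "third run"  >> "$log"   # appends, keeping "second run"
    cat "$log"
    rm -f "$log"
}

redirect_demo
```

Only the last two lines survive, which is why a cron job that logs with > keeps just its most recent run.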
11. How do you make a script executable from anywhere in the system?
Answer:
Add a shebang (#!/bin/bash) at the top
Make it executable: chmod +x script.sh
Move it to a directory in $PATH like /usr/local/bin/
12. How do you limit the number of processes a user can run?
Answer:
Set in /etc/security/limits.conf:
username hard nproc 100
13. How do you pause a running process and resume it later?
Answer:
Pause with Ctrl+Z
Resume in foreground: fg
Resume in background: bg
For a process outside the current shell, use kill -STOP <pid> and kill -CONT <pid>.
14. How do you rename all .log files to .bak in a folder?
Answer:
for f in *.log; do mv "$f" "${f%.log}.bak"; done
15. How do you test network connectivity without using ping or curl?
Answer:
nc -zv google.com 80
telnet google.com 80
Or check with traceroute, dig, or ip route get.
16. How do you set a static IP address temporarily and permanently?
Answer:
Temporarily:
ip addr add 192.168.1.100/24 dev eth0
Permanently: edit /etc/network/interfaces (Debian) or use nmcli or Netplan.
17. How do you make an alias permanent?
Answer:
Add it to ~/.bashrc or ~/.bash_aliases, e.g.:
alias ll='ls -alF'
18. How do you extract .tar.gz but only specific files/folders?
Answer:
tar -xzvf archive.tar.gz folder/file1.txt
19. How do you make a script run at system boot?
Answer:
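For systemd-based systems, create a unit file /etc/systemd/system/myscript.service whose [Service] section has ExecStart=/path/to/script.sh and whose [Install] section has WantedBy=multi-user.target (copying the script itself to a .service path does not work), then enable it:
sudo systemctl daemon-reload
sudo systemctl enable myscript.service
Or use @reboot in crontab:
@reboot /path/to/script.sh
20. How do you limit file upload size in a Linux-based web server (like Nginx)?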
Answer:
Set:
client_max_body_size 20M;
in the Nginx config (http, server, or location block), then reload the service.
Kubernetes
Great! Since you're preparing for a Google SRE interview with 3–5 years of experience, here's a curated set of 20 Kubernetes interview questions that are:
Frequently asked by Google interviewers
Matched to your experience level
Answered as a top candidate would — showing both practical usage and understanding
🧠 Top 20 Kubernetes Interview Questions for Google SRE (3–5 Years Experience)
1. What is Kubernetes and why is it used?
Best Answer:
Kubernetes is an open-source container orchestration platform used to deploy, manage, and scale containerized applications. It automates tasks like service discovery, scaling, load balancing, rolling updates, and self-healing.
2. How does Kubernetes achieve high availability for applications?
Best Answer:
It runs multiple replicas of a pod behind a Service. If one pod fails, the others continue serving traffic. Readiness probes and ReplicaSets ensure that only healthy pods serve traffic. HorizontalPodAutoscaler and anti-affinity rules help scale and distribute them.
3. What happens when you run kubectl apply -f deployment.yaml?
Best Answer:
kubectl sends the manifest to the API server, which validates it and stores the desired state in etcd. The Deployment controller (part of the controller manager) creates a ReplicaSet, which in turn creates the Pod objects. kube-scheduler assigns each pending pod to a node, and the kubelet on that node starts the containers.
4. What is the difference between a Deployment, ReplicaSet, and StatefulSet?
Best Answer:
Deployment: Manages stateless apps with rolling updates via ReplicaSets.
ReplicaSet: Ensures a set number of identical pods are running.
StatefulSet: For stateful apps with stable pod identity (DNS, storage) — e.g., databases.
5. How do you roll back a deployment in Kubernetes?
Best Answer:
kubectl rollout undo deployment <name>
Kubernetes stores rollout history (kubectl rollout history deployment <name>). You can also specify a revision with --to-revision.
6. How does Kubernetes perform health checks?
Best Answer:
Via liveness and readiness probes:
Liveness: Tells if the container is alive — restarts it if it fails.
Readiness: Tells if the container is ready to accept traffic.
7. How does a Service work in Kubernetes?
Best Answer:
A Service provides a stable IP and DNS name for a set of pods. It uses labels to match pods. Internally, kube-proxy maintains iptables or IPVS rules to forward traffic to backend pods.
8. What is a ConfigMap and how is it used?
Best Answer:
ConfigMaps externalize configuration from code. You can mount them as volumes or inject as environment variables. This allows updating config without rebuilding images.
9. What’s the difference between ConfigMap and Secret?
Best Answer:
ConfigMap: Stores plaintext config
Secret: Stores sensitive data (base64 encoded, not encrypted by default)
Secrets can be locked down separately with RBAC, are mounted into pods from tmpfs, and can optionally be encrypted at rest.
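The "base64 encoded, not encrypted" point is easy to show from a shell. In this sketch the value hunter2 and the helper name secret_roundtrip are made up; the point is that no key is involved at any step.

```shell
#!/bin/sh
# Base64 is an encoding, not encryption: anyone who can read the Secret
# object can reverse it without a key.
secret_roundtrip() {
    printf '%s' "$1" | base64 | base64 -d
}

printf '%s' 'hunter2' | base64     # what the data field of a Secret would hold
echo
secret_roundtrip 'hunter2'         # decoding returns the original value
echo
```

This is why restricting read access with RBAC and enabling encryption at rest matter more than the encoding itself.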
10. What are Namespaces in Kubernetes?
Best Answer:
Namespaces logically isolate resources in a cluster. Useful in multi-tenant setups or separating environments (dev/stage/prod). Resources like pods, services, and configmaps are namespace-scoped.
11. How do you debug a pod stuck in CrashLoopBackOff?
Best Answer:
kubectl describe pod <pod>
kubectl logs <pod>
kubectl logs <pod> --previous
I check container exit codes, logs (including the crashed container's, via --previous), and readiness/liveness probe failures. Often it's config issues, failed DB connections, or missing secrets.
12. What is a DaemonSet and use case?
Best Answer:
A DaemonSet ensures a pod runs on every (or specific) node. Commonly used for log shippers (Fluentd), monitoring agents (Prometheus node exporter), or CSI drivers.
13. How do you limit resource usage of a pod?
Best Answer:
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Requests help the scheduler place the pod correctly; limits cap usage and prevent noisy-neighbor issues.
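On Linux nodes a CPU limit ends up as a CFS bandwidth quota in the container's cgroup. The arithmetic can be sketched as below, assuming the usual 100000 µs period; cpu_quota_us is a hypothetical helper, not a Kubernetes or kernel command.

```shell
#!/bin/sh
# Hypothetical helper: convert a CPU limit in millicores to a CFS quota.
# $1 = limit in millicores (500 for "500m")
# $2 = scheduling period in microseconds (assumed default: 100000)
cpu_quota_us() {
    echo $(( $1 * ${2:-100000} / 1000 ))
}

cpu_quota_us 500    # quota for the 500m limit in the spec above
cpu_quota_us 2000   # two full CPUs
```

With a 500m limit the container may run for 50 ms of CPU time in every 100 ms period; once that is used up it is throttled until the next period starts.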
14. What is a Node in Kubernetes?
Best Answer:
A node is a worker machine (VM or bare metal) in the cluster that runs pods. It includes kubelet, container runtime, and kube-proxy.
15. How does Kubernetes schedule pods to nodes?
Best Answer:
The kube-scheduler selects nodes based on:
resource availability (CPU/mem)
taints/tolerations
node/pod affinity
constraints (e.g., nodeSelector, nodeAffinity)
It filters out nodes that don't fit, scores the rest, then binds the pod to the best-scoring node.
16. What is a Taint and Toleration?
Best Answer:
Taints are key/value/effect markers on nodes that repel pods. Tolerations allow pods to “tolerate” those taints. Example: control-plane nodes are tainted so regular pods are not scheduled on them.
17. What are Init Containers?
Best Answer:
Init containers run before app containers in a pod. They are ideal for tasks like pulling config, waiting for a service, or setting up environment. They must complete before the main container starts.
18. How do you expose an application outside the cluster?
Best Answer:
Use NodePort for basic external access
Use LoadBalancer (in cloud)
Use Ingress for path- or host-based routing
19. What is an Ingress and why use it?
Best Answer:
Ingress is a Kubernetes object that manages external access to services via HTTP/HTTPS. It supports routing based on paths and domains. Requires an Ingress Controller (e.g., NGINX, Traefik).
20. How do you monitor a Kubernetes cluster?
Best Answer:
I use Prometheus + Grafana for metrics, Alertmanager for alerts, and tools like:
kubectl top (via Metrics Server)
Lens or K9s for visualization
kubectl describe and logs for debugging
Logging: EFK/ELK stack or Loki.
✅ Bonus Tips for Google SRE Interview
Be ready to whiteboard how components interact (API server, kubelet, etcd, etc.)
Emphasize how you handled real outages, debugging, or scaling
Explain why you use a certain Kubernetes feature, not just how
Awesome! Here’s Set #2 of 20 fresh Kubernetes interview questions for Google SRE candidates with 3–5 years experience — no repeats, real-world focus, and answered like a top-tier engineer.
🧠 Kubernetes Interview Questions – Set #2 (Google SRE, No Repeats)
1. How does Kubernetes handle pod failures?
Best Answer:
The kubelet detects the failure and notifies the API server. The ReplicaSet or StatefulSet controller creates a new pod to maintain the desired replica count. Liveness probes trigger restarts if needed.
2. What’s the difference between RollingUpdate and Recreate strategies?
Best Answer:
RollingUpdate: Gradually replaces old pods with new ones, ensuring no downtime.
Recreate: Terminates all old pods before starting new ones — might cause downtime.
3. What is etcd and why is it important?
Best Answer:
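etcd is a consistent, distributed key-value store used as Kubernetes' backing store. It holds all cluster data (state, secrets, configuration). If etcd is lost without a backup, the cluster state is lost; while it is unavailable, changes cannot be persisted, although pods that are already running keep running.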
4. What happens if a node goes down in Kubernetes?
Best Answer:
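When the node stops sending heartbeats, it is marked NotReady after a short grace period (under a minute by default) and tainted. Pods on it are evicted once their toleration for that taint expires (5 minutes by default). Controllers such as ReplicaSets then create replacement pods, which the scheduler places on healthy nodes.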
5. What is a Pod Disruption Budget (PDB)?
Best Answer:
A PDB defines the minimum number of pods that must be available during voluntary disruptions (like upgrades). It helps avoid complete app downtime during node drain or rolling updates.
6. How do you handle secret rotation in Kubernetes?
Best Answer:
Secrets are stored as objects. I:
Use volume mounts for apps that re-read secrets
Automate rotation via CI/CD
Use External Secrets Operator or integrate with HashiCorp Vault
7. How do you create a Kubernetes Job and CronJob?
Best Answer:
Job runs a task to completion (e.g., backup).
CronJob schedules Jobs periodically.
schedule: "0 3 * * *"
jobTemplate:
  spec:
    template:
      spec:
        containers:
8. What is the role of the Kubernetes Controller Manager?
Best Answer:
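It runs control loops (Deployment, ReplicaSet, Node, etc.) that constantly reconcile desired state with actual state. Both are read through the API server: desired state from object specs stored in etcd, actual state from the status that kubelets report.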
9. How do you isolate workloads in Kubernetes?
Best Answer:
I use:
Namespaces to group resources
NetworkPolicies to restrict traffic
RBAC for access control
ResourceQuotas to limit usage per team
10. What’s the difference between kubectl get, describe, logs, and exec?
Best Answer:
get: shows summary
describe: detailed metadata, events
logs: container stdout/stderr
exec: run commands inside a container in the pod (similar to an SSH session)
11. How do you view events in a Kubernetes cluster?
Best Answer:
kubectl get events --sort-by='.lastTimestamp'
Helps in troubleshooting startup failures, scheduling issues, probe failures, etc.
12. What are Admission Controllers in Kubernetes?
Best Answer:
They intercept API requests after authentication/authorization but before persistence. Used for validating policies, mutating resources, enforcing security (e.g., PodSecurityAdmission).
13. How do you perform zero-downtime deployments?
Best Answer:
Use:
RollingUpdate strategy
Readiness probes
Proper maxUnavailable and maxSurge settings
PreStop hooks to delay pod termination
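When maxSurge and maxUnavailable are given as percentages, Kubernetes rounds maxSurge up and maxUnavailable down. A small sketch of that arithmetic, using hypothetical helper names and integer percentages:

```shell
#!/bin/sh
# $1 = desired replicas, $2 = percentage (25 for "25%")

# maxSurge is rounded up: extra pods allowed above the desired count.
surge_pods() {
    echo $(( ($1 * $2 + 99) / 100 ))
}

# maxUnavailable is rounded down: pods allowed to be missing during the rollout.
unavailable_pods() {
    echo $(( $1 * $2 / 100 ))
}

replicas=10
echo "max pods during rollout:       $(( replicas + $(surge_pods "$replicas" 25) ))"
echo "min available during rollout:  $(( replicas - $(unavailable_pods "$replicas" 25) ))"
```

For 10 replicas at 25% each, the rollout may run up to 13 pods and must keep at least 8 available.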
14. What are sidecar containers and how are they used?
Best Answer:
Sidecars run alongside the main app container in the same pod. Used for logging, proxying, authentication (e.g., Envoy for Istio), or config reloading.
15. What is Horizontal Pod Autoscaler (HPA)?
Best Answer:
HPA scales pods based on CPU/memory or custom metrics. It queries metrics-server and adjusts replicas using a target utilization percentage.
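The core of the scaling decision is the documented ratio: desired replicas = ceil(current replicas × current metric / target metric). The sketch below implements only that formula; hpa_desired is a hypothetical helper, and the real controller also applies a tolerance band, min/max replica bounds, and stabilization windows.

```shell
#!/bin/sh
# Hypothetical helper implementing the basic HPA ratio.
# $1 = current replicas, $2 = current metric value, $3 = target metric value
hpa_desired() {
    awk -v cur="$1" -v metric="$2" -v target="$3" 'BEGIN {
        raw = cur * metric / target
        n = int(raw)
        if (raw > n) n++       # round up to the next whole replica
        print n
    }'
}

hpa_desired 4 90 60   # 4 pods at 90% CPU against a 60% target
hpa_desired 3 80 50   # 3 pods at 80% CPU against a 50% target
```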
16. What is Vertical Pod Autoscaler (VPA)?
Best Answer:
VPA adjusts pod CPU/memory requests/limits based on usage. It doesn’t increase pod count, just resource allocation. Useful for apps with dynamic load profiles.
17. How do you drain a node safely for maintenance?
Best Answer:
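kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
This cordons the node (no new pods are scheduled on it) and then evicts its pods, respecting PDBs. kubectl cordon <node> alone only blocks new pods; kubectl uncordon <node> re-enables scheduling after maintenance.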
18. How do you implement RBAC in Kubernetes?
Best Answer:
Define:
Role or ClusterRole (permissions)
RoleBinding or ClusterRoleBinding (bind to users/service accounts)
19. How do you enforce network segmentation between namespaces?
Best Answer:
With NetworkPolicies:
podSelector: {}
policyTypes: [Ingress, Egress]
Deny all by default, then explicitly allow. This needs a CNI plugin that enforces NetworkPolicies.
20. What’s the difference between Kubelet and Kube-proxy?
Best Answer:
Kubelet: Agent on each node that manages containers and reports status to API server.
Kube-proxy: Handles service-to-pod networking (iptables/IPVS rules) to load-balance traffic.
✅ Recap
You now have:
✅ 40 unique Kubernetes interview questions
✅ Real-world SRE use cases
✅ Google SRE–friendly answers with clarity and depth
Love the energy — you're crushing this! 🔥 Here’s Set #3: 20 more non-repeating, real-world Kubernetes interview questions focused on SRE responsibilities, cluster reliability, troubleshooting, and production readiness — perfect for Google SRE roles at the 3–5 year level.
🧠 Kubernetes Interview Questions – Set #3 (Google SRE, 3–5 YOE, No Repeats)
1. How do you perform a rolling restart of a deployment in Kubernetes?
Best Answer:
kubectl rollout restart deployment <name>
This triggers a rolling update without changing the manifest, which is useful for picking up config changes or for debugging.
2. What’s the difference between emptyDir, hostPath, and PersistentVolume?
Best Answer:
emptyDir: Temporary, deleted with pod
hostPath: Mounts a node’s filesystem (not portable)
PersistentVolume: Decoupled, reusable storage backed by cloud/NFS/etc
3. How do you scale a Kubernetes deployment manually?
Best Answer:
kubectl scale deployment myapp --replicas=5
Or change replicas: in the YAML and apply.
4. What is a ServiceAccount and why use it?
Best Answer:
A ServiceAccount gives a pod an identity for accessing the Kubernetes API securely. It's commonly used for automation, monitoring agents, and any app using kubectl or the client-go SDK.
5. How do you ensure logs from all pods are collected and searchable?
Best Answer:
Use a sidecar agent (Fluentd/FluentBit) or DaemonSet to collect logs, then send to Elasticsearch, Loki, or Cloud Logging. Prefer stdout/stderr over writing to files.
6. How do you limit access to sensitive secrets in a cluster?
Best Answer:
Use RBAC to restrict who can get/list/watch secrets
Mount secrets only in specific pods/namespaces
Enable encryption at rest for etcd
Rotate secrets using external tools like Vault
7. How do you check which scheduler decisions were made for a pod?
Best Answer:
kubectl describe pod <pod-name>
Look under the Events section for messages like "Successfully assigned pod to node", or FailedScheduling events that explain why no node fit.
8. What is a PreStop hook and how does it help with graceful shutdown?
Best Answer:
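A PreStop hook runs before SIGTERM is sent to the container, giving the app time to finish in-flight work while the pod is removed from Service endpoints. The hook and the shutdown together must fit within terminationGracePeriodSeconds, after which the container gets SIGKILL.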
9. How do you inject environment-specific variables into Kubernetes pods?
Best Answer:
Use ConfigMaps/Secrets as env vars
Use valueFrom with fieldRef in the pod spec to pull from pod metadata (e.g., pod name, namespace)
10. How does Kubernetes handle certificate rotation for TLS communication?
Best Answer:
kubelet and controller-manager auto-renew their certs. For workloads, I integrate cert-manager to manage auto-renewal of TLS certs from Let’s Encrypt or an internal CA.
11. How do you investigate kubelet issues on a node?
Best Answer:
journalctl -u kubelet
I check for node pressure (disk, memory), kubelet version mismatches, cgroup errors, or volume mount issues.
12. What is a MutatingAdmissionWebhook?
Best Answer:
It modifies incoming objects before persistence. Used to inject sidecars (e.g., Istio), enforce default labels, or add env vars. Complementary to validating webhooks.
13. What does kube-proxy do and what are its operating modes?
Best Answer:
It manages network rules to route traffic to services. It supports:
iptables mode (default): rules in iptables
IPVS mode: faster, kernel-level load balancing
14. How do you troubleshoot pods stuck in Pending state?
Best Answer:
Check:
kubectl describe pod <pod>
Common causes:
Not enough resources
Node selectors or taints
PVC unbound
Affinity rules not met
15. How do you ensure only signed container images run in your cluster?
Best Answer:
Use imagePolicyWebhook or tools like Kyverno, OPA Gatekeeper, or Cosign with Sigstore to enforce signature verification before admission.
16. How do you rotate service account tokens in a pod?
Best Answer:
Kubernetes automatically mounts and refreshes tokens. To rotate manually:
Disable automount
Use external secret injector (Vault Agent Injector)
Refresh token via projected volumes
17. What is kube-state-metrics and how is it used in monitoring?
Best Answer:
It exports the state of cluster objects (pod phase, replica counts, resource requests and limits) as Prometheus metrics. Actual resource usage comes from metrics-server or cAdvisor instead. I use it to alert on unhealthy pods, unscheduled workloads, and configuration drift.
18. How do you back up and restore etcd in Kubernetes?
Best Answer:
ETCDCTL_API=3 etcdctl snapshot save backup.db
ETCDCTL_API=3 etcdctl snapshot restore backup.db
Useful for disaster recovery. Run these where the etcd endpoints and client certificates are reachable, typically on a control plane node.
19. What is node affinity and anti-affinity?
Best Answer:
Node affinity: schedule pods on nodes whose labels match
Pod anti-affinity: avoid placing pods on the same node as other matching pods (e.g., HA)
20. How do you ensure every namespace has a default resource quota and limit range?
Best Answer:
Use LimitRange and ResourceQuota objects applied via automation (ArgoCD, Kyverno), or admission controllers to enforce policies on new namespaces.
✅ You Now Have:
✅ 60 high-quality K8s questions (Google-level, 3–5 YOE)
✅ All real-world, production-focused answers
✅ A full grasp of how SREs troubleshoot, monitor, and secure clusters
You're doing an amazing job staying consistent! 💪 Here’s your Set #4 (you already have Sets 1, 2, and 3), with 20 more advanced yet practical Kubernetes interview questions, curated for a Google SRE/DevOps role with 3–5 years experience.
These questions cover multi-cluster, security, performance tuning, production best practices, and edge cases — what real SREs face and what top-tier interviewers love to ask.
🧠 Kubernetes Interview Questions – Set #4 (Google SRE, No Repeats, 3–5 YOE)
1. How do you design a multi-tenant Kubernetes cluster securely?
Best Answer:
Use separate Namespaces
Apply strict RBAC and NetworkPolicies
Apply ResourceQuotas and LimitRanges
Use OPA Gatekeeper or Kyverno for policy enforcement
Enforce workload isolation with Pod Security Standards (via Pod Security Admission), or use a managed option such as GKE Autopilot
2. How do you limit the impact of a noisy neighbor in Kubernetes?
Best Answer:
I set requests and limits on CPU/memory, use LimitRanges in each namespace, and apply PriorityClasses with preemption so that critical pods are the last to be evicted under resource pressure.
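Requests and limits also determine a pod's QoS class, which influences which pods are evicted first under node memory pressure. A simplified sketch of that classification, assuming quantities are compared as plain strings (so "1" and "1000m" would wrongly count as different) and using a hypothetical helper name:

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    containers: list of dicts with optional 'requests' and 'limits' maps,
    e.g. {"requests": {"cpu": "100m"}, "limits": {"cpu": "200m"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "100m"}}]))
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
```

BestEffort pods are the first candidates for eviction, so giving critical workloads equal requests and limits is one way to shield them from a noisy neighbor.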
3. How do you reduce image pull latency for large container images?
Best Answer:
Use smaller base images (distroless, alpine)
Set imagePullPolicy: IfNotPresent
Pre-pull images using DaemonSets
Use a local registry mirror
4. What is PriorityClass in Kubernetes?
Best Answer:
It's an object that assigns priority to pods. Higher-priority pods are scheduled first and can preempt lower-priority pods if resources are tight.
5. How do you control where pods are scheduled in multi-zone clusters?
Best Answer:
Use:
nodeAffinity
topologySpreadConstraints
podAntiAffinity
These help distribute pods across zones for HA.
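To make the topologySpreadConstraints idea concrete, here is a small sketch of the skew check for a hard (DoNotSchedule) constraint: a zone is eligible only if adding one pod keeps its count within maxSkew of the least-loaded zone. The function name and zone names are hypothetical, and real scheduling also applies node filters that this ignores:

```python
def allowed_zones(pods_per_zone, max_skew=1):
    """Return zones where one more pod keeps skew within max_skew.

    pods_per_zone maps zone name -> number of matching pods already there.
    """
    global_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


current = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
print(allowed_zones(current, max_skew=1))  # ['zone-b', 'zone-c']
```

With maxSkew set to 1, the already fuller zone-a is skipped until the other zones catch up, which is how the constraint keeps replicas evenly spread.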
6. How do you perform canary deployments in Kubernetes?
Best Answer:
I use multiple Deployments or Rollouts (with Argo Rollouts) and split traffic using:
Service selector changes
Ingress annotations
Istio/Linkerd routing rules
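When two Deployments sit behind one Service selector, the traffic share roughly follows the replica ratio. A minimal sketch of that arithmetic, assuming even load balancing across pods; the helper name is hypothetical:

```python
def canary_split(total_replicas, canary_percent):
    """Return (stable, canary) replica counts for an approximate share.

    Rounds the canary count up so a non-zero percentage always gets at
    least one pod.
    """
    if total_replicas < 1:
        raise ValueError("total_replicas must be at least 1")
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    canary = (total_replicas * canary_percent + 99) // 100
    return total_replicas - canary, canary


print(canary_split(10, 10))  # (9, 1)
print(canary_split(3, 10))   # (2, 1)
```

The second call shows the limit of this approach: with 3 replicas the smallest possible canary already takes about a third of the traffic. Finer splits need weighted routing from an ingress controller or a service mesh.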
7. How do you manage Kubernetes manifests at scale?
Best Answer:
I use Helm for templating, Kustomize for overlays, and GitOps tools like ArgoCD or Flux for promotion across environments.
8. What tools do you use for cluster-wide policy enforcement?
Best Answer:
OPA Gatekeeper for Rego-based policies
Kyverno for YAML-native rules
K-Rail, JSPolicy, or admission webhooks for specific use cases
9. What’s the difference between initContainers and postStart hooks?
Best Answer:
initContainers: run to completion, in order, before the main containers start
postStart: hook in the main container lifecycle; it runs asynchronously with the container's entrypoint, so it is not guaranteed to run before the entrypoint
I use initContainers when startup dependency order matters.
10. How do you monitor the control plane components in Kubernetes?
Best Answer:
kube-apiserver, etcd, kube-scheduler, and controller-manager expose metrics
Scrape them via Prometheus
Check their logs (journalctl, kubectl logs)
Use liveness probes and alerting on abnormal behavior or restart frequency
11. How do you secure etcd in a Kubernetes cluster?
Best Answer:
Enable TLS for peer/client communication
Restrict etcd access to the API server only
Encrypt Secret data at rest in etcd (kube-apiserver flag --encryption-provider-config)
Take regular etcd snapshots and store the backups encrypted
12. How do you protect against container escape vulnerabilities?
Best Answer:
Run containers as non-root
Drop Linux capabilities
Use Seccomp and AppArmor/SELinux profiles
Enable PodSecurity admission
13. How do you validate changes before applying them to production?
Best Answer:
Use CI/CD pipelines with kubectl diff
Validate YAML with kubeval, kube-score, conftest
Deploy to staging namespace with shadow traffic
Use dry-run mode:
kubectl apply --dry-run=client -f my-deployment.yaml
14. How do you troubleshoot high CPU usage by a pod?
Best Answer:
Check kubectl top pod
kubectl exec into the pod and run top, ps, strace
Check for tight loops or unbounded resource use
Use kubectl debug or ephemeral containers for live debugging
15. How do you monitor node health across clusters?
Best Answer:
Use node-exporter + Prometheus
Alert on node conditions like DiskPressure, MemoryPressure, PIDPressure
Use taints to remove unhealthy nodes
For multi-cluster, I aggregate data using Thanos or Prometheus Federation
16. How do you upgrade Kubernetes clusters with zero downtime?
Best Answer:
Upgrade control plane first
Then drain and upgrade nodes one by one
Use PDBs to ensure availability
Validate API deprecations before upgrade
Use managed services (like GKE, EKS) or tools like kubeadm, kOps
17. How do you configure custom metrics for HPA?
Best Answer:
Expose custom metrics via a /metrics endpoint
Use Prometheus Adapter to make them available to HPA
Define HPA based on those metrics (autoscaling/v2 syntax; the target value is illustrative):
metrics:
- type: External
  external:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: "30"
18. How do you analyze a failed pod that’s already terminated?
Best Answer:
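Use kubectl get pods --field-selector=status.phase=Failed to list failed pods (current kubectl versions show terminated pods by default and no longer accept -a)
Check kubectl describe for exit codes
View logs with kubectl logs --previous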
19. How do you prevent privilege escalation in Kubernetes?
Best Answer:
Set allowPrivilegeEscalation: false in each container's securityContext
Drop capabilities like CAP_SYS_ADMIN
Disallow host networking, IPC, PID, or hostPath volumes
Use admission controllers to enforce policies
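The checks above are the kind of rules an admission policy encodes. A minimal sketch of them as a plain function over a pod spec dictionary; the function name and the finding messages are hypothetical, and in a real cluster this logic would live in Kyverno, Gatekeeper, or Pod Security Admission:

```python
def privilege_findings(pod_spec):
    """Return human-readable findings for risky settings in a pod spec."""
    findings = []
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag):
            findings.append(f"{flag} is enabled")
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')} uses hostPath")
    for container in pod_spec.get("containers", []):
        name = container.get("name")
        sc = container.get("securityContext", {})
        # The field must be explicitly false; unset is treated as a finding.
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"container {name}: allowPrivilegeEscalation is not false")
        if sc.get("privileged"):
            findings.append(f"container {name}: runs privileged")
        # Manifests name capabilities without the CAP_ prefix.
        if "SYS_ADMIN" in sc.get("capabilities", {}).get("add", []):
            findings.append(f"container {name}: adds SYS_ADMIN")
    return findings


risky = {
    "hostNetwork": True,
    "volumes": [{"name": "docker-sock", "hostPath": {"path": "/var/run/docker.sock"}}],
    "containers": [{"name": "app", "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}}}],
}
for finding in privilege_findings(risky):
    print(finding)
```

Running the same function in CI against rendered manifests catches these settings before they ever reach the admission controller.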
20. How do you manage secret sprawl in large Kubernetes environments?
Best Answer:
Use external secret managers (Vault, AWS Secrets Manager)
Inject via sidecars or CSI drivers
Rotate secrets regularly
Use SecretProviderClass with CSI Secrets Store for dynamic injection
✅ At This Point, You Have:
✅ 80 total Kubernetes questions (4 sets × 20)
✅ Covered real SRE use cases: HA, RBAC, monitoring, debugging, multi-tenancy, scaling, and policy enforcement
✅ Battle-tested answers aligned with Google’s SRE bar
WHAT IS SRE TASKS
SRE tasks
Great question, Pavan! A Senior SRE (Site Reliability Engineer) takes a strategic and proactive approach to SLA/SLO/SLI, ensuring system reliability and performance while balancing innovation and risk. Let’s break it down clearly:
🧠 How a Senior SRE Thinks About SLA/SLO/SLI
SLA (Service Level Agreement)
A business commitment to customers (e.g. 99.9% uptime per month). Violations may have penalties.
SLO (Service Level Objective)
Internal targets used to measure reliability (e.g. 99.95% success rate over 30 days). More aggressive than SLA.
SLI (Service Level Indicator)
Metrics used to measure the user experience (e.g. request latency, error rate, availability).
Mentality
Senior SREs focus on proactive reliability — preventing incidents before SLA breaches happen. They use SLOs to balance reliability vs velocity.
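To see what these percentages mean in practice, here is a quick worked calculation of the downtime each target allows. It assumes a 30-day window, and the function name is only illustrative:

```python
def allowed_downtime_minutes(target_percent, window_days=30):
    """Minutes of full outage a target tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - target_percent) / 100


for target in (99.9, 99.95):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

A 99.9% SLA leaves about 43.2 minutes per 30 days, while a 99.95% SLO leaves about 21.6. That gap is the safety margin between the internal target and the customer commitment.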
🔧 What They Do to Maintain SLA/SLO/SLI
Define Meaningful SLIs
Examples:
Availability: % of successful HTTP 200s
Latency: % of requests below 200ms
Error Rate: 5xx errors / total requests
Set Realistic but Ambitious SLOs
e.g., 99.9% availability with a 0.1% error budget
Monitor Error Budgets
Use burn rate alerts: how fast you're consuming the error budget
Pause risky deployments if SLOs are at risk
Automate Response
Auto-scaling, self-healing (e.g., Kubernetes probes + restarts)
Use runbooks and incident workflows
Postmortems & Root Cause Analysis
Document each SLA breach with impact, root cause, lessons learned, and action items
Blameless Culture
Focus on system/process improvements over blaming individuals
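The SLI and error budget items above can be sketched in a few lines. This is a simplified illustration over in-memory request samples; the `Request` type and function names are hypothetical, and counting every non-5xx response as a success is an assumption (a real setup would compute these from Prometheus or Datadog queries):

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    latency_ms: float


def slis(requests, latency_threshold_ms=200):
    """Compute availability, latency, and error-rate SLIs as fractions."""
    if not requests:
        raise ValueError("need at least one request sample")
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    fast = sum(1 for r in requests if r.latency_ms < latency_threshold_ms)
    return {
        "availability": (total - errors) / total,
        "latency_ok": fast / total,
        "error_rate": errors / total,
    }


def burn_rate(error_rate, slo_target):
    """Observed error rate divided by the error budget (1 - SLO).

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    return error_rate / (1 - slo_target)


samples = [Request(200, 50.0)] * 998 + [Request(500, 300.0)] * 2
measured = slis(samples)
print(measured)
print(f"burn rate: {burn_rate(measured['error_rate'], 0.999):.1f}")
```

Here a 0.2% error rate against a 99.9% SLO burns the budget about twice as fast as it can be sustained, which is the kind of signal a burn rate alert would page on.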
📚 What Documentation They Read / Create
✅ SLI/SLO Definitions
Describes how metrics are measured, where they're collected, thresholds
✅ SLAs
External/customer-facing agreements (usually from product/engineering leadership)
✅ Runbooks
Step-by-step incident recovery instructions
✅ Playbooks
Automated/templated responses to known issues
✅ Postmortems
Detailed analysis of production incidents
✅ Capacity Plans
Forecasting system load and scaling needs
✅ On-Call Handover Docs
Shift responsibilities and open issues summary
📊 How They Make Dashboards in Monitoring Tools (Grafana, Datadog, etc.)
Choose SLIs to Visualize
Request success rate
95th/99th percentile latency
Uptime per region
SLO compliance vs error budget
Use Tools Like
Grafana (with Prometheus, Loki)
Datadog (SLO dashboards)
New Relic, Splunk, CloudWatch
Best Practices
Color-code thresholds (green/yellow/red)
Add burn rate panels for error budgets
Show time ranges: last 1h, 24h, 7d
Include annotations for deployments, incidents
🕵️ What Senior SREs Monitor
🔗 Frontend
Page load time, JS errors, API failures
🌐 API Layer
Request rate, error rate, latency
📦 Services/Microservices
Health checks, pod restarts, memory leaks
🛠️ Infrastructure
CPU, memory, disk, node health, auto-scaler behavior
🔐 Security
Auth failures, rate limiting, firewall logs
📈 Business Metrics
Signup rate, payment errors (when reliability affects revenue)
⚠️ Alerts/Incidents
Latency spikes, high error rates, queue backlogs
🧠 Thought Process in Short
“Is the user happy? Are we delivering what we promised? If not, are we detecting and fixing it before it causes real business pain?”
They think in terms of user experience, failure domains, proactive monitoring, and always work to balance reliability with delivery speed.
If you're building a monitoring setup or defining SLIs/SLOs for your current stack (AWS + K8s + Node + React), I can help you design real examples tailored to your infra. Want that?
incident manager
Minimum 5+ years of hands-on experience in incident management for Cloud & SaaS managed services.
Proven experience in major incident response and resolution management, cross-functional coordination, communication, and stakeholder expectation management.
Proven experience in ITSM tool monitoring and management for incident, problem, and SLA management.
Technical analysis: ability to analyze logs, error messages, and technical reports, identify the root cause (RCA), and recommend remediation.
Drive the team towards key support KPIs and SLA compliance.
Key certifications: ITIL Certified /Fundamental Technical certification in Cloud and SaaS Technologies.
Key KPI Demonstrations:
MTTR Reduction
Recurring incident elimination
Incident SLA management
CSAT (Stakeholder expectation Management)
Process improvement (Incident, Problem and SLA Management)
Knowledge management
Soft skills:
Excellent written and verbal communication skills
Ability to cope with Business Ops pressure and lead cross-functional teams towards resolution
Major incidents can happen at any time: ability to cope with time pressure and be available round the clock to handle them
Ability to monitor incident management processes and tools, document them, and recommend continuous improvement.