Here is a list of 25 more technical questions in English, each with a professional answer aligned with the job profile, grouped into strategic categories.
☁️ Cloud & Microsoft Azure Observability
Q1: How do you configure and monitor Azure resources to ensure you are meeting service SLA and SLO targets?
Answer: "I use Azure Monitor Metrics to track Key Performance Indicators (KPIs) like availability and latency. I set up Azure Monitor Alerts based on Service Level Objectives (SLOs), often using error budget burn rates. This allows us to detect if a service is degrading fast enough to threaten our Service Level Agreement (SLA) before the breach actually occurs."
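To make the burn-rate idea in this answer concrete, here is a minimal sketch of the arithmetic. The function name `burn_rate`, the 99.9% target, and the request counts are assumptions for illustration only; they are not Azure Monitor settings.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Return how many times faster than 'sustainable' the error budget is burning.

    1.0 means the budget would be used up exactly at the end of the SLO window;
    higher values mean it runs out sooner.
    """
    error_budget = 1.0 - slo_target  # allowed failure fraction, e.g. 0.001
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0  # no traffic, so no budget was consumed
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


# Hypothetical last-hour numbers: 50 failures out of 10,000 requests, 99.9% SLO.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

An alert rule would then fire when the burn rate stays above a chosen multiple over a chosen window; both values are a team decision rather than a fixed standard.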
Q2: What is the difference between Azure Monitor Logs and Azure Data Explorer (ADX) / Kusto (KQL), and when do you use each?
Answer: "Azure Monitor Logs is built on top of Azure Data Explorer. I use Log Analytics Workspaces with KQL (Kusto Query Language) for standard operational log analysis, troubleshooting, and creating dashboard charts. I would use a standalone Azure Data Explorer cluster if we needed to retain massive volumes of raw telemetry or custom log data for long-term historical analysis at a lower cost."
Q3: How would you monitor an Azure Kubernetes Service (AKS) environment?
Answer: "I would enable Container Insights in Azure Monitor, which collects memory and processor metrics from controllers, nodes, and containers. For deep application observability inside the cluster, I would deploy an OpenTelemetry collector or use Azure Monitor managed service for Prometheus together with Azure Managed Grafana to scrape and visualize application metrics, alongside Fluent Bit for log forwarding."
Q4: What are Azure Resource Health and Azure Service Health, and how do they help an operations team?
Answer: "Azure Service Health notifies us about global Azure service incidents, planned maintenance, or advisories affecting our specific region and subscription. Azure Resource Health diagnoses problems at the individual resource level (e.g., if a specific VM goes down due to an underlying hardware failure). I configure alerts on both to immediately separate infrastructure provider issues from internal application bugs."
Q5: How do you collect logs and metrics from an on-premises server into Azure Log Analytics?
Answer: "I deploy the Azure Monitor Agent (AMA) on the on-premises Windows or Linux server. Then, I configure a Data Collection Rule (DCR) in Azure, defining exactly which Windows Event Logs or Linux Syslog facilities and performance counters should be collected and sent to the Log Analytics Workspace."
🛠️ Observability Architecture & Tooling
Q6: What is OpenTelemetry (OTel), and why is it becoming an industry standard for observability?
Answer: "OpenTelemetry is an open-source observability framework that provides a standardized set of APIs, SDKs, and tooling to generate and export telemetry data (metrics, logs, and traces). It is crucial because it provides vendor neutrality; it prevents vendor lock-in, allowing an organization to change its backend platform (e.g., switching from Datadog to Azure Monitor or Dynatrace) without rewriting the application's instrumentation code."
Q7: Explain the concept of "Synthetic Monitoring" versus "Real User Monitoring" (RUM).
Answer: "Synthetic Monitoring uses simulated scripts to periodically test endpoints, APIs, and user journeys from various global locations to ensure availability and performance proactively. Real User Monitoring (RUM) captures actual telemetry from real human users interacting with the live application. Synthetic is great for baseline availability testing, while RUM is essential for understanding actual user experience and frontend performance."
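A synthetic check like the one described above can be sketched in a few lines. This is only an illustration: `probe` and `http_status` are hypothetical helper names, the URL in the usage comment is a placeholder, and a real setup would more likely use a managed availability test than a hand-written script.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(fetch: Callable[[], int], max_latency_ms: float) -> dict:
    """Run one synthetic check and report status, latency, and pass/fail."""
    start = time.perf_counter()
    try:
        status = fetch()
    except Exception:
        status = None  # connection errors and timeouts count as a failed check
    latency_ms = (time.perf_counter() - start) * 1000
    ok = status is not None and 200 <= status < 400 and latency_ms <= max_latency_ms
    return {"status": status, "latency_ms": latency_ms, "ok": ok}


# Usage (placeholder URL, not executed here):
# result = probe(lambda: http_status("https://example.com/health"), max_latency_ms=500)
```

Passing the fetch step in as a function keeps the check itself easy to test without touching the network.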
Q8: What are "Golden Signals" of monitoring, and how do you apply them?
Answer: "The Four Golden Signals of site reliability engineering (SRE) are:
Latency: The time it takes to service a request.
Traffic: A measure of how much demand is being placed on the system (e.g., HTTP requests per second).
Errors: The rate of requests that fail.
Saturation: How 'full' the service is (e.g., memory utilization or thread pool limits).
I apply them by building primary dashboards around these four metrics for every critical service."
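As a rough illustration of the four signals listed above, the sketch below derives them from one window of request records. The `Request` type, the choice of a 95th-percentile latency, the rule that only 5xx responses count as errors, and the `in_use`/`capacity` inputs (for example, busy threads out of a pool) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, in_use, capacity):
    """Summarize one time window of traffic into the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile; None when the window had no requests.
    p95 = durations[math.ceil(0.95 * len(durations)) - 1] if durations else None
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests) if requests else 0.0,
        "saturation": in_use / capacity,
    }


# Hypothetical 5-second window: eight successful requests and two server errors.
window = [Request(d, 200) for d in range(10, 90, 10)]
window += [Request(90, 500), Request(100, 503)]
print(golden_signals(window, window_seconds=5, in_use=40, capacity=50))
```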
Q9: How do you handle log parsing and structured logging, and why is it important?
Answer: "Structured logging means writing logs in a machine-readable format, usually JSON, instead of plain text strings. This is vital because modern log aggregators can automatically index the fields. It allows me to write fast, efficient queries—such as filtering logs by a specific UserID or HttpStatusCode—without relying on heavy, slow regular expressions (Regex)."
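For example, structured logging with Python's standard `logging` module could look like the sketch below. The `JsonFormatter` class, the logger name `checkout`, and the sample values are assumptions for illustration; the field names `UserID` and `HttpStatusCode` are taken from the answer above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("UserID", "HttpStatusCode"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits a single JSON line that a log aggregator can index field by field.
logger.info("payment failed", extra={"UserID": "u-123", "HttpStatusCode": 502})
```

With the fields indexed, a query can filter on `HttpStatusCode` directly instead of pattern-matching the message text.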
Q10: How do you approach Capacity Planning and Trend Analysis using monitoring tools?
Answer: "I look at historical data over long periods (e.g., 30, 90, or 180 days) using linear regression or predictive baseline features in tools like Azure Monitor or Grafana. By analyzing data growth, disk consumption, and CPU trends alongside business growth projections, I can estimate when a cluster or storage array will run out of resources, allowing us to scale proactively and optimize costs."
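The linear-regression approach mentioned above can be sketched as follows. This is a simplified illustration that assumes one evenly spaced sample per day and roughly linear growth; `days_until_full` and the sample numbers are hypothetical.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Fit a least-squares line to daily usage and project when it hits capacity.

    Returns the number of days after the last sample, or None if usage is flat
    or shrinking.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_gb) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_gb)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope  # day index where the line meets capacity
    return max(0.0, day_full - (n - 1))


# Hypothetical disk growing 2 GB per day on a 200 GB volume.
print(days_until_full([100, 102, 104, 106, 108], capacity_gb=200))
```

A straight line is a coarse model; seasonal workloads or step changes after a release would call for a longer history or a different fit.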
🖥️ Infrastructure, Systems & Applications (APM)
Q11: What is Application Performance Monitoring (APM), and what value does it bring over infrastructure monitoring?
Answer: "Infrastructure monitoring tracks the health of hardware and OS layers (CPU, RAM, Disk). APM digs inside the application runtime. It monitors code execution, library dependencies, external API HTTP calls, and database query executions. APM allows us to pinpoint specific issues, like an unoptimized loop in the backend code or an external API bottleneck, which standard infrastructure metrics cannot reveal."
Q12: How would you debug an application that is generating a high volume of HTTP 5xx errors?
Answer: "I would use our APM tool to filter the incoming web transactions by the 5xx status code. I would inspect the exceptions and stack traces captured during those specific failed requests. Simultaneously, I would correlate the timing with database performance or external dependencies to see if the 5xx errors are a symptom of a downstream timeout."
Q13: If a server is experiencing an "I/O Wait" bottleneck, what does that mean and how do you fix it?
Answer: "High I/O Wait means the CPU is sitting idle because it is waiting for disk read/write operations to complete. It means the storage subsystem is a bottleneck. To troubleshoot, I check disk metrics like Disk Queue Length, IOPS, and Read/Write Latency. Solutions include optimizing database queries to reduce disk hits, adding caching layers (like Redis), upgrading to faster storage (SSD/Premium SSD), or splitting high-I/O workloads onto separate disks."
Q14: How do you monitor memory leaks in a production environment?
Answer: "I track the memory utilization trend over a long period. A memory leak typically shows a steady, staircase-like rise in RAM usage that never drops back to the baseline, even during low-traffic hours, until the process runs out of memory (OOM) and crashes or is restarted. I set up alerts for continuous upward deviation and use APM tools or heap profilers to inspect which objects are retaining memory."
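One simple way to express the "never drops back to the baseline" pattern is to compare the low-water mark of successive time windows. The sketch below is a heuristic under that assumption, not a feature of any particular APM tool; `looks_like_leak` and the sample values are hypothetical.

```python
def looks_like_leak(samples_mb, window, min_windows=3):
    """Return True if the minimum of each consecutive window keeps rising.

    A healthy service returns to roughly the same floor during quiet periods;
    a leaking one has a floor that climbs window after window.
    """
    troughs = [
        min(samples_mb[i : i + window])
        for i in range(0, len(samples_mb) - window + 1, window)
    ]
    if len(troughs) < min_windows:
        return False  # not enough history to judge
    return all(later > earlier for earlier, later in zip(troughs, troughs[1:]))


# Hypothetical readings, two samples per window (quiet hours, then peak hours).
healthy = [500, 800, 520, 790, 510, 805]
leaking = [500, 800, 600, 900, 700, 1000]
print(looks_like_leak(healthy, window=2), looks_like_leak(leaking, window=2))
```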
Q15: What is network jitter, and how do you monitor network performance for critical services?
Answer: "Jitter is the variance in time delay between data packets over a network, which causes instability in real-time applications. To monitor network performance, I track packet loss, latency (RTT - Round Trip Time), bandwidth utilization, and jitter using network monitoring agents or synthetic probes running ICMP/TCP checks between our distributed environments and cloud hubs."
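As a small worked example, jitter can be estimated from a series of round-trip times as the average change between consecutive samples. This is one common simplification and an assumption of this sketch; RFC 3550 defines a smoothed estimator for RTP streams instead.

```python
def jitter_ms(rtts_ms):
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts_ms) < 2:
        return 0.0
    diffs = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical probe results in milliseconds.
samples = [20, 22, 21, 25]
print(f"avg RTT: {sum(samples) / len(samples):.1f} ms, jitter: {jitter_ms(samples):.2f} ms")
```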
🗄️ Database & Storage Monitoring
Q16: How do you identify a SQL injection or anomalous activity at the database monitoring level?
Answer: "I monitor database logs and metrics for unusual query patterns. A sudden spike in failed authentication attempts, an unexpected surge in the volume of queries executed, execution of unusual administrative commands (like DROP or ALTER), or a massive spike in data egress from specific tables can all trigger automated anomaly detection alerts indicating a potential attack."
Q17: What metric indicates that a database instance needs more RAM?
Answer: "The most reliable metric is Buffer Pool Cache Hit Ratio (or Buffer Cache Hit Ratio). It measures how often the database finds data pages in memory versus having to read them from the slow disk. If this ratio drops significantly (e.g., below 95% for heavy OLTP workloads) combined with high disk read operations, it strongly indicates that the database requires more RAM to keep active data cached."
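The ratio itself is simple arithmetic. The sketch below assumes two counters sampled over the same interval: total page requests (logical reads) and the subset that had to be read from disk (physical reads). Counter names and exact definitions differ between database engines, so treat `cache_hit_ratio` and the numbers as illustrative.

```python
def cache_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Fraction of page requests served from memory rather than disk."""
    if logical_reads == 0:
        return 1.0  # nothing was requested, so nothing missed the cache
    return (logical_reads - physical_reads) / logical_reads


# Hypothetical interval: 1,000,000 page requests, 80,000 of them hit the disk.
ratio = cache_hit_ratio(1_000_000, 80_000)
print(f"hit ratio: {ratio:.1%}, below 95% target: {ratio < 0.95}")
```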
Q18: How do you monitor and handle replication lag in a high-availability database cluster?
Answer: "I monitor metrics like Replication Lag (in seconds) or Log Bytes Flushed/Received between primary and secondary nodes. If replication lag begins to rise, it means the secondary node cannot keep up with the write volume of the primary. I configure alerts on a lag threshold because high replication lag risks data loss during an automated failover and results in stale reads on read-heavy secondary replicas."
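A threshold alert on replication lag is usually less noisy if it requires the breach to be sustained. The sketch below shows that idea; the function name `lag_alert`, the 10-second threshold, and the three-sample rule are assumptions for the example.

```python
def lag_alert(lag_seconds, threshold, consecutive=3):
    """Return True once lag exceeds the threshold for N samples in a row."""
    streak = 0
    for lag in lag_seconds:
        streak = streak + 1 if lag > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Hypothetical lag samples in seconds: a brief spike, then a sustained breach.
print(lag_alert([1, 12, 2, 11, 13, 15], threshold=10))
```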
🔒 Security, Vulnerabilities & Compliance
Q19: How can an observability platform help detect a Distributed Denial of Service (DDoS) attack?
Answer: "A DDoS attack shows a massive, sudden surge in incoming traffic metrics (requests per second or network bandwidth) often accompanied by a spike in HTTP 4xx/5xx errors as servers become saturated. By looking at geographical traffic metrics, connection rates per client IP, and firewall logs through our dashboards, we can quickly identify the pattern and collaborate with the security team to enable DDoS mitigation policies (like Azure DDoS Protection or Cloudflare)."
Q20: Why is log rotation important, and how do you ensure security logs are not lost during this process?
Answer: "Log rotation prevents disk space exhaustion by compressing and archiving old logs. To ensure security and compliance logs are not lost, I ensure they are immediately streamed in near real-time to a centralized, write-once-read-many (WORM) storage or a dedicated log management space (like Azure Log Analytics) before local rotation occurs. Local log files are retained for a safe buffer period before deletion."
Q21: How do you ensure that Sensitive Data (PII) is not captured in your logging and tracing tools?
Answer: "Capturing PII (Personally Identifiable Information like credit cards or passwords) violates compliance frameworks (GDPR, LGPD, PCI-DSS). I implement masking and filtering at multiple levels: enforcing coding standards for developers to avoid logging object dumps, configuring log shippers (like Fluent Bit or Logstash) to use regex-based masking to redact sensitive patterns, and using APM data-masking rules before telemetry leaves the production environment."
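The regex-based masking step can be sketched in Python as below. The two patterns are deliberately simple assumptions for illustration: they will miss some formats and can over-match other long digit runs, so a real deployment would rely on the log shipper's or APM tool's own masking rules and test them against real samples.

```python
import re

# Illustrative patterns only: a basic email shape and 13 to 16 digit card numbers
# with optional spaces or hyphens between digits.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(line: str) -> str:
    """Replace sensitive-looking substrings before the line is shipped."""
    line = EMAIL.sub("[EMAIL]", line)
    line = CARD.sub("[CARD]", line)
    return line


print(redact("user bob@example.com paid with 4111 1111 1111 1111 ok"))
```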
🔄 ITSM, Automation & DevOps (ITIL)
Q22: What is the difference between a Workaround and a Known Error in ITIL Problem Management?
Answer: "A Workaround is a temporary way to restore service to users during an incident without fixing the underlying cause (e.g., restarting a service or failing over). A Known Error is a problem that has a documented root cause and a workaround, but a permanent fix has not yet been deployed (often logged in a Known Error Database - KEDB). As an analyst, I use KEDBs to quickly resolve recurring incidents using known workarounds."
Q23: How do you implement "Observability as Code" within a CI/CD pipeline?
Answer: "Observability as Code means defining dashboards, alert thresholds, and notification channels using Infrastructure as Code (IaC) tools like Terraform or Azure Bicep. When developers deploy new services via CI/CD pipelines, the monitoring resources are automatically provisioned and updated along with the infrastructure. This ensures no resource is deployed into production without monitoring."
Q24: What is a Post-Mortem / Blameless Post-Mortem, and what is your role in it?
Answer: "A blameless post-mortem is a meeting held after a major incident to understand how the system failed, without blaming individuals. My role as an analyst is to provide objective data: timeline graphs, logs, and trace data showing exactly when the system degraded, what alerts triggered, and how long containment took. We use this data to identify systemic gaps in architecture, automation, or visibility."
Q25: How do you use automated remediation or healing scripts alongside your monitoring alerts?
Answer: "For well-understood, predictable incidents—like a specific service crashing due to an unavoidable vendor bug—I configure monitoring alerts to trigger an automated action instead of paging a human. In Azure, this means linking an alert to an Azure Automation Runbook or an Azure Function that safely executes a script (e.g., clearing a temporary cache or restarting a service container), verifying recovery, and updating the incident ticket automatically."
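The remediate-then-verify flow described above can be sketched generically. Here `action` and `is_healthy` stand in for whatever the runbook actually does (restart a container, clear a cache, call a health endpoint); they are hypothetical callables, and nothing in this sketch is an Azure Automation API.

```python
import time
from typing import Callable


def remediate(
    action: Callable[[], None],
    is_healthy: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Run a remediation action and verify recovery, retrying a few times.

    Returns True when the health check passes, or False when every attempt
    failed and the incident should be escalated to a human.
    """
    for _ in range(attempts):
        action()
        time.sleep(wait_seconds)  # give the service time to come back
        if is_healthy():
            return True
    return False


# Hypothetical service that only recovers after the second restart.
state = {"restarts": 0}


def restart_service() -> None:
    state["restarts"] += 1


recovered = remediate(restart_service, lambda: state["restarts"] >= 2)
print("recovered" if recovered else "escalate to on-call")
```

Capping the number of attempts matters: a script that restarts a service forever can hide a real outage instead of resolving it.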
🛠️ Section 1: Monitoring & Observability Fundamentals
Q1: What is the main difference between Monitoring and Observability, and how do the "Three Pillars" fit into this?
Answer: "Monitoring tells you when something is wrong by tracking predefined metrics (e.g., 'CPU usage is above 90%'). It's reactive. Observability, on the other hand, allows you to understand why something is wrong by looking at the internal state of a system based on its external outputs. It is proactive and relies on the Three Pillars:
Metrics: Numeric data measured over intervals (e.g., memory usage, request counts) to detect trends.
Logs: Timestamped records of discrete events (e.g., error messages, system audits) to provide context.
Traces: The end-to-end journey of a request through distributed systems, crucial for finding bottlenecks in microservices."
Q2: How do you avoid "alert fatigue" when configuring thresholds and alerts for infrastructure?
Answer: "To prevent alert fatigue, I follow a few key best practices:
Focus on Symptoms over Causes: Instead of alerting on high CPU (cause), alert on high latency or error rates (symptom affecting the user).
Use Dynamic Thresholds: Utilize anomaly detection and baseline behavior rather than static numbers, especially for workloads with seasonal peaks.
Actionable Alerts: Every alert must be actionable. If an engineer receives an alert and doesn't need to take immediate action, it should be a warning log or a daily report, not a page.
Tiered Alerting: Route high-priority alerts (P1/P2) to paging systems (like PagerDuty) and low-priority alerts to Slack/Teams."
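To illustrate the dynamic-threshold practice from the list above, here is a minimal sketch that flags a value only when it sits well above the recent baseline. The mean-plus-k-standard-deviations rule, the default `k=3.0`, and the sample data are assumptions; managed anomaly detection features use more elaborate models.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Flag `current` if it exceeds the baseline by more than k standard deviations."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + k * spread


# Hypothetical latency baseline in milliseconds.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 104), is_anomalous(recent, 110))
```

Because the threshold follows the baseline, the same rule works for a quiet service and a busy one without hand-tuning a static number for each.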
☁️ Section 2: Microsoft Azure & Cloud Monitoring
Q3: Which native Azure tools would you use to implement a complete observability strategy for a hybrid application?
Answer: "I would leverage the Azure Monitor ecosystem:
Azure Monitor Logs (Log Analytics Workspaces): To centralize logs from both Azure resources and on-premises servers (via the Azure Monitor Agent).
Application Insights: To monitor application performance (APM), tracking live metrics, dependencies, and traces.
Azure Monitor Metrics: For real-time infrastructure performance data.
Azure Workbooks: To create unified, interactive dashboards for visualization across different subscriptions and hybrid environments."
Q4: How would you monitor an application hosted in Azure that is experiencing intermittent latency, and how do you find the root cause?
Answer: "I would use Application Insights and look at the Application Map to see the dependencies and where the delay is happening (e.g., a slow database query or a third-party API). I would then dive into End-to-End Transaction Details to see the distributed trace of the slow requests. If the application itself is fine, I would check Azure Monitor Metrics for the underlying infrastructure (like Azure App Services or VMs) to see if there is CPU throttling or SNAT port exhaustion."
🖥️ Section 3: Servers, Applications, & Databases
Q5: If a Linux or Windows server is showing 100% CPU utilization, what is your step-by-step troubleshooting process?
Answer: "First, I look at the monitoring dashboard to see when the spike started and if it correlates with a new deployment, a cron job, or a traffic spike.
For Linux: I would SSH in and use commands like top or htop to identify the specific process consuming the CPU. I'd also check iostat to ensure it's not a CPU wait issue due to slow disk I/O.
For Windows: I would use Task Manager or Get-Process in PowerShell.
Once the process is identified, I check the application logs around that timestamp to understand what the process was executing, and notify the responsible DevOps/Development team with the evidence."
Q6: How do you monitor database health, and what metrics indicate a performance degradation?
Answer: "Database monitoring requires looking at both OS-level and database-level metrics. The key metrics I track are:
CPU and Memory Utilization: Sustained high CPU often points to missing indexes or inefficient queries, while memory pressure can mean the cache is too small for the working set.
Active Connections: To ensure the application connection pool isn't exhausted.
Query Latency / Long-Running Queries: To find unoptimized SQL queries.
Deadlocks and Lock Wait Time: High wait times indicate queries are blocking each other.
IOPS (Input/Output Operations Per Second): To ensure we aren't hitting storage IOPS or throughput limits."
🔒 Section 4: Vulnerabilities & ITSM/ITIL Process
Q7: What is the role of an Observability Analyst regarding security and vulnerability management?
Answer: "While I am not a dedicated security engineer, observability is critical for DevSecOps. I can support security by:
Log Auditing: Ensuring authentication logs, firewall logs, and system events are collected in a SIEM or central log repository to detect brute-force attacks or unauthorized access.
Vulnerability Alerts: Monitoring system patch levels and integrating vulnerability scanner alerts (like Microsoft Defender for Cloud or Qualys) into operational dashboards.
Anomaly Detection: Setting up alerts for unusual traffic spikes, unexpected outbound connections, or massive data transfers which could indicate a breach or data exfiltration."
Q8: Imagine a critical service goes down (P1 Incident). Walk me through your actions following ITIL best practices.
Answer: "1. Identification & Logging: The monitoring system triggers a critical alert. I verify the impact and ensure an incident ticket is created.
2. Containment & Restoration (Incident Management): The primary goal is to restore the service as fast as possible. I join the war room, share metrics/logs with the infrastructure and dev teams, and support actions like restarting services or failing over to a backup region.
3. Communication: Keep stakeholders updated on the restoration progress based on SLAs.
4. Problem Management (RCA): Once the service is stable, I participate in the Root Cause Analysis (RCA). I provide the historical logs and metrics to find why it happened, and create new monitoring rules or automation to prevent it from happening again."