Causality Based Instant Root Cause Analysis for Microservices Failure (2024)

By Mohamed Sharafath M, Comcast India Engineering Center; Praveen Manoharan, Comcast India Engineering Center; Aravindakumar Venugopalan, Comcast India Engineering Center

In modern distributed systems, the complexity and scale of operations make it difficult to identify the root causes of system failures [1]. Traditional root cause analysis often falls short in such systems, especially when it relies only on metrics or log data. The sheer volume of data makes manual tracing and debugging impractical under time pressure. The inherent limitations of isolated data sources often result in prolonged downtime, increased operational costs, and degraded system performance. Our proposed solution automates the construction of microservice dependencies by applying causal discovery techniques to multivariate time-series data. With an increasing focus on explainability in many domains, causal inference has attracted much attention in industry [2]. In this paper, we treat a fault in a microservice as an intervention in the causal-inference sense. Bayesian causal inference algorithms [3] are applied at each level of the constructed dependency graph tree, which enables swift identification of the likely root cause path of a microservice failure. Such prompt analysis empowers site reliability engineers (SREs) to make informed, data-driven decisions. We also discuss how implementing causality-based instant Root Cause Analysis (RCA) methods in AI for Information Technology Operations (AIOps) platforms improves reliability by making triage more efficient and reducing Mean Time to Repair (MTTR).
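The level-by-level search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the dependency tree has already been built, and it swaps in a simple z-score of the post-fault metric shift as a stand-in for the Bayesian causal inference step. The service names, sample values, the `threshold` parameter, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def shift_score(before, after):
    """Absolute z-score of the post-fault mean against the pre-fault baseline.

    Assumption: a stand-in for the Bayesian causal inference step, not the paper's method.
    """
    spread = pstdev(before) or 1e-9  # guard against a perfectly flat baseline
    return abs(mean(after) - mean(before)) / spread


def root_cause_path(graph, metrics, entry, threshold=3.0):
    """Descend the dependency tree from `entry`, following the most-shifted child at each level.

    `graph` maps a service to the services it calls (assumed to be a tree, so no cycles).
    `metrics` maps a service to a (pre_fault_samples, post_fault_samples) pair.
    """
    path = [entry]
    node = entry
    while True:
        scored = [(shift_score(*metrics[child]), child) for child in graph.get(node, [])]
        if not scored:
            return path  # reached a leaf service
        score, child = max(scored)
        if score < threshold:
            return path  # no child shifted enough to explain the fault
        path.append(child)
        node = child


# Hypothetical latency samples in milliseconds, before and after the fault.
graph = {"gateway": ["orders", "search"], "orders": ["payments", "inventory"]}
metrics = {
    "orders": ([100, 102, 98, 101, 99], [180, 175, 190]),
    "search": ([50, 52, 49, 51, 48], [51, 50, 49]),
    "payments": ([80, 82, 79, 81, 78], [240, 250, 245]),
    "inventory": ([30, 31, 29, 30, 30], [30, 31, 30]),
}
print(root_cause_path(graph, metrics, "gateway"))
```

With these made-up samples the walk follows `gateway`, then `orders`, then `payments`, because those are the services whose latency moved furthest from its baseline at each level. A real system would replace `shift_score` with a proper causal effect estimate.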

