Root Cause Analysis of Outliers with Missing Structural Knowledge (2025)

Nastaran Okati (Max Planck Institute for Software Systems; correspondence to: nastaran@mpi-sws.org)
Sergio Hernan Garrido Mejia (Max Planck Institute for Intelligent Systems)
William Roy Orchard (University of Cambridge)
Patrick Blöbaum (Amazon Research)
Dominik Janzing (Amazon Research)

Abstract

Recent work conceptualized root cause analysis (RCA) of anomalies via quantitative contribution analysis using causal counterfactuals in structural causal models (SCMs). The framework comes with three practical challenges: (1) it requires the causal directed acyclic graph (DAG) together with an SCM; (2) it is statistically ill-posed, since it probes regression models in regions of low probability density; (3) it relies on Shapley values, which are computationally expensive.

In this paper, we propose simplified, efficient methods of root cause analysis for the case where the task is to identify a unique root cause rather than to perform a quantitative contribution analysis. Our proposed methods run in time linear in the number of SCM nodes and require only the causal DAG, without counterfactuals. Furthermore, for use cases where the causal DAG is unknown, we justify the heuristic of identifying the root cause as the variable with the highest anomaly score.

1 Introduction

Detecting anomalies in a large number of variables has become a major effort in science and business applications, ranging from meteorology [1], medical health monitoring [2, 3], monitoring of industrial fabrication [4], fraud detection [5], credit scoring [6], and cloud applications [7, 8, 9, 10, 11, 12] to more [13, 14, 15, 16, 17]. Oftentimes it is not only important to detect anomalies but also to answer the question of why an anomaly was observed, or in other words, to find its root cause. This is important as it provides insight into how to mitigate the issue manifested in the anomaly, and is hence more actionable. A series of previous works use statistical correlations between features to explain the anomalous value of the target variable [18, 19, 20, 21, 22]. The major drawback of such approaches is that correlations between the anomalous variables and the target variable do not necessarily imply causation. Furthermore, such correlation-based approaches neglect the fact that very small perturbations in some variables can cause extreme anomalies in others. Hence, recent works conceptualized RCA using a causal framework [23, 10, 24, 25, 26]. From a causal perspective, finding causal relations between anomalies that are simultaneously or consecutively observed in different variables is an important step in identifying the root cause, i.e., answering the question of which anomaly caused the other.

Related work. We can categorize the related work into two groups: 1) those that require the causal DAG to find the root cause, and 2) those that do not require the causal DAG or try to infer it from data. Considering the first line of work, our work builds on [23] in terms of the definition of root cause; the authors use a framework based on causal graphical models and utilize Shapley values to quantify the contribution of each variable to the extremeness of the target variable. [27, 28, 9] use a traversal-based approach that identifies a node as a root cause under two conditions: 1) none of its parent nodes exhibits anomalous behavior, and 2) it is linked to the anomalous target node through a path of anomalous nodes. [29] apply such a traversal algorithm to data from cloud computing to find the root cause of performance drops in dependent services [12, 30, 31]. The main idea behind the second line of work is that if the DAG is unknown, the outliers themselves can support causal discovery. In a scenario where several data points are available from the anomalous state (which we do not assume here), finding the root cause can also be modeled as the problem of identifying the intervention target of a soft intervention [32]. Likewise, [33, 34] describe how to use heavy-tailed distributions to infer the causal DAG for linear models. In a sense, this also amounts to using outliers for causal discovery, if one counts points in the tail as outliers.

Our contributions. This paper tries to mitigate the practical challenges of RCA under the assumption that the task is restricted to identifying a unique root cause whose contribution dominates the contributions of the other nodes, rather than quantitative contribution analysis. To this end, we relax the assumptions on the available causal information step by step. In Section 3, we first explain that interventional probabilities are sufficient and we do not need to rely on structural-equation-based counterfactuals (rung 2 vs. rung 3 in the ladder of causation [35]). This finding enables us to design the Topologically Ordered Contribution Analysis (TOCA) algorithm. In Section 4 we discuss how to avoid the statistical problem of estimating interventional probabilities and propose SMOOTH TRAVERSAL, an algorithm that inputs only the causal graph, and SCORE ORDERING, an algorithm that is based only on the ordering of the anomaly scores and hence does not even need the causal graph. All our methods avoid the combinatorial explosion of terms in Shapley-value-based RCA. Section 5 compares the approaches on simulated and real-world data.

2 Formal Framework for Root Cause Analysis

We first sketch the causal framework on which our method and analysis are based [36, 35]. Consider a system of $n$ random variables $(X_1,\dots,X_n) := \mathbf{X}$ and an underlying DAG encoding the causal relations between them, in the sense that a directed edge from $X_i$ to $X_j$ indicates that $X_i$ directly causes $X_j$. Furthermore, we are given an SCM with structural equations

$$X_i := f_i(Pa_i, N_i), \qquad (1)$$

which indicates that each variable $X_i$ is a function of its parents $Pa_i$ together with a noise variable $N_i$, where $(N_1,\ldots,N_n) := \mathbf{N}$ are jointly independent. We always assume $X_n$ to be the sink node and the target of interest, whose anomaly is supposed to be explained via root causes further upstream. Iterative use of Eq. (1) results in the representation

$$X_n = F(N_1,\dots,N_n), \qquad (2)$$

in which the structural causal information is implicit in the function $F$.

Consider the case where the value $x_n$ of the target variable $X_n$ is flagged as an outlier, and we wish to find the unique root cause of this outlier among the variables $(X_1,\ldots,X_n)$. By Eq. (2) we have

$$x_n = F(n_1,\dots,n_n),$$

where $(n_1,\dots,n_n) := \mathbf{n}$ denote the corresponding values of the noise variables. This representation allows us to consider the contribution of each variable $X_i$ to the outlier value $x_n$ of the target variable as the contribution of its noise term [23, 38]. (Throughout the paper, we assume the SCM is invertible, so that the noise value can be recovered from the observed value of a variable together with that of its parents [37]; note that [38] does not propose an algorithm for root cause analysis.) The idea of attributing anomalies to noise terms is justified by the perspective that each $n_j$ switches between deterministic mechanisms $f_j(\cdot, n_j)$, the "response functions" [39], and the goal of RCA is to identify the "corrupted mechanisms" which did not work as usual, i.e., the values $x_j$ that do not follow the SCM $x_j = f_j(pa_j, n_j)$ with a value $n_j$ in a normal range. In other words, the recursive formulation allows us to attribute the extremeness of $x_n$ to a corrupted noise term at one of the upstream nodes. As a result, the framework is based on the intuition that if $X_j$ is the unique root cause, then replacing $n_j$ with a normal value would, with high probability, change $x_n$ to a non-outlier.
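To make the invertibility assumption concrete, here is a minimal sketch (our illustration, not the authors' implementation) of noise recovery in an additive-noise SCM, where $N_j = X_j - f_j(Pa_j)$; the chain and mechanisms below are hypothetical:

```python
# Minimal sketch (not the authors' code): recovering noise values in an
# invertible SCM. We assume additive noise, X_j = f_j(Pa_j) + N_j, so the
# noise is recovered as n_j = x_j - f_j(pa_j).

# Hypothetical chain X1 -> X2 -> X3 with known (or fitted) mechanisms.
def f2(x1):  # assumed mechanism of X2
    return 2.0 * x1

def f3(x2):  # assumed mechanism of X3
    return 0.5 * x2

# One observed sample (x1, x2, x3).
x1, x2, x3 = 1.2, 2.9, 4.1

# Invert the structural equations node by node (topological order).
n1 = x1              # root node: X1 = N1
n2 = x2 - f2(x1)     # N2 = X2 - f2(X1)
n3 = x3 - f3(x2)     # N3 = X3 - f3(X2)
print(n1, n2, n3)    # recovered noise values (1.2, 0.5, 2.65)
```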

2.1 Anomaly Score

We next define an outlier score which we use in the following sections for our method and analysis. Let $g: \mathcal{X} \rightarrow \mathbb{R}$ be an appropriate feature map. This feature map can be any existing outlier score function mapping elements of $\mathcal{X}$ to real values, for example the z-score (the z-score of a sample $x$ is $z = \frac{x-\mu}{\sigma}$, where $\mu$ is the population mean and $\sigma$ the population standard deviation). Further, define the event

$$E := \{g(X_n) \geq g(x_n)\}, \qquad (3)$$

of being at least as extreme as the observed event $x_n$ according to the feature function $g$. From the point of view of statistical testing, $g$ can be seen as a test statistic for the null hypothesis that $x_n$ is drawn from $P(X_n)$, i.e., $H_0: x_n \sim P(X_n)$, and $P(g(X_n) \geq g(x_n)) = P(E)$ is the corresponding p-value. Note that a small $P(E)$ indicates that the null hypothesis is rejected with high confidence (this interpretation is certainly invalid in a standard anomaly detection scenario where Eq. (3) is repeatedly computed for every observation in a sequence; however, so-called anytime p-values [40] are outside our scope), and we rather assume that the usual mechanism generating samples from $P(X_n)$ has been corrupted for that specific statistical unit. Since small p-values correspond to strong anomalies, we define the anomaly score by

$$S(x_n) = -\log P(g(X_n) \geq g(x_n)) = -\log P(E). \qquad (4)$$

Since the negative logarithm of the probability of an event measures its "surprise" in information theory [13, 15], as in [23] we call anomaly scores with the above calibration "IT scores", for information-theoretic. Given $k$ observations $x_n^1,\dots,x_n^k$ from $P(X_n)$, we will often use the following simple estimator:

$$\hat{S}(x_n^k) := -\log \frac{1}{k}\,\big|\{\, i \leq k \;:\; g(x_n^i) \geq g(x_n^k) \,\}\big|. \qquad (5)$$

It is important to note that this value can be at most $\log k$. As a result, an estimate $\hat{S}(x_n)$ must be based on at least $e^{\hat{S}(x_n)}$ data samples.
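As an illustration, a minimal implementation of the estimator in Eq. (5), assuming the identity feature map $g(x) = x$ (all names are our own):

```python
import numpy as np

def it_score(samples, x, g=lambda v: v):
    """Eq. (5): minus log of the empirical tail probability P(g(X) >= g(x))."""
    tail = np.mean(g(np.asarray(samples)) >= g(x))
    tail = max(tail, 1.0 / len(samples))   # caps the estimate at log k
    return -np.log(tail)

rng = np.random.default_rng(1)
normal_data = rng.normal(size=10_000)
print(it_score(normal_data, 3.5))   # strong outlier -> large score
print(it_score(normal_data, 0.1))   # typical value  -> score near 0
```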

2.2 Contribution Analysis

To compute the contribution of each noise $N_i$ to the anomaly score of the outlier $x_n$, [23] measure how replacing the observed value $n_i$ (originating from a potentially corrupted mechanism) with a normal value changes the likelihood of the outlier event. Intuitively, this shows us the extent to which $n_i$ has been responsible for the extremeness of $x_n$. This change of probability, however, should be measured given different contexts, with each context being the set of variables that have already been changed to a normal value. To formalize this, for any subset of the index set, $\mathcal{S} \subseteq \mathcal{U} := \{1,\ldots,n\}$, we first define the probability of the outlier event when all nodes in $\bar{\mathcal{S}} = \mathcal{U} \setminus \mathcal{S}$ are randomized according to their joint distribution, and only those in $\mathcal{S}$ are kept fixed, i.e.,

$$q(\mathcal{S}) := P(E \,|\, \mathbf{N}_{\mathcal{S}} = \mathbf{n}_{\mathcal{S}}). \qquad (6)$$

Now the contribution of a node $j$ given the context $\bar{\mathcal{S}}$ is defined as

$$C(j \,|\, \bar{\mathcal{S}}) := \log \frac{q(\mathcal{S} \cup \{j\})}{q(\mathcal{S})}. \qquad (7)$$

Let $\Pi$ denote the set of all permutations of $\mathcal{U}$ and let $\pi \in \Pi$ be any permutation. One then defines the contribution of a node $j$ given the permutation $\pi$ as

$$C^{\pi}(j) := C(j \,|\, I^{\pi>j}), \qquad (8)$$

where $I^{\pi>j}$ denotes the set of nodes that appear after $j$ with respect to $\pi$, i.e., $I^{\pi>j} = \{i \in \mathcal{U} \,|\, \pi_i > \pi_j\}$ (and analogously $I^{\pi<j} = \{i \in \mathcal{U} \,|\, \pi_i < \pi_j\}$). One can easily see that for each permutation $\pi$, $S(x_n)$ decomposes into the contributions of the individual nodes:

$$S(x_n) = \sum_{j \in \{1,\ldots,n\}} C^{\pi}(j).$$

To symmetrize over all possible permutations in $\Pi$, the Shapley contribution is calculated as

$$C^{Sh}(j) := \frac{1}{n!} \sum_{\pi \in \Pi} C(j \,|\, I^{\pi>j}), \qquad (9)$$

which is certainly expensive to compute [23]. Here, however, motivated by the fact that in most real-world scenarios there is a single corrupted factor causing anomalies, we try to find the unique node $j$ with a corrupted mechanism, i.e., the unique root cause of the outlier $x_n$. As our first result, we show that in this scenario it is unlikely that nodes other than $j$ get a significant contribution. Note that this directly allows us to identify the root cause without quantitatively computing the contribution of every node to the outlier value of the target variable $X_n$, which saves us exponentially many computations. In addition to the computational load, the symmetrization suffers from a more fundamental issue: for most permutations $\pi$, the contributions $C^{\pi}(j)$ rely on structural equations (rung 3 causal information according to the ladder of causation in [35]), while some $C^{\pi}(\cdot)$ rely only on the graphical model (rung 2), as we explain in the next subsection.
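For concreteness, the following is a minimal Monte Carlo sketch of Eq. (9) that samples permutations rather than enumerating all $n!$ of them; it assumes a black-box estimator `q` of the probabilities in Eq. (6), and all names are our own illustration:

```python
import numpy as np

def shapley_contributions(q, n, num_perms=200, rng=None):
    """Monte Carlo estimate of Eq. (9); q maps a frozenset of fixed nodes
    to the outlier probability P(E | N_S = n_S) as in Eq. (6)."""
    rng = rng or np.random.default_rng(0)
    contrib = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        for pos, j in enumerate(perm):
            fixed = frozenset(perm[:pos])   # nodes before j stay fixed
            # C^pi(j) = log q(fixed ∪ {j}) - log q(fixed), cf. Eqs. (7)-(8)
            contrib[j] += np.log(q(fixed | {j})) - np.log(q(fixed))
    return contrib / num_perms
```

Summing the per-permutation contributions telescopes to $\log q(\mathcal{U}) - \log q(\emptyset) = S(x_n)$, which matches the decomposition above.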

2.3 Interventional vs Counterfactual RCA

Since the SCM cannot be inferred from observational and interventional data, relying on the SCM is a serious bottleneck for RCA. Fortunately, not every term in Eq. (9) requires the SCM. We explain this through an example for the bivariate causal relation $X_1 \to X_2$: given the SCM $X_2 = f(X_1, N_2)$ and $X_1 = N_1$, randomizing $N_2$ and fixing $N_1 = n_1 = x_1$ generates $X_2$ according to the observational conditional $P(X_2 \,|\, x_1)$ (which no longer relies on the SCM), while randomizing $N_1$ and fixing $N_2 = n_2$ cannot be resolved into any observational term (see also Section 5 in [41]). (This can be checked by an SCM with two binaries where $X_2 = X_1 \oplus N_2$ with unbiased $N_2$, which induces the same distribution as the SCM $X_2 = N_2$, in which $X_1$ and $X_2$ are disconnected and thus $X_1$ does not get any contribution.) The following result generalizes this insight:
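The parenthetical XOR example can be checked numerically; the sketch below (our own illustration) draws from both SCMs and confirms that they induce the same observational distribution while disagreeing counterfactually:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 100_000
x1 = rng.integers(0, 2, k)
n2 = rng.integers(0, 2, k)     # unbiased noise

x2_a = x1 ^ n2                 # SCM A: X2 = X1 XOR N2
x2_b = n2                      # SCM B: X2 = N2, X1 disconnected

# Same observational distribution: X2 is Bernoulli(1/2), independent of X1.
print(x2_a.mean(), x2_b.mean())          # both ~0.5
print(np.corrcoef(x1, x2_a)[0, 1])       # ~0: X1 and X2 uncorrelated in SCM A

# Counterfactually the SCMs differ: holding n2 fixed, flipping x1 flips x2
# under SCM A but leaves it unchanged under SCM B. Hence rung-3 contributions
# are not identified by the observational distribution.
```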

Proposition 2.1

Whenever $\pi$ is a topological order of the causal DAG, i.e., there are no arrows $X_i \to X_j$ for $\pi(i) > \pi(j)$, all contributions $C^{\pi}(j)$ can be computed from observational conditional distributions.

All proofs can be found in Appendix A. While [23] introduced the Shapley contribution in Eq. (9) for a quantitative contribution analysis, we are interested in finding the unique root cause whenever it exists. In other words, we are not interested in a "fair" quantitative attribution, since we assume that one node dominates anyway, regardless of which order we choose.

3 Simplified contribution analysis without counterfactuals

While we can never exclude the case where several mechanisms are corrupted, we always start with the working hypothesis that there has been only one. If this does not work, we can still try out hypotheses with more than one, but with the strong inductive bias of preferring explanations where most mechanisms worked as expected, in agreement with the so-called "sparse mechanism shift hypothesis" in [42, 43].

To formalize the assumption of a unique root cause, we first generalize the notion of the contribution of a node to the contribution of a set $\mathcal{R} \subseteq \mathcal{U}$ of nodes given a context $\bar{\mathcal{S}}$, i.e.,

$$C(\mathcal{R} \,|\, \bar{\mathcal{S}}) := \log \frac{q(\mathcal{S} \cup \mathcal{R})}{q(\mathcal{S})}. \qquad (10)$$

This notion becomes rather intuitive after observing that it is given by the sum of the contributions of all elements of $\mathcal{R}$ when they are added one by one to the context $\bar{\mathcal{S}}$:

Lemma 3.1

For any set $\mathcal{R} = \{j_1,\ldots,j_k\} \subseteq \mathcal{U} \setminus \mathcal{S}$, it holds that $C(\mathcal{R} \,|\, \bar{\mathcal{S}}) = \sum_{i=1}^{k} C\big(j_i \,\big|\, \overline{\mathcal{S} \cup \mathcal{R} \setminus \{j_1,\dots,j_i\}}\big)$.

Next, the following result shows that it is unlikely to obtain high contributions when the noise values are randomly drawn from their usual distributions:

Proposition 3.2

For any set of observed noise values $\mathbf{n}_{\mathcal{S}}$, it holds that $P\big(C(\mathcal{R} \,|\, \bar{\mathcal{S}}) \geq \alpha\big) \leq e^{-\alpha}$.

The way Proposition 3.2 is phrased, it assumes that all noise variables $N_1,\dots,N_n$ follow their usual distributions $P(N_j)$. This is actually not the case we are interested in. When we observe an anomaly $S(x_n)$, we assume that at least one mechanism is corrupted for the observation of interest, and thus for at least one $j$ the value $n_j$ has been drawn from a different distribution $\tilde{P}(N_j)$. However, one can easily see that the proposition still holds when all $N_j$ not in $\mathcal{R}$ are drawn from a different distribution. We therefore read Proposition 3.2 as saying that it is unlikely that non-corrupted nodes have high contributions.

To find the root cause $j$, we will describe an algorithm that step by step replaces each $n_i$ with a random value $\tilde{n}_i$ and estimates how this affects the probability of the event $E$ in Eq. (3), which we call the outlier event for short. (This is actually oversimplified language, because a value $\tilde{x}_n$ is certainly still an outlier when $g(\tilde{x}_n)$ is slightly below $g(x_n)$.) We observe that for any set $\mathcal{S}$, the probability of obtaining an outlier when the noise values in $\mathcal{S}$ are kept fixed and all others are randomized can be rewritten in terms of contributions of sets:

Lemma 3.3 (outlier probabilities from contributions of sets)

The probability of the outlier event $E$ when the noise variables in $\mathcal{S}$ are kept fixed and those in $\bar{\mathcal{S}}$ are randomized can be written in terms of contributions of $\mathcal{S}$ or $\bar{\mathcal{S}}$, i.e.,

$$P(E \,|\, \mathbf{n}_{\mathcal{S}}) \overset{(i)}{=} e^{C(\mathcal{S}|\bar{\mathcal{S}}) - S(x_n)} \overset{(ii)}{=} e^{-C(\bar{\mathcal{S}}|\mathcal{S})}. \qquad (11)$$

We now consider contributions with respect to a given permutation $\pi$ of the index set $\mathcal{U} = \{1,\ldots,n\}$. To this end, recall from the previous section that $I^{\pi<j}$ denotes the set of nodes which appear before $j$ with respect to the permutation $\pi$, i.e., $I^{\pi<j} = \{i \in \mathcal{U} \,|\, \pi_i < \pi_j\}$, and likewise for superscripts $\pi>j$, $\pi \geq j$, etc. We first define the dominating root cause with respect to a permutation $\pi \in \Pi$.

Definition 3.4 (dominating root cause with respect to $\pi$)

Node $j$ is said to be the dominating root cause with respect to the permutation $\pi$ if for all $i \in I^{\pi \leq j}$ we have

$$(i) \quad C(I^{\pi<i} \,|\, I^{\pi\geq i}) < S(x_n)/4,$$

and further, for all $i \in I^{\pi \geq j}$, we have

$$(ii) \quad C(I^{\pi>i} \,|\, I^{\pi\leq i}) < S(x_n)/4.$$

The definition is motivated by the assumption that sets that do not contain $j$ have no significant contribution. However, demanding that this holds for all subsets would be far too strong, since Proposition 3.2 cannot hold uniformly over all of them. Therefore we only demand it for two dedicated families of subsets ($I^{\pi<i}$ and $I^{\pi>i}$), which are sufficient for our analysis.

First we observe that for all $i \in I^{\pi<j}$, condition (i) of Def. 3.4 together with Lemma 3.3 implies

$$P(E \,|\, \mathbf{n}_{I^{\pi<i}}) \leq e^{-\frac{3}{4} S(x_n)}. \qquad (12)$$

Further, for all $i \in I^{\pi>j}$, Lemma 3.3 and condition (ii) of Def. 3.4 imply

$$P(E \,|\, \mathbf{n}_{I^{\pi<i}}) \geq e^{-\frac{1}{4} S(x_n)}. \qquad (13)$$

In other words, if we increase the set of nodes for which the actual noise values are replaced with random values node by node, starting from the last one in the ordering $\pi$, then $-\log P(E \,|\, \mathbf{n}_{I^{\pi<i}})$ jumps by at least $S(x_n)/2$ when we reach the root cause $i = j$, while at every other node the jumps of the log probabilities are smaller than $S(x_n)/2$. To tell which of the bounds Eq. (12) and Eq. (13) is valid, we need to estimate the probabilities up to an error of

$$\frac{1}{2}\Big(e^{-\frac{1}{4}S(x_n)} - e^{-\frac{3}{4}S(x_n)}\Big) \geq \frac{1}{4}\, e^{-\frac{1}{4}S(x_n)} =: \delta. \qquad (14)$$

The inequality assumes $e^{-S(x_n)/2} \leq 1/2$; otherwise $x_n$ should not be considered an outlier. Next, we investigate how many samples we need in order to estimate the probabilities $P(E \,|\, \mathbf{n}_{I^{\pi<i}})$ up to error level $\delta$:

Lemma 3.5 (Hoeffding bound)

To have $P\big(|\hat{P}(E \,|\, \mathbf{n}_{I^{\pi<i}}) - P(E \,|\, \mathbf{n}_{I^{\pi<i}})| < \delta \ \ \forall i \leq n\big) > 1-\alpha$, we need at least $k = \frac{\log \frac{2n}{\alpha}}{2\delta^2}$ data samples.

As an example, for $\alpha = 0.1$ and $n = 50$ nodes, we would need $\approx 80\, e^{S(x_n)/2}$ data samples. If $S(x_n)$ has been estimated via Eq. (5), it is based on at least $e^{S(x_n)}$ samples.
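As a rough numeric sketch of how the bound of Lemma 3.5 plays out (our illustration; the score value $S$ below is a hypothetical placeholder):

```python
import numpy as np

alpha, n = 0.1, 50
S = 8.0                                      # hypothetical target score S(x_n)
delta = 0.25 * np.exp(-S / 4)                # estimation accuracy from Eq. (14)
k = np.log(2 * n / alpha) / (2 * delta**2)   # Lemma 3.5
print(k)                                     # grows like log(2n/alpha) * e^{S/2}
```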

For practical applications, we propose Algorithm 1 (TOCA, Topologically Ordered Contribution Analysis), which simply infers as the root cause the node $i$ for which the estimated log probabilities show the largest jump. In what follows, we discuss different properties of TOCA.

Conceptual difference to the traversal algorithm: When we have identified a node $j$ for which the log probability of $E$ jumps when including $j$ in the conditioning set, we can be certain that $j$ is a root cause in the sense of having a high contribution. This is not the case for the traversal algorithm (as used by [27, 28, 9]), which infers the most upstream anomaly as the root cause. Let, for instance,

$$X_1 = N_1, \qquad X_2 = 0.1 \cdot X_1 + N_2,$$

where $N_1$ and $N_2$ are standard Gaussians $\mathcal{N}(0,1)$. When $x_1 = -3$ and $n_2 = 3$, traversal identifies $x_1$ as the root cause (because both values $x_1$ and $x_2$ are anomalous), although a more typical value of $X_1$ would have made the target outlier $x_2$ even stronger. Hence, $x_1$ even has a negative contribution in the above sense.
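A short simulation (our own illustration) confirms the negative contribution: fixing the anomalous $x_1$ makes the outlier event less likely than randomizing it:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 200_000
x1_obs, n2_obs = -3.0, 3.0
x2_obs = 0.1 * x1_obs + n2_obs            # observed target value, 2.7

# P(X2 >= x2) with both noises randomized (baseline outlier probability).
base = np.mean(0.1 * rng.normal(size=k) + rng.normal(size=k) >= x2_obs)

# P(X2 >= x2) with x1 kept at its observed anomalous value, N2 randomized.
fix1 = np.mean(0.1 * x1_obs + rng.normal(size=k) >= x2_obs)

# Fixing the anomalous x1 makes the outlier *less* likely than the baseline,
# i.e., x1 has a negative contribution; traversal would still blame it.
print(np.log(fix1 / base))                # < 0
```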

Why do we measure gaps of log probabilities? From the perspective of statistical estimation, it would certainly be easier to look for the index with the largest increase of the probability of $E$ instead of the largest increase of its logarithm. Why don't we work with a definition of contribution based on additive rather than multiplicative increase? The following example shows that a high multiplicative increase can only be caused by rare events, while a high additive increase can happen quite frequently. Consider the boolean causal relation between three binaries, $E = N_1 \wedge N_2$, with $P(N_1 = 1) = 1/2$ and $P(N_2 = 1) = 0.001$. Then we have $P(E=1 \,|\, N_2=1) = 1/2$, while $P(E=1 \,|\, N_1=1, N_2=1) = 1$. Hence, $n_1 = 1$ increases the probability of $E$ by $1/2$, although this value occurs quite frequently. In contrast, a large multiplicative increase is only achieved by the rare event $n_2 = 1$ (in agreement with Proposition 3.2). The conclusion that a large contribution comes from a corrupted mechanism ("the root cause") is only justified because high contributions are unlikely otherwise.

Why do we choose a topological order? So far, we have not made use of the fact that $\pi$ is a topological ordering of the causal DAG. We now employ Proposition 2.1 and note that we can replace all our probabilities with observational probabilities:

$$P(E \,|\, \mathbf{n}_{I^{\pi\leq i}}) = P(E \,|\, \mathbf{x}_{I^{\pi\leq i}}). \qquad (15)$$

In practice, the easiest way to sample from the conditional distribution $P(E \,|\, \mathbf{x}_{I^{\pi\leq i}})$ would still be to just sample the noise values. This may come as a surprise, given that we have advertised not relying on the structural equations as an advantage of the topological ordering. However, the crucial point is that we have nevertheless circumvented the problem of SCMs being ambiguous: we can choose any SCM that generates the observed conditional distributions, regardless of whether it properly describes counterfactuals, because by Eq. (15) the obtained contributions only depend on the observational conditional probabilities.

Algorithm 1 TOCA (Topologically Ordered Contribution Analysis)

1: Input: outlier $x_n$, the FCM between the variables $X_1,\ldots,X_n$, a topological ordering $\pi$ of the nodes
2: Initialize: $\mathcal{A} = \{0\}_{i=1}^{n}$
3: for $\text{itr} \in \{1,\ldots,k\}$ do
4:   draw fresh noise values $\tilde{\mathbf{n}} = (n'_1,\ldots,n'_n)$
5:   $\mathcal{S} = \emptyset$
6:   for $i$ in reverse order of $\pi$ do
7:     $\mathcal{S} \leftarrow \mathcal{S} \cup \{i\}$
8:     $\mathcal{A}_i \leftarrow \mathcal{A}_i + \mathbb{I}\big[g(F(\tilde{\mathbf{n}}_{\mathcal{S}})) \geq g(x_n)\big]$
9:   end for
10: end for
11: return $\mathop{\mathrm{argmax}}_{i \in \{1,\ldots,n\}} \log \frac{\mathcal{A}_{i-1}}{\mathcal{A}_i}$

Here $\tilde{\mathbf{n}}_{\mathcal{S}}$ denotes the noise vector whose entries in $\mathcal{S}$ take the freshly drawn values and whose remaining entries keep their observed values, and the indices in line 11 refer to positions in the ordering $\pi$.
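The following is a minimal Python sketch of Algorithm 1, under the additional assumption that we can draw fresh noise vectors and re-evaluate $F$ on mixed noise values; all names are our own and the implementation is illustrative, not the authors' code:

```python
import numpy as np

def toca(F, noise_obs, draw_noise, g, x_n, order, k=10_000, rng=None):
    """order: topological ordering of node indices; returns inferred root cause.
    noise_obs: noise values reconstructed from the observed (anomalous) sample."""
    rng = rng or np.random.default_rng(0)
    n = len(noise_obs)
    A = np.zeros(n)
    for _ in range(k):
        fresh = draw_noise(rng)               # one fresh noise vector (line 4)
        noise = np.array(noise_obs, float)
        for i in reversed(order):             # randomize from the back (line 6)
            noise[i] = fresh[i]               # S <- S ∪ {i} (line 7)
            A[i] += g(F(noise)) >= g(x_n)     # count outlier events (line 8)
    logp = np.log(np.maximum(A, 1.0) / k)     # estimated log outlier probabilities
    pos = list(reversed(order))
    # the largest downward jump of log P(E | ...) marks the root cause (line 11)
    jumps = [logp[pos[t]] - logp[pos[t + 1]] for t in range(n - 1)]
    return pos[int(np.argmax(jumps)) + 1]
```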

4 Causal relations between outliers via calibrated anomaly scores

We have discussed methods that do not require SCMs but only the graphical model, that is, the causal DAG endowed with observational conditional probabilities $P(X_j \,|\, Pa_j)$. However, inferring the latter is statistically ill-posed, particularly when the number of parents is large. While the additive noise assumption essentially reduces the problem to the regression problem of estimating conditional expectations, it remains ill-posed because analyzing outliers amounts to probing the regression models in regions of low density. Here we describe approaches that infer causal relations between anomalies from a given causal DAG together with marginal anomaly scores $S(x_j)$, using the simple estimator in Eq. (5). To this end, we start with the toy example of cause-effect pairs and then generalize to more than two variables.

4.1 Outliers of cause-effect pairs

Let the causal DAG be $X \to Y$ for two variables $X, Y$, and suppose we observe co-occurring anomalies with anomaly scores $S(x), S(y)$. We now argue that the anomaly $x$ is implausible as the unique root cause of the anomaly $y$ if $S(x) \ll S(y)$. This can be proven subject to the following assumptions:

  1. Monotonicity: increasing the anomaly score of $X$ does not decrease the probability of an anomaly event at $Y$:
$$P\{S(Y) \geq S(y) \,|\, S(X) = S(x)\} \leq P\{S(Y) \geq S(y) \,|\, S(X) \geq S(x)\}. \qquad (16)$$
  2. Injectivity: the mapping $x \mapsto S(x)$ is one-to-one. This is the case, for instance, for scores obtained from one-sided tail probabilities, that is, for $g(x) := x$.

We then define two different null hypotheses stating that the mechanisms $P(X)$ or $P(Y|X)$, respectively, worked as usual: $H_0^X$: $x$ has been drawn from $P(X)$; and $H_0^Y$: $y$ has been drawn from $P(Y \,|\, X = x)$ (where $P(X,Y)$ denotes the joint distribution of $X, Y$ in the normal mode). Note that $H_0^Y$ allows $X$ to be drawn from an arbitrary distribution instead of $P(X)$; only the mechanism generating $y$ from $x$ has remained the same. We then have the following criteria for rejecting $H_0^X$ and $H_0^Y$:

Lemma 4.1 (rejecting $H_0^X$)

$H_0^X$ can be rejected at level $p \leq e^{-S(x)}$.

This follows immediately from the definition of IT scores in Eq. (4). On the other hand, we obtain:

Lemma 4.2 (rejecting $H_0^Y$)

Subject to the Monotonicity and Injectivity assumptions, $H_0^Y$ can be rejected at level $p \leq e^{-(S(y)-S(x))}$.

Lemma 4.1 and Lemma 4.2 entail interesting consequences for the case where the causal direction is not known. Let $S(x) = 10$ and $S(y) = 5$. For $X \to Y$ we can only reject $H_0^X$, while the mechanism $P(Y|X)$ possibly worked as expected. For $Y \to X$, however, we would reject both hypotheses, that $P(Y)$ and that $P(X|Y)$ worked as expected, with p-level $e^{-5}$ each. Following the working hypothesis that at most one mechanism was corrupted, we thus prefer $X \to Y$.

Small sample version. When the anomaly scores are estimated from a small number of observations according to Eq. (5), a comparison of scores can still be insightful: if $\hat{S}(y) \gg \hat{S}(x)$, then for a large fraction of observations $(\tilde{x}, \tilde{y})$ for which $g(\tilde{x}) \geq g(x)$, the statement $g(\tilde{y}) \geq g(y)$ does not hold. Hence, the event $g(\tilde{x}) \geq g(x)$ is not a sufficient cause [44] for the event $g(\tilde{y}) \geq g(y)$. In this sense, we can still exclude the value with the smaller score as a candidate for the root cause, if we are interested in sufficient causes.
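A small-sample sketch of this comparison (our illustration, with a hypothetical linear Gaussian pair): estimate both scores via Eq. (5) and check how rarely $x$-extremeness entails $y$-extremeness:

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(size=5_000)
ys = 0.8 * xs + rng.normal(size=5_000)     # hypothetical normal-mode pairs
x, y = 1.0, 4.0                            # observed anomalous pair

S_x = -np.log(np.mean(xs >= x))                      # Eq. (5), g = identity
S_y = -np.log(max(np.mean(ys >= y), 1 / len(ys)))    # capped at log k
entail = np.mean(ys[xs >= x] >= y)         # how often x-extremeness suffices
print(S_x, S_y, entail)                    # S_y >> S_x and entail ~ 0:
                                           # x is not a sufficient cause of y
```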

4.2 Outliers in DAGs with multiple nodes

To infer which node in a DAG with $n$ nodes has been corrupted, we start with the hypothesis that all nodes except $j$ worked as expected. $H_0^j$: for the anomaly event $(x_1,\dots,x_n)$, all $x_i$ with $i \neq j$ have been drawn from $P(X_i \,|\, Pa_i = pa_i)$. To define a test statistic, we first introduce the conditional outlier scores from [45]:

$$S(x_i \,|\, pa_i) := -\log P(g(X_i) \geq g(x_i) \,|\, Pa_i = pa_i),$$

where $g$ is some feature function as before (we actually allow variable-dependent feature functions $g_i$, but they need to coincide between conditional and unconditional outlier scores). With variable inputs $x_i, pa_i$, these define random variables $S(X_i \,|\, Pa_i)$. Further, we add the following assumption to Monotonicity and Injectivity from Subsection 4.1:

  3. Continuity: all variables $X_i$ are continuous with a density w.r.t. the Lebesgue measure, and all conditional distributions $P(X_i \,|\, Pa_i = pa_i)$ also have such a density.

This condition ensures that all conditional outlier scores are distributed according to the density $p(s)=e^{-s}$, because the cumulative distribution function of a continuous variable is uniformly distributed. It entails a property that will be convenient for testing $H^j_0$:

Lemma 4.3 (independence of conditional outlier scores)

Subject to the continuity assumption, $S(X_1\,|\,Pa_1),\dots,S(X_n\,|\,Pa_n)$ are independent random variables.

Independence of conditional scores enables the definition of a simple test statistic, obtained by summing the individual scores together with a correction term (note that the summing resembles known methods of aggregating p-values, but on a logarithmic scale [46]). With the cumulative score $S_{cum} := \sum_{i\neq j} S(X_i\,|\,Pa_i)$, we define:

$$S := S_{cum} - \log\sum_{l=0}^{n-2}\frac{S_{cum}^{l}}{l!}. \tag{17}$$

To understand Eq. 17, note that the second term is needed to 'recalibrate': the sum of independent IT scores is no longer an IT score, but with the extra term it is. Intuitively, one may think of the extra term as a multiple-hypothesis-testing correction: the sum of independent IT scores is likely to be large simply because it is not unlikely that the set contains one large term. The following result states that this is quantitatively the right correction:

Lemma 4.4 (distribution of cumulated score)

If $H^j_0$ holds, the statistic $S$ in Eq. 17 is distributed according to the density $p(s)=e^{-s}$.

As a direct result of the above lemmas we have:

Theorem 4.5 (p-value for $H^j_0$ from conditional outlier scores)

$H^j_0$ can be rejected for the observation $(x_1,\dots,x_n)$ with p-value

$$p = e^{-S_{cum}}\cdot\sum_{l=0}^{n-2}\frac{S_{cum}^{l}}{l!}.$$

The theorem justifies choosing the index $j$ with the maximal conditional outlier score as the root cause, whenever we follow the working hypothesis that only one mechanism is corrupted.
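As a sanity check of the aggregation in Eq. 17 and Theorem 4.5, here is a minimal sketch, assuming the conditional outlier scores have already been estimated; the names and example values are illustrative. It exploits that the sum in the theorem is exactly the survival function of a Gamma$(n-1, 1)$ distribution.

```python
import numpy as np
from scipy.stats import gamma

def p_value_h0j(cond_scores_excluding_j):
    # Under H0^j the n-1 conditional scores S(x_i|pa_i), i != j, are
    # i.i.d. Exp(1) (Lemma 4.3), so S_cum is Gamma(n-1, 1) distributed
    # and p = exp(-S_cum) * sum_{l=0}^{n-2} S_cum^l / l! (Theorem 4.5),
    # which is the Gamma survival function.
    s_cum = float(np.sum(cond_scores_excluding_j))
    return gamma.sf(s_cum, a=len(cond_scores_excluding_j))

scores = np.array([0.2, 0.9, 7.5, 0.4])   # hypothetical S(x_i | pa_i)
j = int(np.argmax(scores))                # candidate root cause
print(j, p_value_h0j(np.delete(scores, j)))
```

Choosing $j$ as the argmax leaves the smallest remaining scores and hence the largest p-value, in line with the theorem.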

We started this subsection with the goal of avoiding the estimation of conditional probabilities. Therefore, we now replace conditional outlier scores with bounds derived from marginal anomaly scores, following the ideas from the bivariate case. To this end, we replace the monotonicity assumption from there with two stronger ones. We introduce the 'outlier events' $E_i := \{g(X_i)\geq g(x_i)\}$ and postulate:

  • 1a. Monotonicity:

    $$P\{E_i \mid g(Pa_i)=g(pa_i)\} \;\leq\; P\{E_i \mid g(Pa^{1}_{i})\geq g(pa^{1}_{i}),\dots,g(Pa^{k}_{i})\geq g(pa^{k}_{i})\},$$

  • 1b. Non-negative dependence: for any $k$ nodes $\{i_1,\dots,i_k\}$:

    $$P(E_{i_1},\dots,E_{i_k}) \;\geq\; P(E_{i_1})\cdots P(E_{i_k}),$$

where $Pa^{1}_{i},\dots,Pa^{k}_{i}$ denote the parents of $X_i$. Both conditions are close in spirit in that they assume mutual positive influence of outlier events. While 1a is explicitly causal, 1b is a purely statistical condition, but it is also justified by an implicit causal model in which outliers at one point of the system render outliers at other points more likely, not less. We then obtain the following bound on the joint statistic $S_{cum}$:

Theorem 4.6 (p-value from increments of scores)

With the cumulative anomaly score increments $\hat{S}_{cum} := \sum_{i\neq j}\left|S(x_i)-\sum_{l}S(pa^{l}_{i})\right|_{+}$, and if conditions 1a and 1b hold, $H^j_0$ can be rejected with

$$p \leq e^{-\hat{S}_{cum}}\cdot\sum_{l=0}^{n-2}\frac{\hat{S}_{cum}^{l}}{l!}.$$
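A minimal sketch of how this bound might be evaluated from marginal scores alone, assuming the DAG's parent sets are known; the toy graph and all names are illustrative.

```python
import numpy as np
from scipy.stats import gamma

def p_bound_h0j(scores, parents, j):
    # Cumulative increments |S(x_i) - sum_l S(pa_i^l)|_+ over i != j
    # lower-bound S_cum, so the Gamma survival function evaluated at
    # their sum upper-bounds the p-value of H0^j (Theorem 4.6).
    s_hat = sum(max(scores[i] - sum(scores[p] for p in parents[i]), 0.0)
                for i in range(len(scores)) if i != j)
    return gamma.sf(s_hat, a=len(scores) - 1)

scores = [0.3, 6.0, 6.2]                  # chain of three nodes
parents = {0: [], 1: [0], 2: [1]}
print([p_bound_h0j(scores, parents, j) for j in range(3)])
# Only j = 1 (the node where the score jumps) survives rejection.
```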

4.3 Finding root causes by minimizing score differences

Theorem 4.6 has interesting consequences for the scenario where $n$ variables form the causal chain

$$X_1 \to X_2 \to \cdots \to X_n. \tag{18}$$

In this case, $\hat{S}_{cum}$ reduces to the sum of score differences for adjacent nodes. Accordingly, $H^j_0$ can then be rejected with p-value bounded by:

$$p \leq \begin{cases} e^{-S(x_1)-\sum_{i\neq j,\,i\geq 2}|S(x_i)-S(x_{i-1})|_{+}} & \text{for } j\neq 1\\ e^{-\sum_{i\geq 2}|S(x_i)-S(x_{i-1})|_{+}} & \text{for } j=1\end{cases} \tag{19}$$

We conclude that $H^j_0$ needs to be rejected whenever the anomaly score increases significantly along the chain of anomalies at any $i\neq j$. This justifies inferring as the unique root cause the index $j$ that maximizes the score difference $|S(x_j)-S(x_{j-1})|$ (with the difference being $|S(x_1)-0|$ for the first node, relative to an 'imaginary non-anomalous node' $X_0$), because this yields the weakest bound in Eq. 19. Motivated by these bounds, we propose the algorithm SMOOTH TRAVERSAL, which selects as the root cause the node that shows the strongest increase in anomaly score compared to its most anomalous parent; a minimal sketch follows below. In contrast to the usual Traversal [27, 28, 9], this avoids choosing an arbitrary threshold above which a node is considered anomalous.
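The sketch below shows the selection rule under these assumptions (marginal scores plus the causal graph); it is illustrative, not the exact implementation.

```python
def smooth_traversal(scores, parents):
    # Return the node with the strongest increase in anomaly score over
    # its most anomalous parent; root nodes are compared against an
    # imaginary non-anomalous parent with score 0.  No threshold needed.
    def increase(i):
        return scores[i] - max((scores[p] for p in parents[i]), default=0.0)
    return max(range(len(scores)), key=increase)

# Chain with a jump in anomaly score at the second node:
print(smooth_traversal([0.3, 6.0, 6.2], {0: [], 1: [0], 2: [1]}))  # -> 1
```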

4.4 Shortlist of root causes via ordering anomaly scores

We now drop the final assumption: the causal DAG is not known and we are only given the marginal anomaly scores $S(x_1),\dots,S(x_n)$ of each node. How can we find the root cause? We argue that the top-scored anomalies provide a shortlist of root causes. The idea is, again, that the anomaly $x_i$ is unlikely to cause the downstream anomaly $x_k$ if $S(x_i)\ll S(x_k)$, unless we allow additional mechanisms (other than that of $i$) to be corrupted. To show this, we restrict attention to a DAG with three nodes, in which we have merged all paths from $X_i$ to $X_k$ into the arrow $X_i\to X_k$ and all nodes upstream of $X_i$ into a single huge confounder $X_l$. In analogy to the proof of Theorem 4.6 and with the same assumptions, we can bound the conditional outlier score by the difference between the node's score and the sum of its parents' scores: $S(x_k\,|\,x_i,x_l) \geq |S(x_k)-S(x_i)-S(x_l)|_{+}$. Assuming $S(x_l)\approx 0$, the hypothesis that the mechanism $P(X_k\,|\,X_i,X_l)$ worked as expected can be rejected at a p-level of about $e^{S(x_i)-S(x_k)}$. This motivates the algorithm SCORE ORDERING, which chooses the top-scored anomalies as candidates for root causes. We do not provide pseudocode due to its simplicity.

5 Experiments

We provide a comparison of the performance of different approaches on simulated and real-world data. We compare our algorithms TOCA, SMOOTH TRAVERSAL, and SCORE ORDERING to ‘Traversal’[9, 27, 28], ‘Counterfactual’[23], and Circa [25]. We provide a short description of each of the algorithms below:

  • TOCA (ours; refer to Algorithm 1) traverses the nodes in reverse topological ordering and identifies the node with the highest jump in estimated log probabilities as the root cause.

  • SMOOTH TRAVERSAL (ours; refer to Section 4.3) identifies as the root cause the node that shows the strongest increase in anomaly score compared to its most anomalous parent.

  • SCORE ORDERING (ours; refer to Section 4.4) takes the node with the highest anomaly score as the root cause. It is the only approach that requires neither the causal graph nor an SCM, using only the estimated anomaly scores.

  • ‘Traversal’[9, 27, 28] identifies a node as a root cause under the two conditions: 1) none of its parents exhibit anomalous behaviour, and 2) it is linked to the anomalous target node through a path of anomalous nodes.

  • Circa [25] fits a linear Gaussian SCM to the data in the non-anomalous regime and compares the predicted value of each variable, given its parents (hence, it uses the causal graph), to the observed value in the anomalous regime. The node whose value differs the most is identified as the root cause.

  • Counterfactual [23] finds the Shapley contribution of each node to the outlierness of the target through full counterfactual analysis and outputs the node with the highest Shapley value as the root cause.

We assume that only a single data point is available in the anomalous regime. We therefore do not compare to approaches that involve learning the causal graph (e.g., Root Cause Discovery [31]), nor do we compare to approaches such as ε-Diagnosis [11], which perform two-sample tests to identify which nodes have changed significantly from the non-anomalous to the anomalous regime.

We generate random SCMs with 50 nodes with linear and non-linear structural equations (more details on the generation process in Appendix B) and sample 1000 observations from the SCM in the non-anomalous regime to train the IT anomaly scorer (as in Eq. 5). To produce anomalous data, we first choose a root cause at random from the list of all nodes and a target node from the set of its descendants (including itself). Second, we sample the noise of all variables, modify the noise of the root cause by adding $x$ standard deviations, where $x\in\{2,2.1,\dots,3\}$, and propagate the noises through the SCM to obtain a realization of each variable (a minimal sketch of this injection step is given after this paragraph). We repeat this process 100 times for each value of $x$ and consider an algorithm successful if its top-ranking result coincides with the chosen root cause.
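To make the injection step concrete, here is a minimal sketch assuming, for brevity, a linear SCM given by a strictly upper-triangular weight matrix `W` (topologically ordered); the actual experiments also use neural-network mechanisms, and all names are illustrative.

```python
import numpy as np

def sample_with_root_cause(W, root_cause, strength, rng):
    # Sample all noise terms, shift the root cause's noise by `strength`
    # standard deviations, and propagate through the (topologically
    # ordered) linear SCM  x_k = sum_i W[i, k] * x_i + n_k.
    n = W.shape[0]
    noise = rng.normal(size=n)
    noise[root_cause] += strength
    x = np.zeros(n)
    for k in range(n):
        x[k] = W[:, k] @ x + noise[k]
    return x

rng = np.random.default_rng(0)
W = np.triu(rng.uniform(-1, 1, (5, 5)), k=1)   # random 5-node DAG
print(sample_with_root_cause(W, root_cause=2, strength=3.0, rng=rng))
```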

Through our experiments, we aim to answer the following questions:

How do different algorithms compare, and how does this performance depend on the strength of the anomaly at the root cause? In Fig. 1 we look at the performance of the different algorithms against the strength (the number of standard deviations that the noise term deviates from its mean) of the anomaly at the root cause. We find that the strongest-performing algorithms are SMOOTH TRAVERSAL, Traversal, and Counterfactual, all of which outperform TOCA and SCORE ORDERING, which have comparable performance to each other. Circa performs considerably worse than the other approaches, which we suspect is due to its assumption of linearity. The performance of TOCA and SCORE ORDERING improves as the strength of the anomaly increases. This is expected, given that both algorithms aim to find the unique anomalous node.

How do the approaches scale with increasing graph sizes (both in performance and run times)? For a fixed amount of injected perturbation at the root cause (3 standard deviations away from its mean), we look at the runtimes of the different algorithms for an SCM with 100 nodes generated as described above. As shown in Fig. 2 in Appendix B, Traversal and SMOOTH TRAVERSAL are the fastest, with the remaining approaches having comparable average run times; however, Counterfactual has a long tail, with times for 50-node graphs running into the tens of seconds.

How is performance affected by increasing causal graph density? We generate SCMs with the number of nodes in $\{20,40,\ldots,100\}$ with a fixed amount of injected perturbation at the root cause (3 standard deviations away from its mean). In Fig. 3 in Appendix B we see that the performance of SCORE ORDERING and TOCA slightly decreases with the size of the SCM.

[Figure 1: performance of the algorithms against the strength of the anomaly injected at the root cause.]

When comparing the performance, we should however keep in mind that Counterfactual requires the SCM, while Traversal and SMOOTH TRAVERSAL require only the causal graph; this is a clear disadvantage of Counterfactual relative to the other two algorithms. In addition, SMOOTH TRAVERSAL does not have a free parameter (a threshold on scores), which Traversal does. SCORE ORDERING and TOCA have similar performance, but SCORE ORDERING does not even require knowledge of the causal graph, whereas TOCA does. Nevertheless, TOCA has the advantage over all others (except Counterfactual) that the inferred gap of log probabilities witnesses a true causal contribution, which is not the case for the other methods, since upstream anomalies may not be causally responsible for the target anomaly (as argued in Section 3).

We have also tried our algorithms on the 'PetShop dataset' [29], where the task is to infer the root cause of performance drops in a cloud-based application. This presents a challenging task due to high missingness, low sample sizes, and near-constant variables. Furthermore, the causal ground truth is only partially known, if one accepts the inverted call graph (showing which services call which in the application) as a proxy for the causal graph. The results are shown in Appendix B.3: SMOOTH TRAVERSAL and SCORE ORDERING perform well, while TOCA fails, probably because it relies heavily not only on the causal graph but also on the SCM.

6 Conclusions

Without challenging the approach of [23], which offers a clean definition of the root cause and its quantitative contribution, we have explored several directions in which the practical limitations of the corresponding algorithm provided in DoWhy [47] can be mitigated without sacrificing too much of its theoretical justification: first, by avoiding rung-3 causal models and high computational load; second, by avoiding the estimation of conditional probabilities; and third, by not even relying on the causal DAG.

References

  • [1] S. Wibisono, M. T. Anwar, Aji Supriyanto, and I. H. A. Amin. Multivariate weather anomaly detection using DBSCAN clustering algorithm. In Journal of Physics: Conference Series, volume 1869, page 012077. IOP Publishing, 2021.
  • [2] Eric V. Strobl and Thomas A. Lasko. Identifying patient-specific root causes with the heteroscedastic noise model. Journal of Computational Science, 72:102099, 2023.
  • [3] Eric V. Strobl. Counterfactual formulation of patient-specific root causes of disease. Journal of Biomedical Informatics, page 104585, 2024.
  • [4] Gian Antonio Susto, Matteo Terzi, and Alessandro Beghi. Anomaly detection approaches for semiconductor manufacturing. Procedia Manufacturing, 11:2018–2024, 2017.
  • [5] Jellis Vanhoeyveld, David Martens, and Bruno Peeters. Value-added tax fraud detection with scalable anomaly detection techniques. Applied Soft Computing, 86:105895, 2020.
  • [6] Sanjiv Das, Richard Stanton, and Nancy Wallace. Algorithmic fairness. Annual Review of Financial Economics, 15:565–593, 2023.
  • [7] Dongjie Wang, Zhengzhang Chen, Jingchao Ni, Liang Tong, Zheng Wang, Yanjie Fu, and Haifeng Chen. Hierarchical graph neural networks for causal discovery and root cause localization. arXiv preprint arXiv:2302.01987, 2023.
  • [8] Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, and Wen-Chih Peng. Root cause analysis in microservice using neural Granger causal discovery. arXiv preprint arXiv:2402.01140, 2024.
  • [9] Dewei Liu, Chuan He, Xin Peng, Fan Lin, Chenxi Zhang, Shengfang Gong, Ziang Li, Jiayu Ou, and Zheshun Wu. MicroHECL: high-efficient root cause localization in large-scale microservice systems. In Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP '21, pages 338–347. IEEE Press, 2021.
  • [10] Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. Root cause analysis of failures in microservices through causal discovery. Advances in Neural Information Processing Systems, 35:31158–31170, 2022.
  • [11] Huasong Shan, Yuan Chen, Haifeng Liu, Yunpeng Zhang, Xiao Xiao, Xiaofeng He, Min Li, and Wei Ding. ε-diagnosis: Unsupervised and real-time diagnosis of small-window long-tail latency in large-scale microservice platforms. In The World Wide Web Conference, pages 3215–3222, 2019.
  • [12] Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, and Ping Wang. AutoMAP: Diagnose your microservice-based web applications automatically. In Proceedings of The Web Conference 2020, pages 246–258, 2020.
  • [13] Douglas M. Hawkins. Identification of Outliers, volume 11. Springer, 1980.
  • [14] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.
  • [15] Charu C. Aggarwal. An Introduction to Outlier Analysis. Springer, 2017.
  • [16] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A. Lozano. A review on outlier/anomaly detection in time series data. ACM Computing Surveys (CSUR), 54(3):1–33, 2021.
  • [17] Leman Akoglu. Anomaly mining: Past, present and future. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1–2, 2021.
  • [18] Edwin M. Knorr and Raymond T. Ng. Finding intensional knowledge of distance-based outliers. In VLDB, volume 99, pages 211–222, 1999.
  • [19] Barbora Micenková, Raymond T. Ng, Xuan-Hong Dang, and Ira Assent. Explaining outliers by subspace separability. In 2013 IEEE 13th International Conference on Data Mining, pages 518–527. IEEE, 2013.
  • [20] Ninghao Liu, Donghwa Shin, and Xia Hu. Contextual outlier interpretation. arXiv preprint arXiv:1711.10589, 2017.
  • [21] Meghanath Macha and Leman Akoglu. Explaining anomalies in groups with characterizing subspace rules. Data Mining and Knowledge Discovery, 32:1444–1480, 2018.
  • [22] Nikhil Gupta, Dhivya Eswaran, Neil Shah, Leman Akoglu, and Christos Faloutsos. Beyond outlier detection: LookOut for pictorial explanation. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part I 18, pages 122–138. Springer, 2019.
  • [23] Kailash Budhathoki, Lenon Minorics, Patrick Bloebaum, and Dominik Janzing. Causal structure-based root cause analysis of outliers. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2357–2369. PMLR, 17–23 Jul 2022.
  • [24] Dongjie Wang, Zhengzhang Chen, Jingchao Ni, Liang Tong, Zheng Wang, Yanjie Fu, and Haifeng Chen. Interdependent causal networks for root cause localization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5051–5060, 2023.
  • [25] Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, and Dan Pei. Causal inference-based root cause analysis for online service systems with intervention recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3230–3240, 2022.
  • [26] Dongjie Wang, Zhengzhang Chen, Yanjie Fu, Yanchi Liu, and Haifeng Chen. Incremental causal graph learning for online root cause analysis. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2269–2278, 2023.
  • [27] Pengfei Chen, Yong Qi, Pengfei Zheng, and Di Hou. CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, pages 1887–1895, 2014.
  • [28] JinJin Lin, Pengfei Chen, and Zibin Zheng. Microscope: Pinpoint performance issues with causal graphs in micro-service environments. In International Conference on Service Oriented Computing, 2018.
  • [29] Michaela Hardt, William Orchard, Patrick Blöbaum, Shiva Kasiviswanathan, and Elke Kirschbaum. The PetShop dataset – finding causes of performance issues across microservices, 2023.
  • [30] Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. Sage: practical and scalable ML-driven performance debugging in microservices. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '21, pages 135–151, New York, NY, USA, 2021. Association for Computing Machinery.
  • [31] Azam Ikram. Sock-shop data. https://github.com/azamikram/rcd/tree/master/sock-shop-data, 2023.
  • [32] Amin Jaber, Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Causal discovery from soft interventions with unknown targets: Characterization and learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9551–9561. Curran Associates, Inc., 2020.
  • [33] Nicola Gnecco, Nicolai Meinshausen, Jonas Peters, and Sebastian Engelke. Causal discovery in heavy-tailed models. The Annals of Statistics, 49(3):1755–1778, 2021.
  • [34] Carlos Améndola, Benjamin Hollering, Seth Sullivant, and Ngoc Tran. Markov equivalence of max-linear Bayesian networks. In Cassio de Campos and Marloes H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pages 1746–1755. PMLR, 27–30 Jul 2021.
  • [35] J. Pearl and J. Mackenzie. The Book of Why. Basic Books, USA, 2018.
  • [36] J. Pearl. Causality. Cambridge University Press, 2000.
  • [37] Kun Zhang, Zhikun Wang, Jiji Zhang, and Bernhard Schölkopf. On estimation of functional causal models: general results and application to the post-nonlinear causal model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):1–22, 2015.
  • [38] Julius von Kügelgen, Abdirisak Mohamed, and Sander Beckers. Backtracking counterfactuals. In Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing, editors, Proceedings of the Second Conference on Causal Learning and Reasoning, volume 213 of Proceedings of Machine Learning Research, pages 177–196. PMLR, 11–14 Apr 2023.
  • [39] A. Balke and J. Pearl. Counterfactual probabilities: Computational methods, bounds, and applications. In R. Lopez de Mantaras and D. Poole, editors, Uncertainty in Artificial Intelligence, volume 10. Morgan Kaufmann, San Mateo, 1994.
  • [40] Akash Maharaj, Ritwik Sinha, David Arbour, Ian Waudby-Smith, Simon Z. Liu, Moumita Sinha, Raghavendra Addanki, Aaditya Ramdas, Manas Garg, and Viswanathan Swaminathan. Anytime-valid confidence sequences in an enterprise A/B testing platform. In Companion Proceedings of the ACM Web Conference 2023, WWW '23 Companion, pages 396–400, New York, NY, USA, 2023. Association for Computing Machinery.
  • [41] Dominik Janzing, Patrick Blöbaum, Atalanti A. Mastakouri, Philipp M. Faller, Lenon Minorics, and Kailash Budhathoki. Quantifying intrinsic causal contributions via structure preserving interventions. In International Conference on Artificial Intelligence and Statistics, pages 2188–2196. PMLR, 2024.
  • [42] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.
  • [43] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
  • [44] Judea Pearl. Sufficient causes: On oxygen, matches, and fires. Journal of Causal Inference, 7(2), 2019.
  • [45] Dominik Janzing, Kailash Budhathoki, Lenon Minorics, and Patrick Blöbaum. Causal structure based root cause analysis of outliers. arXiv:1912.02724, 2019.
  • [46] R. A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, 1925.
  • [47] Patrick Blöbaum, Peter Götz, Kailash Budhathoki, Atalanti A. Mastakouri, and Dominik Janzing. DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models, 2022.

Appendix A Proofs

A.1 Proof of Proposition 2.1

With the event $E$ as in Eq. 3 and $C^{\pi}(j)$ as in Eq. 8 we have

$$C^{\pi}(j) = \log\frac{P(E\,|\,{\bf N}_{\pi>j\,\cup\{j\}}={\bf n}_{\pi>j\,\cup\{j\}})}{P(E\,|\,{\bf N}_{\pi>j}={\bf n}_{\pi>j})} \overset{(i)}{=} \log\frac{P(E\,|\,{\bf X}_{\pi>j\,\cup\{j\}}={\bf x}_{\pi>j\,\cup\{j\}})}{P(E\,|\,{\bf X}_{\pi>j}={\bf x}_{\pi>j})} \overset{(ii)}{=} \log\frac{P(E\,|\,do({\bf X}_{\pi>j\,\cup\{j\}}={\bf x}_{\pi>j\,\cup\{j\}}))}{P(E\,|\,do({\bf X}_{\pi>j}={\bf x}_{\pi>j}))}. \tag{20}$$

(i) and (ii) are seen as follows:

$$P(E\,|\,{\bf N}_{\mathcal{I}}) = P(E\,|\,{\bf X}_{\mathcal{I}}) = P(E\,|\,do({\bf X}_{\mathcal{I}})). \tag{21}$$

The first equality in Eq. 21 follows from $X_n \perp I_{\mathcal{I}}\,|\,{\bf N}_{\mathcal{I}}$ and because ${\bf X}_{\mathcal{I}}$ is a function of ${\bf N}_{\mathcal{I}}$. The second follows because conditioning on all ancestors blocks all backdoor paths. Note that since $\pi$ is a reverse topological ordering of the nodes, all nodes in $\pi>j$ are ancestors of $j$.

A.2 Proof of Lemma 3.1

\begin{align*}
C(\mathcal{R}\,|\,\bar{\mathcal{S}}) &= \log q(\mathcal{S}\cup\mathcal{R}) - \log q(\mathcal{S})\\
&= \big(\log q(\mathcal{S}\cup\mathcal{R}) - \log q(\mathcal{S}\cup\mathcal{R}\setminus\{j_1\})\big) + \big(\log q(\mathcal{S}\cup\mathcal{R}\setminus\{j_1\}) - \log q(\mathcal{S}\cup\mathcal{R}\setminus\{j_1,j_2\})\big) + \ldots\\
&\quad + \big(\log q(\mathcal{S}\cup\{j_k\}) - \log q(\mathcal{S})\big) = \sum_{i=1}^{k} C(j_i\,|\,\overline{\mathcal{S}\cup\mathcal{R}\setminus\{j_1,\dots,j_i\}}).
\end{align*}

A.3 Proof of Proposition 3.2

By definition we have that

$$C(R\,|\,\bar{S}) = \log\frac{P(E\,|\,{\bf N}_R={\bf n}_R,\,{\bf N}_S={\bf n}_S)}{P(E\,|\,{\bf N}_S={\bf n}_S)},$$

and we write $P(E\,|\,{\bf n}_S)$ instead of $P(E\,|\,{\bf N}_S={\bf n}_S)$ when it is clear from the context.

Further, $C(R\,|\,\bar{S})$ is actually a function of ${\bf n}_R$ and ${\bf n}_S$. For fixed ${\bf n}_S$, define the set $B := \{{\bf n}_R \,|\, C(R\,|\,\bar{S}) \geq \alpha\}$. It can equivalently be described by

$$B = \left\{{\bf n}_R \,\Big|\, \log\frac{P(E\,|\,{\bf n}_R,{\bf n}_S)}{P(E\,|\,{\bf n}_S)} \geq \alpha\right\} = \left\{{\bf n}_R \,\big|\, P(E\,|\,{\bf n}_R,{\bf n}_S) \geq P(E\,|\,{\bf n}_S)\cdot e^{\alpha}\right\}.$$

We thus have

$$P(E\,|\,{\bf n}_R\in B,\,{\bf n}_S) \geq P(E\,|\,{\bf n}_S)\cdot e^{\alpha}.$$

Hence,

$$\frac{P(E,\,{\bf n}_R\in B\,|\,{\bf n}_S)}{P({\bf n}_R\in B\,|\,{\bf n}_S)} \geq P(E\,|\,{\bf n}_S)\cdot e^{\alpha}.$$

Using

$$P(E\,|\,{\bf n}_S) \geq P(E,\,{\bf n}_R\in B\,|\,{\bf n}_S)$$

we obtain

$$P({\bf n}_R\in B\,|\,{\bf n}_S) \leq e^{-\alpha}.$$

A.4 Proof of Lemma 3.3

\begin{align*}
C(\bar{\mathcal{S}}\,|\,\mathcal{S}) &= \log\frac{q(\bar{\mathcal{S}})}{q(\emptyset)} \overset{(i)}{=} \log q(\bar{\mathcal{S}}) + S(x_n)\\
&= \log P(E\,|\,{\bf n}_{\bar{\mathcal{S}}}) + S(x_n)\\
\Rightarrow\; P(E\,|\,{\bf n}_{\bar{\mathcal{S}}}) &= e^{C(\bar{\mathcal{S}}|\mathcal{S})-S(x_n)} \overset{(ii)}{=} e^{-C(\mathcal{S}|\bar{\mathcal{S}})},
\end{align*}

where (i) follows from $S(x_n) = -\log P(E)$ and (ii) follows from the fact that $S(x_n) = \sum_i C(\mathcal{I}_i\,|\,\cdot)$ as long as $\cup_i\,\mathcal{I}_i = \mathcal{U}$.

A.5 Proof of Lemma 3.5

The proof follows by using Hoeffding's bound together with the union bound. If $k$ data points are used for the estimation of each $P(E\,|\,{\bf n}_{I^{\pi<i}})$, Hoeffding's bound gives, for each $i$:

$$P\left(\left|\hat{P}(E\,|\,{\bf n}_{I^{\pi<i}}) - P(E\,|\,{\bf n}_{I^{\pi<i}})\right| < \delta\right) > 1 - 2e^{-2k\delta^{2}}.$$

By the union bound, this holds simultaneously for all $i$ with probability at least $1-2ne^{-2k\delta^{2}}$. To have $1-2ne^{-2k\delta^{2}} > 1-\alpha$ we need:

$$2ne^{-2k\delta^{2}} < \alpha \;\Rightarrow\; k > \frac{\log\frac{2n}{\alpha}}{2\delta^{2}}.$$

A.6 Proof of Lemma 4.2

\begin{align*}
P\{S(Y)\geq S(y)\,|\,X=x\} &= P\{S(Y)\geq S(y)\,|\,S(X)=S(x)\}\\
&\leq P\{S(Y)\geq S(y)\,|\,S(X)\geq S(x)\}\\
&\leq \frac{P\{S(Y)\geq S(y)\}}{P\{S(X)\geq S(x)\}} = \frac{e^{-S(y)}}{e^{-S(x)}}.
\end{align*}

The first equality uses injectivity, the second step uses monotonicity (see Subsection 4.1), and the third follows from $P(A\,|\,B)\leq P(A)/P(B)$ for any two events $A,B$.

A.7 Proof of Lemma 4.3

Assume without loss of generality that $X_1,\dots,X_n$ is a topological ordering of the DAG. Then $S(X_i\,|\,X_1,\dots,X_{i-1}) = S(X_i\,|\,Pa_i)$ holds due to the local Markov condition. Since $S(X_i\,|\,Pa_i=pa_i)$ has density $e^{-s}$ for all $pa_i$, also $S(X_i\,|\,X_1=x_1,\dots,X_{i-1}=x_{i-1})$ has density $e^{-s}$ for all $x_1,\dots,x_{i-1}$. Hence, $S(X_i\,|\,Pa_i)$ is independent of $X_1,\dots,X_{i-1}$.

A.8 Proof of Lemma 4.4

The proof follows from a minor modification of the proof of Lemma 1 in [45], replacing $n$ with $n-1$.

A.9 Proof of Theorem 4.5

The theorem is a direct result of Lemma 4.3 and Lemma 4.4.

A.10 Proof of Theorem 4.6

\begin{align*}
P(E_i\,|\,Pa_i=pa_i) &\leq P(E_i\,|\,E_{Pa_i^1},\dots,E_{Pa_i^k})\\
&\leq \frac{P(E_i)}{P(E_{Pa_i^1},\dots,E_{Pa_i^k})}\\
&\leq \frac{P(E_i)}{P(E_{Pa_i^1})\cdots P(E_{Pa_i^k})}.
\end{align*}

Recalling that anomaly scores are non-negative, we thus obtain:

$$S(x_i\,|\,pa_i) \geq \left|S(x_i) - \sum_{l=1}^{k} S(pa^{l}_{i})\right|_{+},$$

with the notation $|a|_{+} := (a+|a|)/2$. Hence, we obtain a lower bound for $s_{cum}$:

$$s_{cum} \geq \sum_{i\neq j}\left|S(x_i)-\sum_{l}S(pa^{l}_{i})\right|_{+} =: \hat{s}_{cum}. \tag{22}$$

Appendix B Experimental details and further experiments

B.1 Experimental details

To generate an SCM for the experiments in Section 5 (see Fig. 1), we first uniformly sample between 10 and 20 root nodes (20% to 40% of the total nodes of the graph) and uniformly assign to each either a standard Gaussian, a uniform, or a mixture of Gaussians as its noise distribution. As a second step, we recursively sample non-root nodes. Non-root nodes need not be sink nodes. The number of parent nodes that each non-root node is conditioned on is randomly chosen following a distribution that assigns lower probability to a higher number of parents. In total, the causal graph is composed of 50 nodes. The parametric forms of the structural equations are randomly assigned to be either a simple feed-forward neural network with probability 0.8 (to account for non-linear models) or a linear model otherwise. The feed-forward neural network has three layers (input layer, hidden layer, and output layer), where the hidden layer has a number of nodes chosen randomly between 2 and 100. All the parameters of the neural network are sampled from a uniform distribution between -5 and 5. For the linear model, we sample the coefficients from a uniform distribution between -1 and 1 and set the intercept to 0. In both cases, the noise enters additively.

To generate data for the non-anomalous regime, we sample the noise of each variable and propagate it forward through the previously sampled structural equations. As mentioned in the main text, to produce anomalous data we choose a root cause uniformly at random from the list of all nodes and a target node from the set of its descendants (including itself). We then sample the noise of all variables, shift the noise of the root cause by $x$ standard deviations with $x \in \{2, 2.1, \dots, 3\}$, and propagate the noises through the SCM to obtain a realization of each variable. We repeat this process 100 times for each value of $x$ and count the algorithm as successful if its output coincides with the chosen root cause.
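Continuing the sketch above, normal and anomalous samples could be drawn as follows; the function `ancestral_sample` and the way the noise standard deviation is estimated are again illustrative assumptions.

```python
def ancestral_sample(noise, parents, mech, n=1, shift=None):
    """Draw n joint samples by propagating noise through the SCM in
    topological order. `shift = (node, x)` adds x standard deviations
    of that node's noise to its sampled noise (the anomaly injection)."""
    values = {}
    for i in range(N_NODES):
        eps = noise[i](n)
        if shift is not None and shift[0] == i:
            sd = noise[i](10_000).std()  # estimate the noise std
            eps = eps + shift[1] * sd
        if parents[i]:
            X = np.column_stack([values[p] for p in parents[i]])
            values[i] = mech[i](X) + eps  # additive noise
        else:
            values[i] = eps
    return values

noise, parents, mech = make_scm()
normal_data = ancestral_sample(noise, parents, mech, n=1000)
# Anomalous regime: shift the root cause's noise by x = 3 std deviations
# (the evaluation target would be drawn from the root cause's descendants).
root_cause = int(rng.integers(N_NODES))
anomalous_point = ancestral_sample(noise, parents, mech, n=1,
                                   shift=(root_cause, 3.0))
```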


B.2 Further experiments by varying the number of nodes in the graph

We also run experiments fixing the number of added standard deviations ($x$ in Section 5) to 3 and varying the number of nodes in the SCM over $\{20, 40, 60, 80, 100\}$. We see in Fig. 3 that the performance of Traversal, SMOOTH TRAVERSAL, and Counterfactual changes little across graph sizes, whereas the quality of TOCA and SCORE ORDERING decreases slightly for larger graphs. Circa, on the other hand, performs worst for an intermediate number of nodes, with its quality increasing again as the graph gets larger.

[Fig. 3: performance of all methods as the number of nodes in the SCM varies.]

Fig. 4 shows the runtime of all algorithms for an SCM with 100 variables (and added standard deviation 3). The only qualitative change compared to Fig. 2 is that SCORE ORDERING is slightly faster than TOCA.

[Fig. 4: runtime of all algorithms for an SCM with 100 variables.]

B.3 PetShop dataset

[29] introduce a dataset specifically designed for evaluating root cause analysis in microservice-based applications. The dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. In addition to the approaches evaluated by the authors, reproduced below, we evaluate our algorithms in terms of both top-1 recall (Table 1) and top-3 recall (Table 2); a sketch of the top-$k$ recall metric follows the tables.

Table 1: Top-1 recall on the PetShop dataset. Traversal, circa, and counterfactual require the causal graph; ε-diagnosis, rcd, and correlation do not; score ordering, TOCA, and smooth traversal are the methods proposed in this paper.

| traffic scenario | metric | traversal | circa | counterfactual | ε-diagnosis | rcd | correlation | score ordering | TOCA | smooth traversal |
|---|---|---|---|---|---|---|---|---|---|---|
| low | latency | 0.57 | 0.36 | 0.36 | 0.00 | 0.07 | 0.43 | 0.29 | 0.00 | 0.21 |
| low | availability | 0.50 | 0.42 | 0.00 | 0.00 | 0.58 | 0.75 | 0.42 | 0.00 | 0.33 |
| high | latency | 0.57 | 0.50 | 0.57 | 0.00 | 0.00 | 0.64 | 0.14 | 0.00 | 0.14 |
| high | availability | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.83 | 0.17 | 0.00 | 0.00 |
| temporal | latency | 1.00 | 0.75 | 0.38 | 0.12 | 0.25 | 0.62 | 0.25 | 0.00 | 0.25 |
| temporal | availability | 0.38 | 0.38 | 0.00 | 0.00 | 0.50 | 0.62 | 0.38 | 0.00 | 0.13 |

Table 2: Top-3 recall on the PetShop dataset, with the same column grouping as Table 1.

| traffic scenario | metric | traversal | circa | counterfactual | ε-diagnosis | rcd | correlation | score ordering | TOCA | smooth traversal |
|---|---|---|---|---|---|---|---|---|---|---|
| low | latency | 0.57 | 0.86 | 0.71 | 0.00 | 0.21 | 0.57 | 0.71 | 0.21 | 0.36 |
| low | availability | 1.00 | 1.00 | 0.42 | 0.00 | 0.75 | 0.92 | 0.92 | 0.00 | 0.67 |
| high | latency | 0.79 | 1.00 | 0.86 | 0.00 | 0.07 | 0.79 | 0.57 | 0.00 | 0.57 |
| high | availability | 1.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.92 | 0.17 | 0.00 | 0.67 |
| temporal | latency | 1.00 | 1.00 | 0.50 | 0.12 | 0.75 | 0.75 | 0.75 | 0.00 | 0.75 |
| temporal | availability | 1.00 | 1.00 | 0.25 | 1.00 | 0.12 | 0.75 | 0.75 | 0.00 | 0.50 |
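As referenced above, the following is a minimal sketch of the top-$k$ recall metric used in Tables 1 and 2, under our reading of the PetShop evaluation: an issue counts as a hit if the injected root cause appears among a method's $k$ highest-ranked nodes. The service names in the usage example are hypothetical.

```python
def top_k_recall(rankings, true_roots, k):
    """Fraction of issues whose injected root cause appears among the
    top-k nodes ranked by the method.

    rankings   : list of node lists, each sorted from most to least suspect
    true_roots : list of ground-truth root-cause nodes, same order
    """
    hits = sum(true in ranked[:k]
               for ranked, true in zip(rankings, true_roots))
    return hits / len(true_roots)

# Usage with three hypothetical issues and their method rankings:
rankings = [["db", "web", "cache"], ["auth", "db", "web"], ["web", "db", "auth"]]
true_roots = ["db", "db", "web"]
print(top_k_recall(rankings, true_roots, 1))  # 0.666... (top-1 recall)
print(top_k_recall(rankings, true_roots, 3))  # 1.0     (top-3 recall)
```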
