
The Theoretical Framework

Our framework builds a product causal space on measure-theoretic causality to handle anti-causal learning across multiple environments, and introduces the causal kernel and the interventional kernel to characterize anti-causal structures. Building on this foundation, we develop causal dynamics to learn low-level representations that extract anti-causal relationships from the raw data. On top of them, we further develop causal abstractions to learn environment-invariant high-level representations.

Foundation: Product Causal Space

Definition (Product Causal Space)

Given causal spaces $(\Omega_{e_i}, \mathscr{H}_{e_i}, \mathbb{P}_{e_i}, K_{e_i})$ and $(\Omega_{e_j}, \mathscr{H}_{e_j}, \mathbb{P}_{e_j}, K_{e_j})$ for environments $e_i$ and $e_j$, a product causal space is a tuple $(\Omega, \mathscr{H}, \mathbb{P}, \mathbb{K})$ where $\Omega = \Omega_{e_i} \times \Omega_{e_j}$ is the product sample space, $\mathscr{H} = \mathscr{H}_{e_i} \otimes \mathscr{H}_{e_j}$ is the product $\sigma$-algebra, $\mathbb{P} = \mathbb{P}_{e_i} \otimes \mathbb{P}_{e_j}$ is the product measure, and $\mathbb{K}$ is the set of causal kernels on $(\Omega, \mathscr{H})$.

This construction enables joint reasoning across environments while preserving individual causal structures. For analysis of environment subsets, we utilize sub-$\sigma$-algebras:

Definition (Sub-$\sigma$-algebra)

Given a product causal space $(\Omega, \mathscr{H}, \mathbb{P}, \mathbb{K})$, for any subset $S \subseteq T$, the sub-$\sigma$-algebra $\mathscr{H}_S$ is generated by measurable rectangles $A_i \times A_j$, where $A_i \in \mathscr{H}_{e_i}$ and $A_j \in \mathscr{H}_{e_j}$, corresponding to the events indexed by $S$.

Causal Kernel

Definition (Causal Kernel)

A causal kernel $K_S \in \mathbb{K}_S$ for index set $S \in \mathscr{P}(T)$ is a function $K_S: \Omega \times \mathscr{H} \rightarrow [0,1]$. For fixed $\omega \in \Omega$, $K_S(\omega, \cdot)$ is a probability measure on $(\Omega, \mathscr{H})$, and for fixed $A \in \mathscr{H}$, $K_S(\cdot, A)$ is $\mathscr{H}_S$-measurable, where $\mathscr{H}_S$ is the sub-$\sigma$-algebra associated with $S$.

Intuitively, $K_S(\omega, A)$ is the conditional probability of event $A$ given causal information encoded in $\omega$, restricted to environments indexed by $S$. This enables characterization of anti-causal structures:

Theorem (Anti-Causal Kernel Characterization)

For an anti-causal structure with arbitrary feature space $\mathcal{X}$, label space $\mathcal{Y}$, and environments $\mathcal{E}$, the causal kernel satisfies:

$$K_S(\omega, A) = \int_{\mathcal{Y}} P(X \in A \mid Y=y, E \in S) \, d\mu_Y(y)$$

where $\mu_Y$ is the marginal measure on $\mathcal{Y}$.

This characterization captures how labels $Y$ generate observations $X$ across environment subsets $S$, integrating over all possible label values weighted by their marginal probabilities.
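As a concrete sanity check of this characterization, the following sketch evaluates the kernel in a small discrete model. All names and probability tables are illustrative assumptions (not from the paper), and environments in $S$ are averaged uniformly, which is one simple choice of restriction to $S$:

```python
import numpy as np

# Hypothetical discrete anti-causal model: Y in {0,1} generates X in {0,1,2},
# with environment-dependent emission tables P(X | Y, E).
p_x_given_y_e = {
    # environment -> array of shape (|Y|, |X|); row y is P(X | Y=y, E=e)
    "e1": np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]]),
    "e2": np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.2, 0.6]]),
}
mu_y = np.array([0.5, 0.5])  # marginal measure on Y

def anti_causal_kernel(A, S):
    """K_S(omega, A) = sum_y P(X in A | Y=y, E in S) mu_Y(y),
    averaging uniformly over the environments in S (an assumption)."""
    p = np.mean([p_x_given_y_e[e] for e in S], axis=0)  # P(X | Y, E in S)
    p_A = p[:, sorted(A)].sum(axis=1)                   # P(X in A | Y=y)
    return float(mu_y @ p_A)

k = anti_causal_kernel(A={0, 1}, S=["e1", "e2"])  # integrate over Y
```

Evaluating the kernel on the full outcome space returns 1, as required of a probability measure for each fixed $\omega$.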

Corollary (Independence Property of Anti-Causal Kernel)

In an anti-causal structure, for any $\omega, \omega' \in \Omega$ with identical $Y$-component and for all $A \in \mathscr{H}_{\mathcal{X}}$, $B \in \mathscr{H}_Y$, $S \in \mathscr{P}(T)$:

$$K_S(\omega, \{A|B\}) = K_S(\omega', \{A|B\})$$

This independence property reveals that conditional kernels depend only on the label $Y$, not on environment-specific information in $\omega$.

Interventional Kernel

We now characterize how interventions modify causal kernels, enabling unified treatment of both perfect and imperfect interventions.

Theorem (Interventional Kernel)

Let $(\Omega, \mathscr{H}, \mathbb{P}, \mathbb{K})$ be a product causal space. For any subset $S \in \mathscr{P}(T)$ and intervention $\mathbb{Q}: \mathscr{H} \times \Omega \rightarrow [0,1]$, there exists a unique interventional kernel:

$$K_S^{do(\mathcal{X}, \mathbb{Q})}(\omega, A) = \int_{\Omega} K_S(\omega, d\omega') \mathbb{Q}(A|\omega')$$

where the integral is a Lebesgue integral w.r.t. the measure induced by $K_S(\omega, \cdot)$ on $(\Omega, \mathscr{H})$. This requires that $K_S(\omega, \cdot)$ is $\sigma$-finite for each $\omega \in \Omega$ and that $\mathbb{Q}(A|\cdot)$ is $\mathscr{H}$-measurable for each $A \in \mathscr{H}$.

Important: This construction encapsulates both intervention types: hard interventions where $\mathbb{Q}(A|\omega') = \mathbb{Q}(A)$ is constant across $\omega'$, and soft interventions where $\mathbb{Q}(A|\omega')$ varies with $\omega'$.
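In the discrete case the integral above reduces to a matrix-vector product, which makes the hard/soft distinction easy to see. The kernel matrix and intervention vectors below are illustrative assumptions:

```python
import numpy as np

# Discrete sketch of the interventional kernel: states omega' in {0, 1, 2},
# K_S(omega, .) given as a row-stochastic matrix K[omega, omega'], and the
# intervention Q(A | omega') as a vector q_A[omega'] for a fixed event A.
K = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.5, 0.5]])

def interventional_kernel(K, q_A, omega):
    """K_S^{do}(omega, A) = sum_{omega'} K_S(omega, omega') Q(A | omega')."""
    return float(K[omega] @ q_A)

# Hard intervention: Q(A | omega') is the same for every omega'.
q_hard = np.full(3, 0.4)
# Soft intervention: Q(A | omega') varies with omega'.
q_soft = np.array([0.9, 0.4, 0.1])

hard = interventional_kernel(K, q_hard, omega=0)  # constant in omega
soft = interventional_kernel(K, q_soft, omega=0)  # depends on K_S(omega, .)
```

Because the rows of $K$ sum to one, a hard intervention yields $\mathbb{Q}(A)$ for every $\omega$, while a soft intervention still reflects the causal information in $K_S(\omega, \cdot)$.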

Corollary (Interventional Kernel Invariance)

In an anti-causal structure, interventional kernels satisfy the following invariance criteria:

  1. $K_S^{do(X)}(\omega, \{Y \in B\}) = K_S(\omega, \{Y \in B\})$ for all measurable sets $B \subseteq \mathcal{Y}$, meaning intervening on $X$ does not change the distribution of $Y$.
  2. $K_S^{do(Y)}(\omega, \{X \in A\}) \neq K_S(\omega, \{X \in A\})$ for some measurable sets $A \subseteq \mathcal{X}$, meaning intervening on $Y$ changes the distribution of $X$, which is characteristic of an anti-causal relationship.
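Both criteria can be checked directly in a toy discrete model $Y \rightarrow X$. The prior and emission table below are illustrative assumptions:

```python
import numpy as np

# Discrete anti-causal model Y -> X: p_y is the prior over Y and
# p_x_given_y[y] is the emission row P(X | Y=y).
p_y = np.array([0.3, 0.7])
p_x_given_y = np.array([[0.8, 0.2],
                        [0.1, 0.9]])

# Observational marginal of X.
p_x = p_y @ p_x_given_y

# Criterion 1: do(X = x0) removes no incoming edge of Y (Y has no parents
# here), so the distribution of Y is unchanged.
p_y_do_x = p_y.copy()

# Criterion 2: do(Y = y0) replaces the prior with a point mass, so the
# distribution of X becomes the emission row of y0, which generally
# differs from the observational marginal p_x.
y0 = 0
p_x_do_y = p_x_given_y[y0]
```

Here `p_x_do_y` differs from `p_x`, while `p_y_do_x` equals `p_y`, matching the two criteria.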

Causal Dynamics (Low-Level Representation)

Building on the kernel framework, we now develop our approach to learning low-level representations. We adopt the causal dynamics perspective, which identifies latent causal relationships from observed data under distribution shifts, precisely the setting of anti-causal learning across environments.

Our low-level representation mapping $\phi_L$ implements causal dynamics by learning how labels $Y$ generate observations $X$ while preserving environment-specific information. Unlike traditional approaches that immediately pursue invariance, $\phi_L$ intentionally captures both the causal pathway ($Y \rightarrow X$) and environmental influences ($E \rightarrow X$), providing rich features for subsequent abstraction by $\phi_H$.

Theorem 3 (Causal Dynamics and its Kernel)

Let $\mathcal{X}$ be the input space with $\sigma$-algebra $\mathscr{H}_{\mathcal{X}}$. Given an intervention $\mathbb{Q}$ and measure $\mu$ on the domain $\mathcal{D}_{\mathcal{Z}_L}$, the low-level representation $\mathcal{Z}_L = \langle \mathcal{X}, \mathbb{Q}, \mathbb{K}_L \rangle$ can be constructed with kernel:

$$K_S^{\mathcal{Z}_{L}}(\omega, A) = \int_{\mathcal{D}_{\mathcal{Z}_L}} K_S^{do(\mathcal{X}, \mathbb{Q})}(\omega, A) \, d\mu(z)$$

The set of low-level causal kernels is: $\mathbb{K}_L = \{K_S^{\mathcal{Z}_{L}}(\omega, A): S \in \mathscr{P}(T), A \in \mathscr{H}_{\mathcal{X}}\}$.

The low-level representation mapping is defined as $\phi_L: \mathcal{X} \rightarrow \mathcal{Z}_L$ where $\mathcal{Z}_L$ is established in Theorem 3. We denote $\mathbf{V}_L = \{\phi_L(\mathcal{X}(\omega_j))\}$ as the resulting low-level representations for a set of samples.

Causal Abstraction (High-Level Representation)

In prior work, abstraction refers to the process of mapping complex, detailed representations to simpler ones that preserve only the relevant information. In our framework, causal abstraction specifically integrates over the domain of low-level representations to form high-level kernels that capture environment-invariant relationships. This integration serves as an information bottleneck, filtering out environment-specific features while retaining label-relevant causal features.

Theorem 4 (Causal Abstraction and its Kernel)

Let $\mathcal{X}$ be the input space and $\mathscr{H}_{\mathcal{X}}$ be its $\sigma$-algebra. Assume a measure $\mu$ on the domain of low-level representations $\mathcal{D}_{\mathcal{Z}_L}$. Then, the high-level representation $\mathcal{Z}_H = \langle \mathbf{V}_H, \mathbb{K}_H \rangle$ can be constructed with kernel:

$$K_S^{\mathcal{Z}_{H}}(\omega, A) = \int_{\mathcal{D}_{\mathcal{Z}_L}} K_S^{\mathcal{Z}_{L}}(\omega, A) \, d\mu(z)$$

The set of high-level causal kernels is: $\mathbb{K}_H = \{K_S^{\mathcal{Z}_{H}}(\omega, A): S \in \mathscr{P}(T), A \in \mathscr{H}_{\mathcal{X}}\}$.

The high-level representation mapping is defined as $\phi_H: \mathcal{Z}_L \rightarrow \mathcal{Z}_H$ with $\mathcal{Z}_H$ established in Theorem 4. We denote $\mathbf{V}_H = \{\phi_H(\phi_L(\mathcal{X}(\omega_j)))\}$ as the resulting high-level representations for a set of samples.
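The two-level pipeline can be sketched end-to-end with stand-in maps. This is a minimal illustration, not the paper's architecture: $\phi_L$ is a placeholder for a learned encoder that retains both causal and environmental features, and $\phi_H$ is a simple coordinate projection standing in for the abstraction bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_L(x):
    # Low-level map: identity here, standing in for a learned encoder that
    # retains the Y -> X pathway and the E -> X influence.
    return x

def phi_H(z_l):
    # High-level map: keep the first two coordinates and discard the
    # trailing, assumed environment-specific ones (the bottleneck role of
    # the integration in Theorem 4).
    return z_l[..., :2]

X = rng.normal(size=(5, 4))  # five samples with 4 raw features
V_L = phi_L(X)               # low-level representations V_L
V_H = phi_H(V_L)             # high-level representations V_H
```

The composition $\phi_H \circ \phi_L$ is exactly the map fed to the classifier in the ACIA objective below.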

Objective Function of ACIA

ACIA's objective function is based on the theoretical results above. Specifically, the kernel independence property motivates the environment independence regularizer $R_1$, while the intervention invariance criteria guide the design of the causal structure consistency regularizer $R_2$. The optimization realizes the causal dynamics construction (Theorem 3) for $\phi_L$ and the causal abstraction (Theorem 4) for $\phi_H$, ensuring that the learned representations satisfy the anti-causal structure.

Let $\mathcal{C}$ be a classifier and $\ell$ be a loss function. Our objective function of ACIA is defined as:

$$\min_{\mathcal{C},\phi_L,\phi_H} \max_{e_i \in \mathcal{E}} \Big[ \int_{\Omega} \ell((\mathcal{C} \circ \phi_H \circ \phi_L)(\mathcal{X}(\omega)), Y(\omega)) \, d\mathbb{P}_{e_i}(\omega) + \lambda_1 R_1 + \lambda_2 R_2 \Big]$$

where the regularizers are:

$R_1$ (Environment Independence):

$$R_1 = \sum_{e_i, e_j \in \mathcal{E}, i\neq j} \Big\| \int_{\mathcal{Y}} \int_{\Omega} \phi_H(\phi_L(\mathcal{X}(\omega))) \, d\mathbb{P}_{e_i}(\omega|y) \, d\mu_Y(y) - \int_{\mathcal{Y}} \int_{\Omega} \phi_H(\phi_L(\mathcal{X}(\omega))) \, d\mathbb{P}_{e_j}(\omega|y) \, d\mu_Y(y) \Big\|_2$$
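On a minibatch, the double integral in $R_1$ can be approximated with empirical averages. The following sketch is one such approximation (an assumption, not the paper's implementation): label-conditional mean representations per environment, weighted by the empirical $\mu_Y$, compared across environment pairs:

```python
import numpy as np

def r1(reps, envs, labels):
    """Empirical R_1: reps is an (n, d) array of phi_H(phi_L(x));
    envs and labels are length-n arrays of environment and label ids."""
    total = 0.0
    env_ids = np.unique(envs)
    y_vals, y_counts = np.unique(labels, return_counts=True)
    mu_y = y_counts / len(labels)  # empirical marginal mu_Y
    for i, ei in enumerate(env_ids):
        for ej in env_ids[i + 1:]:
            diff = np.zeros(reps.shape[1])
            for y, w in zip(y_vals, mu_y):
                m_i = reps[(envs == ei) & (labels == y)].mean(axis=0)
                m_j = reps[(envs == ej) & (labels == y)].mean(axis=0)
                diff += w * (m_i - m_j)  # inner integrals, then d mu_Y(y)
            total += np.linalg.norm(diff)  # ||.||_2 per environment pair
    return total
```

When the label-conditional representation means coincide across environments, $R_1$ vanishes, which is the independence property the regularizer enforces.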

$R_2$ (Causal Structure Consistency):

$$R_2 = \sum_{e_i \in \mathcal{E}} \Big\| \int_{\mathcal{Y}} y \, d\mathbb{P}_{e_i}(y|\phi_H(\phi_L(\mathcal{X}(\omega)))) - \int_{\mathcal{Y}} y \, K_{\{e_i\}}^{do(Y)}(\omega, dy) \Big\|_2$$

Remark 1: Minmax Formulation

The minmax formulation enforces worst-case robustness across environments. This formulation is supported by our out-of-distribution (OOD) generalization bound. Without it, the learned representations often fail to disentangle environmental factors effectively, as has been validated in prior work on invariant representation learning (IRM, REx, V-REx).
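The inner maximization is straightforward to compute: evaluate the regularized empirical risk per environment and take the worst one. A minimal sketch, with illustrative loss values and regularizer weights:

```python
import numpy as np

def acia_objective(losses_per_env, r1, r2, lam1=0.1, lam2=0.1):
    """Worst-environment objective: max over environments of the empirical
    risk, plus the regularizers. losses_per_env maps an environment id to
    an array of per-sample losses."""
    risks = {e: float(np.mean(l)) for e, l in losses_per_env.items()}
    worst = max(risks.values())          # inner max over e_i in E
    return worst + lam1 * r1 + lam2 * r2  # quantity minimized over C, phi_L, phi_H

obj = acia_objective({"e1": np.array([0.2, 0.4]),
                      "e2": np.array([0.5, 0.7])},
                     r1=0.0, r2=0.0)
```

In training, gradients would flow through the worst environment's risk (or a smooth surrogate of the max), pushing the model to improve wherever it is currently weakest.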

Remark 2: Regularizers

Our two regularizers $R_1$ and $R_2$ enforce key invariance properties essential for robust anti-causal representation learning: $R_1$ aligns the label-conditional distributions of the high-level representations across environments, and $R_2$ keeps the model's label expectations consistent with the interventional kernel under $do(Y)$.

Theoretical Performance of ACIA

We analyze the theoretical performance of ACIA through its convergence behavior, a generalization bound (in terms of sample complexity and interventional kernels in the anti-causal setting), and its environmental robustness.

Theorem (Convergence of ACIA)

If the loss function $\ell$ is convex and the regularization parameters satisfy $\lambda_1, \lambda_2 = O(1/\sqrt{n})$, where $n$ is the sample size, then the ACIA optimization problem solved via gradient descent converges, with the distance to the optimum bounded by:

$$O\left(\frac{1}{\sqrt{T}}\right) + O\left(\frac{1}{\sqrt{n}}\right)$$

after $T$ iterations.

Theorem (Anti-Causal OOD Generalization Bound)

Let $\phi_L^*, \phi_H^*$ be optimal representations in an anti-causal setting, i.e., $\phi_H^*(\phi_L^*(\mathcal{X})) \perp E \mid Y$ in all environments $\mathcal{E}$. Then, with probability at least $1-\delta$, for any testing environment $e_{\text{test}}$ with sample size $n_{\text{test}}$, the empirical loss $\hat{\mathbb{E}}_{e_{\text{test}}}[\ell(f^*)]$ under the optimal predictor $f^* = \mathcal{C} \circ \phi_H^* \circ \phi_L^*$ is bounded by:

$$\hat{\mathbb{E}}_{e_{\text{test}}}[\ell(f^*)] \leq \max_{e \in \mathcal{E}} \mathbb{E}_{e}[\ell(f^*)] + O\left(\sqrt{\frac{\log(1/\delta)}{n_{\text{test}}}}\right)$$

This guarantees that the performance on unseen test environments cannot be arbitrarily worse than the worst-case performance on training environments, with the gap controlled by the sample size.

Theorem (Environmental Robustness)

Denote by $\phi_L^*$ and $\phi_H^*$ the two-level representations learned by ACIA. Then for any new environment $e_{new}$, the distributional distance $d_{\mathcal{H}}(\mathbb{P}_{e_{new}}, \mathbb{P}_{\mathcal{E}})$ between $\mathbb{P}_{e_{new}}$ and $\mathbb{P}_{\mathcal{E}}$ over the function class $\mathcal{H}$ containing all predictors is bounded by:

$$d_{\mathcal{H}}(\mathbb{P}_{e_{new}}, \mathbb{P}_{\mathcal{E}}) \leq \delta_1 + \delta_2$$

where $\mathbb{P}_{\mathcal{E}} = \frac{1}{|\mathcal{E}|}\sum_{e \in \mathcal{E}} \mathbb{P}_e$ is the mixture distribution of the training environments $\mathcal{E}$, and $\delta_1$ and $\delta_2$ respectively measure the degree to which the following conditions are violated:

  1. The high-level representation is environment-independent: $\phi_H^*(\phi_L^*(\mathcal{X})) \perp E \mid Y$
  2. The low-level representation is invariant: $\Pr(\phi_L^*(\mathcal{X}) \mid Y)$ is constant across environments

Computational Complexity

The objective function can be solved iteratively using stochastic gradient descent with:

where $\epsilon$ is the desired precision, $\delta$ is the failure probability, and $d$ is the dimension of the representation space.