Table 1: Summarizing Causal and Non-Causal Invariant Representation Learning Methods
| Method | Anti-causal Structure | SCM Requirements | Imperfect Interventions | Intervention Inference | Nonparametric | High-dim Data | OOD |
|---|---|---|---|---|---|---|---|
| **Distribution/Domain-invariant Learning** | |||||||
| (C-)ADA | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Domain adaptation | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| DDAIG | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| L2A-OT | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| ERM | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| DOMAINBED | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| StableNet | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| SagNets | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| SWAD | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| FACT | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| Evaluation Protocol | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Ratatouille | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| XRM | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| FeAT | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| AIA | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| IRM | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Rex | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| CI to Spurious | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| Information Bottleneck | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| CausalDA | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| Transportable Rep | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
| **Structure-based Causal Representation Learning** | |||||||
| DISRL | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Causal Disentanglement | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| ICP | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
| ICP for nonlinear | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ |
| Active ICP | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ |
| CSG | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
| LECI | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| KCDC | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| Separation & Risk | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| **Intervention-based Causal Learning** | |||||||
| Nonparametric ICR | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| General Nonlinear Mixing | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| Weakly supervised | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| iCaRL | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| CIRL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ICRL | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
| LCA | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| ICA | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| AIT | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| ACIA | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
In Table 1, we broadly summarize causal and non-causal invariant representation learning methods.
Early non-causal methods such as (C-)ADA and DDAIG focused on domain-adaptation strategies, while more recent works such as FeAT and AIA develop more sophisticated objective functions for robustness under distribution shift. Domain adaptation, L2A-OT, ERM, DOMAINBED, StableNet, SagNets, SWAD, FACT, Evaluation Protocol, Ratatouille, XRM, IRM, Rex, CI to Spurious, Information Bottleneck, CausalDA, and Transportable Rep also fall under this category, as they aim to learn representations that are invariant across distributions or domains.
Foundational works in the causal domain model causal effects through interventions: Nonparametric ICR jointly learns encoders and intervention targets under SCM-based structure, while General Nonlinear Mixing addresses non-linear relationships in latent spaces. Weakly supervised methods, iCaRL, CIRL, LCA, and ICRL likewise leverage interventions to learn causal representations. Weak distributional invariance considers perfect single-node interventions and addresses multi-node imperfect interventions by identifying latent variables whose distributional properties remain stable. Independent Component Analysis (ICA) focuses on unsupervised identification of latent causal variables through component analysis and performs disentanglement via taxonomic distance measures and graph-based analysis. AIT builds on SCMs with explicit DAG assumptions and focuses primarily on the standard causal direction.
Methods such as DISRL and Causal Disentanglement explicitly model causal structures. Foundational works such as ICP and its nonlinear and active extensions, as well as CSG and LECI (which identifies causal subgraphs while removing spurious correlations), also incorporate causal structure, but they typically require explicit Directed Acyclic Graphs (DAGs) or focus on identifying causal subgraphs. KCDC employs kernel methods primarily for causal discovery and orientation, relying on statistical independence tests based on kernel measures. In anti-causal separation and risk-invariance methods, inputs are generated as functions of target labels and protected attributes; these methods use conventional causal modeling with DAGs and do-calculus.
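The distinction between perfect and imperfect interventions, which separates several columns of Table 1, can be made concrete with a toy simulation. The following sketch (a hypothetical two-variable linear SCM, not any specific method from the table) shows that a perfect intervention `do(Z2 = 1)` severs the dependence of `Z2` on its parent `Z1`, whereas an imperfect intervention only alters the mechanism while the parent dependence persists:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, intervention=None):
    """Toy linear SCM Z1 -> Z2, with the intervention targeting Z2."""
    z1 = rng.normal(0.0, 1.0, n)
    if intervention is None:
        # observational mechanism
        z2 = 2.0 * z1 + rng.normal(0.0, 0.1, n)
    elif intervention == "perfect":
        # do(Z2 = 1): the Z1 -> Z2 edge is severed entirely
        z2 = np.full(n, 1.0)
    elif intervention == "imperfect":
        # mechanism is altered (new weight, shift, and noise scale),
        # but Z2 still depends on its parent Z1
        z2 = 0.5 * z1 + 1.0 + rng.normal(0.0, 0.5, n)
    else:
        raise ValueError(intervention)
    return z1, z2

# Parent-child correlation: ~1 observationally, exactly 0 under a perfect
# intervention, and merely weakened under an imperfect one.
results = {}
for kind in (None, "perfect", "imperfect"):
    z1, z2 = sample_scm(10_000, kind)
    results[kind] = 0.0 if np.std(z2) == 0 else float(np.corrcoef(z1, z2)[0, 1])
print({k: round(v, 2) for k, v in results.items()})
```

This is why methods restricted to perfect interventions (e.g., the rows marked ❌ under "Imperfect Interventions") cannot rely on independence from the intervened variable's parents as an identification signal when interventions are imperfect.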
Table 2: Key Notations in Anti-Causal Representation Learning Framework
| Name | Symbol | Name | Symbol |
|---|---|---|---|
| Environment | \(e_i\) | Product sample space | \(\Omega = \Omega_{e_i} \times \Omega_{e_j}\) |
| Set of environments | \(\mathcal{E}\) | Product \(\sigma\)-algebra | \(\mathscr{H} = \mathscr{H}_{e_i} \otimes \mathscr{H}_{e_j}\) |
| Environment sample space | \(\Omega_{e_i}\) | Product probability measure | \(\mathbb{P} = \mathbb{P}_{e_i} \otimes \mathbb{P}_{e_j}\) |
| \(\sigma\)-algebra on \(\Omega_{e_i}\) | \(\mathscr{H}_{e_i}\) | Product causal kernel family | \(\mathbb{K} = \{K_S : S \in \mathscr{P}(T)\}\) |
| Probability measure | \(\mathbb{P}_{e_i}\) | Input space | \(\mathcal{X}\) |
| Causal kernel | \(K_{e_i}\) | Low-level latent space domain | \(\mathcal{D}_{\mathbb{Z}_L}\) |
| Environment causal space | \((\Omega_{e_i}, \mathscr{H}_{e_i}, \mathbb{P}_{e_i}, K_{e_i})\) | High-level latent space | \(\mathcal{D}_{Z_H}\) |
| Causal product space | \((\Omega, \mathscr{H}, \mathbb{P}, \mathbb{K})\) | Label space | \(\mathcal{Y}\) |
| Sub-\(\sigma\)-algebra | \(\mathscr{H}_S\) | Low-level representation | \(\phi_L: \mathcal{X} \rightarrow \mathcal{D}_{\mathbb{Z}_L}\) |
| Index set | \(T = T_{e_i} \cup T_{e_j}\) | High-level representation | \(\phi_H: \mathcal{D}_{\mathbb{Z}_L} \rightarrow \mathcal{D}_{Z_H}\) |
| Interventional kernel | \(K_S^{do(\mathcal{X}, \mathbb{Q})}(\omega, A)\) | Predictor | \(\mathcal{C}: \mathcal{D}_{Z_H} \rightarrow \mathcal{Y}\) |
| Intervention measure | \(\mathbb{Q}(\cdot|\cdot)\) | Full predictive model | \(f = \mathcal{C} \circ \phi_H \circ \phi_L\) |
| Marginal measure on \(\mathcal{Y}\) | \(\mu_Y\) | Loss function | \(\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+\) |
| Causal dynamic | \(\mathcal{Z}_L = \langle \mathcal{X}, \mathbb{Q}, \mathbb{K}_L\rangle\) | Environment independence reg. | \(R_1\) |
| Causal abstraction | \(\mathcal{Z}_H = \langle \mathbf{V}_H, \mathbb{K}_H \rangle\) | Causal structure alignment reg. | \(R_2\) |
| Set of low-level kernels | \(\mathbb{K}_L = \{K_S^{\mathbb{Z}_{L}}(\omega, A)\}\) | Regularization parameters | \(\lambda_1, \lambda_2\) |
| Set of high-level kernels | \(\mathbb{K}_H = \{K_S^{\mathbb{Z}_{H}}(\omega, A)\}\) | Conditional mutual information | \(I(X; E=e \mid Y)\) |
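The pipeline in Table 2 composes a full predictive model \(f = \mathcal{C} \circ \phi_H \circ \phi_L\) and trains it with the loss \(\ell\) plus the two penalties \(R_1\) and \(R_2\) weighted by \(\lambda_1, \lambda_2\). A minimal numpy sketch follows; the dimensions and linear maps are purely illustrative stand-ins (in practice \(\phi_L\), \(\phi_H\), and \(\mathcal{C}\) are learned networks, and \(R_1\), \(R_2\) are computed regularizers rather than the given scalars used here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: input -> low-level latent -> high-level latent -> logits
d_x, d_zl, d_zh, n_classes = 8, 4, 2, 3

# Fixed linear stand-ins for the learned maps in Table 2.
W_L = rng.normal(size=(d_x, d_zl))        # phi_L : X -> D_{Z_L}
W_H = rng.normal(size=(d_zl, d_zh))       # phi_H : D_{Z_L} -> D_{Z_H}
W_C = rng.normal(size=(d_zh, n_classes))  # C : D_{Z_H} -> Y

def phi_L(x): return x @ W_L
def phi_H(z): return z @ W_H
def C(z):     return z @ W_C

def f(x):
    """Full predictive model f = C o phi_H o phi_L."""
    return C(phi_H(phi_L(x)))

def cross_entropy(logits, y):
    """Loss ell: Y x Y -> R_+, here softmax cross-entropy."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def objective(x, y, R1, R2, lam1=0.1, lam2=0.1):
    """Penalized risk ell(f(x), y) + lambda1*R1 + lambda2*R2, where R1 is the
    environment-independence penalty and R2 the causal-structure alignment
    penalty (both treated as precomputed scalars in this sketch)."""
    return cross_entropy(f(x), y) + lam1 * R1 + lam2 * R2

x = rng.normal(size=(16, d_x))
y = rng.integers(0, n_classes, size=16)
total = objective(x, y, R1=0.5, R2=0.25)
```

The objective is linear in \(R_1\) and \(R_2\), so \(\lambda_1\) and \(\lambda_2\) directly trade prediction accuracy against the two invariance penalties.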
Table 3 compares the prior knowledge requirements across state-of-the-art causal representation learning methods. Our results demonstrate that despite requiring less prior information, ACIA outperforms these methods.
Table 3: Comparison of prior information requirements across causal representation learning methods
| Method | Causal Structure | SCM Knowledge | Intervention Type |
|---|---|---|---|
| ACTIR | Anti-causal \(Y \rightarrow X \leftarrow E\) | Variable roles only | Perfect only |
| CausalDA | DAG structure | Variable types | Perfect only |
| LECI | Partial connectivity | Variable relationships | Perfect only |
| **ACIA** | **Anti-causal \(Y \rightarrow X \leftarrow E\)** | **Variable roles only** | **Both perfect and imperfect** |