
Experimental Setup

Datasets and Models

We evaluate ACIA on four datasets with anti-causal structure, chosen to cover complementary aspects of the method: discrete vs. continuous labels and perfect vs. imperfect interventions.

Colored MNIST (CMNIST)

In this synthetic dataset, digit labels cause specific image features (colors), which serve as the environment variable. It tests ACIA's ability to separate digit classification from spurious color correlations.

Rotated MNIST (RMNIST)

Digit labels cause specific image features (rotations), which serve as the environment variable. This dataset evaluates robustness to rotational transformations as environmental factors.

Ball Agent

A physical simulation environment where ball positions (continuous labels) cause pixel observations, with controlled interventions affecting object dynamics. This dataset tests ACIA with continuous labels and imperfect interventions.

Camelyon17

A real medical dataset where tumor presence (label) causes tissue patterns in pathology images, with hospital-specific staining protocols creating environmental variations. This real-world dataset validates ACIA's practical applicability.

Evaluation Metrics

We use four metrics to measure predictive performance and causal properties:

1. Test Accuracy (Acc ↑)

Fraction of test samples correctly predicted by our predictor. Higher values indicate better performance.

2. Environment Independence (EI ↓)

Measures the degree to which high-level representations remain independent of environment-specific information while preserving label-relevant information. Specifically, we compute the mutual information between high-level representations and environment labels, conditioned on the class label; the per-class terms are weighted by class frequency and summed. Lower values indicate better environment independence.
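One plausible reading of this metric can be sketched as follows. The function names, the PCA-projection discretization, and the histogram-based MI estimator are illustrative assumptions, not the paper's implementation; any consistent conditional-MI estimator would fit the description above.

```python
import numpy as np

def discrete_mi(x, e):
    """Mutual information (in nats) between two arrays of discrete labels."""
    joint = np.zeros((x.max() + 1, e.max() + 1))
    for xi, ei in zip(x, e):
        joint[xi, ei] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over representation bins
    pe = joint.sum(axis=0, keepdims=True)  # marginal over environments
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ pe)[nz])).sum())

def environment_independence(z_high, env, labels, n_bins=10):
    """EI: class-frequency-weighted MI between (discretized) high-level
    representations and environment labels, conditioned on the class label.
    Lower is better."""
    # Crude discretization: bin each sample's projection onto the top
    # principal direction of the representation matrix.
    zc = z_high - z_high.mean(axis=0)
    proj = zc @ np.linalg.svd(zc, full_matrices=False)[2][0]
    edges = np.quantile(proj, np.linspace(0, 1, n_bins + 1)[1:-1])
    z_disc = np.digitize(proj, edges)
    ei = 0.0
    for y in np.unique(labels):
        mask = labels == y
        ei += mask.mean() * discrete_mi(z_disc[mask], env[mask])
    return ei
```

A representation that encodes the environment within a class yields a large conditional MI, while one that only encodes the label stays near zero (up to finite-sample bias).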

3. Low-level Invariance (LLI or $R_1$ ↓)

Quantifies the stability of low-level representations across environments. We compute the variance of the representations across environments and average it over feature dimensions. Lower values indicate greater invariance.
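A minimal sketch of this metric, under the assumption that "variance across environments" means the variance of per-environment mean representations (the function name is hypothetical and the exact aggregation in the paper may differ):

```python
import numpy as np

def low_level_invariance(z_low, env):
    """LLI (R1): variance of per-environment mean representations across
    environments, averaged over feature dimensions. Lower = more invariant."""
    env_means = np.stack([z_low[env == e].mean(axis=0) for e in np.unique(env)])
    return float(env_means.var(axis=0).mean())
```

A perfectly invariant encoder maps every environment to the same mean representation, driving this score toward zero; environment-specific shifts inflate it.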

4. Intervention Robustness (IR or $R_2$ ↓)

Evaluates model robustness under interventions by comparing observational and interventional prediction distributions. Specifically, we obtain the predictor's class-probability (confidence) scores for original and intervened samples, then compute the KL divergence between the two distributions. Lower values indicate higher robustness.
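The computation above can be sketched directly, assuming per-sample class-probability vectors from the predictor (the function name and the per-sample averaging are illustrative assumptions):

```python
import numpy as np

def intervention_robustness(p_obs, p_int, eps=1e-8):
    """IR (R2): mean KL divergence between the predictor's class-probability
    vectors on observational samples (p_obs) and on their intervened
    counterparts (p_int). Rows are samples, columns are classes.
    Lower = more robust."""
    p = np.clip(p_obs, eps, 1.0)  # clip to avoid log(0)
    q = np.clip(p_int, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```

If interventions leave the predictions unchanged the divergence is exactly zero, and any shift in confidence under intervention increases the score.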

Baseline Methods

We compare ACIA against 10 baseline methods spanning three main categories:

1. Robust Optimization Methods
2. Distribution/Domain-Invariant Learning
3. Causal Representation Learning Methods