We evaluate ACIA on four datasets in anti-causal settings, each designed to probe a different combination of conditions: discrete vs. continuous labels and perfect vs. imperfect interventions.
Digit labels cause specific image features, with colors serving as the environment variable. This synthetic dataset tests ACIA's ability to separate digit classification from spurious color correlations.
Digit labels cause specific image features, with rotations serving as the environment variable. This dataset evaluates robustness to rotational transformations as environmental factors.
A physical simulation environment where ball positions (continuous labels) cause pixel observations, with controlled interventions affecting object dynamics. This dataset tests ACIA with continuous labels and imperfect interventions.
A real medical dataset where tumor presence (label) causes tissue patterns in pathology images, with hospital-specific staining protocols creating environmental variations. This real-world dataset validates ACIA's practical applicability.
We use four metrics to measure predictive performance and causal properties:
Fraction of test samples correctly predicted by our predictor. Higher values indicate better performance.
Measures the degree to which high-level representations remain independent of environment-specific information while preserving label-relevant information. Specifically, we compute the mutual information between high-level representations and environment labels, conditioned on the class label, weight each class-conditional term by its class frequency, and sum them. Lower values indicate better environment independence.
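The class-conditional mutual information above can be sketched as follows. This is a minimal plug-in estimator assuming the high-level representations have already been discretized (e.g. by clustering) into integer codes; the function names and the discretization step are illustrative, not part of the paper's implementation.

```python
import numpy as np

def mutual_info(a, b):
    """Plug-in mutual-information estimate (in nats) between two
    discrete integer arrays, from their empirical joint distribution."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for ai, bi in zip(a, b):
        joint[ai, bi] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0                          # skip zero cells in the log
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def environment_independence(z_codes, env, y):
    """Sum over classes of MI(Z; E | Y = c), each term weighted by the
    empirical class frequency.  Lower values mean the representation
    carries less environment-specific information."""
    total = 0.0
    for c in np.unique(y):
        mask = (y == c)
        total += mask.mean() * mutual_info(z_codes[mask], env[mask])
    return total
```

A representation that copies the environment label yields the environment's conditional entropy, while one that is constant within each class scores zero.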
Quantifies the stability of low-level representations across environments. We measure the variance of the representations across different environments and average it over feature dimensions. Lower values indicate greater invariance.
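A minimal sketch of this invariance score, assuming "variance across environments" means the variance of the per-environment mean representation (one plausible reading; the paper may aggregate differently):

```python
import numpy as np

def invariance_score(reps, env):
    """Variance of the per-environment mean representation, averaged
    over feature dimensions.  `reps` has shape (n, d); `env` has
    shape (n,).  Lower values indicate greater invariance."""
    env_means = np.stack([reps[env == e].mean(axis=0)
                          for e in np.unique(env)])   # (n_envs, d)
    return float(env_means.var(axis=0).mean())
```

A perfectly invariant encoder, whose mean representation is identical in every environment, scores exactly zero.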
Evaluates model robustness under interventions by comparing observational and interventional distributions. Specifically, we obtain the model's predictive confidence scores for original and intervened samples, then compute the KL divergence between the two distributions. Lower values indicate higher robustness.
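This robustness score can be sketched as a per-sample KL divergence between the two predictive distributions, averaged over the test set. The function name and the clipping constant are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def intervention_robustness(p_obs, p_int, eps=1e-12):
    """Mean KL divergence KL(p_obs || p_int) between class-probability
    vectors on observational vs. intervened samples.  `p_obs` and
    `p_int` have shape (n, k).  Lower values indicate higher robustness."""
    p = np.clip(p_obs, eps, 1.0)   # clip to avoid log(0)
    q = np.clip(p_int, eps, 1.0)
    return float((p * np.log(p / q)).sum(axis=1).mean())
```

When an intervention leaves the model's predictions unchanged, the score is zero; any shift in confidence makes it strictly positive.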
We compare ACIA against 10 baseline methods spanning three main categories: