Which of the following are strategies for mitigating the effects of confounding variables?

Under a Creative Commons license

Open access

Highlights

Hierarchical confounder effects in complex datasets can bias ML models

RTG scoring identifies confounding variables in real-world biomedical data

RTG scoring can be used to guide model debiasing and inform experimental design

The bigger picture

The promise of using machine learning (ML) to extract insights from high-dimensional data is tempered by the frequent presence of confounding variables. For example, models attempting to identify biomarkers of disease can be severely biased by disease-irrelevant features, such as the physical site where an experiment is performed. While we have many tools to grapple with known confounders, we lack a general method to identify which of a set of potential confounders warrant debiasing. Here, we present a simple non-parametric statistical method called the rank-to-group (RTG) score, which identifies hierarchical confounder effects in raw data and ML-derived data embeddings. We show that RTG scoring identifies previously unreported effects of experimental design in a public dataset and uncovers cross-modal correlated variability in a multi-phenotypic biological dataset. This approach should be of general use in experiment-analysis cycles and to ensure confounder robustness in ML models.

Summary

The promise of machine learning (ML) to extract insights from high-dimensional datasets is tempered by confounding variables. It behooves scientists to determine if a model has extracted the desired information or instead fallen prey to bias. Due to features of natural phenomena and experimental design constraints, bioscience datasets are often organized in nested hierarchies that obfuscate the origins of confounding effects and render confounder amelioration methods ineffective. We propose a non-parametric statistical method called the rank-to-group (RTG) score that identifies hierarchical confounder effects in raw data and ML-derived embeddings. We show that RTG scores correctly assign the effects of hierarchical confounders when linear methods fail. In a public biomedical image dataset, we discover unreported effects of experimental design. We then use RTG scores to discover crossmodal correlated variability in a multi-phenotypic biological dataset. This approach should be generally useful in experiment-analysis cycles and to ensure confounder robustness in ML models.

Keywords:

machine learning

debiasing

confounders

hierarchical confounders

stem cell biology

experimental design

robustness

bias

Mann-Whitney U test

Data Science Maturity

DSML 3: Development/Pre-production: Data science output has been rolled out/validated across multiple domains/problems

Data and code availability

In Figure 5, this paper analyzes existing data available at https://www.rxrx.ai/. Data for Figure 6 have been deposited at Zenodo under https://doi.org/10.5281/zenodo.5893469 and are publicly available as of the date of publication. All original code has been deposited at Zenodo under https://doi.org/10.5281/zenodo.5893469 and is publicly available as of the date of publication. All original code has also been deposited in GitHub at https://github.com/herophilus/rtg_score. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Algorithm 1: RTG score

We start with (1) the items in our dataset $x_i$ for $i = 1, \dots, N$; (2) an included group variable $I$ whose confounding effect we want to estimate; and (3) possibly an excluded group variable $E$ whose confounding effects we wish to ensure are not misattributed to $I$. Using the notation introduced in the main text, we want to calculate the RTG score $\mathrm{RTG}_I$ or, if $E$ is present, the restricted RTG score $\mathrm{RTG}_{I \setminus E}$. For notational simplicity we use $\mathrm{RTG}_I$ for both scores here. Each item $x_i$ has a label $I_i$ for the included group and, optionally, when excluding another confounder, a label $E_i$ for the excluded group.

We compute $U_q^I$, the query score for item $q$, as follows:

1. Define two sets: $S_q$, consisting of the indices of the items that share the same included group label $I_q$ with item $q$, and $D_q$, consisting of the indices of the items whose included group labels do not equal $I_q$. When excluding another potential confounder, the indices of the items that share the same excluded group label $E_q$ with item $q$ are removed from $S_q$ and $D_q$.

$$S_q = \{\, i : i \neq q \text{ and } I_i = I_q \ (\text{and optionally } E_i \neq E_q) \,\}, \qquad D_q = \{\, i : i \neq q \text{ and } I_i \neq I_q \ (\text{and optionally } E_i \neq E_q) \,\}.$$

2. Then, $U_q^I$ is defined as the fraction of pairs of items with indices in sets $S_q$ and $D_q$, where the distance from the query item to the item whose index is in set $S_q$ is less than it is to the item whose index is in set $D_q$ (with half attribution in the case of equal distances):

$$U_q^I = \frac{1}{|S_q| \times |D_q|} \sum_{i \in S_q} \sum_{j \in D_q} \Big( \mathbb{1}\big[d(x_q, x_i) < d(x_q, x_j)\big] + 0.5 \times \mathbb{1}\big[d(x_q, x_i) = d(x_q, x_j)\big] \Big),$$

where $|S_q|$ and $|D_q|$ are the number of elements in the sets $S_q$ and $D_q$, and $d(x_q, x_i)$ is any chosen distance measure that is valid for comparing items of our dataset.

This definition is identical to the ROC AUC computed when (1) all the items whose indices are in the set $S_q \cup D_q$ are sorted by their distance to the query item and labeled 1 or 0 if their indices are members of $S_q$ or $D_q$, respectively, and (2) the 1s and 0s are treated as true and false positives as in standard ROC analysis. Thus we can efficiently compute $U_q^I$ from a single pass through $S_q \cup D_q$ after sorting.
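The equivalence between the pairwise definition and the sort-then-scan computation can be checked with a small sketch. The function names `query_score_pairwise` and `query_score_sorted` are hypothetical and are not taken from the authors' code; the inputs are assumed to be precomputed distances from the query item to the items in the same-group set and the different-group set.

```python
def query_score_pairwise(d_same, d_diff):
    """Pairwise definition: fraction of (same, different) pairs in which the
    same-group item is closer to the query; equal distances count 0.5."""
    total = 0.0
    for ds in d_same:
        for dd in d_diff:
            if ds < dd:
                total += 1.0
            elif ds == dd:
                total += 0.5
    return total / (len(d_same) * len(d_diff))


def query_score_sorted(d_same, d_diff):
    """Same quantity from a single pass over the distance-sorted items."""
    # Label same-group items 1 and different-group items 0, then sort by distance.
    items = sorted([(d, 1) for d in d_same] + [(d, 0) for d in d_diff])
    n_same, n_diff = len(d_same), len(d_diff)
    wins = 0.0
    diff_seen = 0  # different-group items strictly closer than the current block
    i = 0
    while i < len(items):
        # Gather the block of items tied at the current distance.
        j = i
        same_in_block = diff_in_block = 0
        while j < len(items) and items[j][0] == items[i][0]:
            if items[j][1] == 1:
                same_in_block += 1
            else:
                diff_in_block += 1
            j += 1
        # A same-group item beats every strictly farther different-group item
        # and gets half credit for each different-group item in its tie block.
        farther = n_diff - diff_seen - diff_in_block
        wins += same_in_block * (farther + 0.5 * diff_in_block)
        diff_seen += diff_in_block
        i = j
    return wins / (n_same * n_diff)
```

Both functions assume that neither list is empty. The sorted version costs one sort plus a linear scan, instead of one comparison per pair.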

The RTG score $\mathrm{RTG}_I$ is simply the average of the scores $U_q^I$ over all possible query items in the dataset (i.e., for $q = 1, \dots, N$). Of note, for some queries, $U_q^I$ may be undefined because either set $S_q$ or $D_q$ is empty. These queries are excluded from averaging.
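The full procedure can be sketched directly from the pairwise definition. This is a minimal illustration, not the authors' released implementation: the function name `rtg_score`, its parameters, and the default Euclidean distance are assumptions made for this example, and it favors readability over speed.

```python
import math


def rtg_score(items, included, excluded=None, dist=math.dist):
    """Average query score over all items whose two comparison sets are non-empty.

    items    : sequence of equal-length numeric tuples (assumed representation)
    included : label of the included group for each item
    excluded : optional label of the excluded group for each item
    dist     : distance function; Euclidean by default (an assumption)
    """
    n = len(items)
    scores = []
    for q in range(n):
        same, diff = [], []
        for i in range(n):
            if i == q:
                continue
            # Drop items that share the query's excluded-group label.
            if excluded is not None and excluded[i] == excluded[q]:
                continue
            d = dist(items[q], items[i])
            if included[i] == included[q]:
                same.append(d)
            else:
                diff.append(d)
        if not same or not diff:
            continue  # query score undefined; skip this query
        total = 0.0
        for ds in same:
            for dd in diff:
                if ds < dd:
                    total += 1.0
                elif ds == dd:
                    total += 0.5
        scores.append(total / (len(same) * len(diff)))
    # NaN signals that no query had a defined score.
    return sum(scores) / len(scores) if scores else float("nan")
```

When items cluster by their included group, every query finds its same-group items closest and the score is 1. If the included and excluded labels coincide, every same-group set is empty and the sketch returns NaN.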

How do you mitigate a confounding variable?

There are several methods you can use to decrease the impact of confounding variables on your research: restriction, matching, statistical control and randomization. In restriction, you restrict your sample by only including certain subjects that have the same values of potential confounding variables.
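As a rough illustration of restriction, the sketch below keeps only the records that share fixed values of the potential confounders. The helper `restrict` and the field names `age_group` and `smoker` are hypothetical and chosen only for this example.

```python
def restrict(records, **fixed_values):
    """Keep only records whose potential confounders equal the given values."""
    return [
        r for r in records
        if all(r[key] == value for key, value in fixed_values.items())
    ]


# Hypothetical sample; the field names are illustrative assumptions.
subjects = [
    {"id": 1, "age_group": "40-49", "smoker": False},
    {"id": 2, "age_group": "40-49", "smoker": True},
    {"id": 3, "age_group": "50-59", "smoker": False},
    {"id": 4, "age_group": "40-49", "smoker": False},
]

# Restrict the sample to non-smokers aged 40-49.
restricted = restrict(subjects, age_group="40-49", smoker=False)
```

Within the restricted sample, age group and smoking status no longer vary, so they cannot explain differences in the outcome. The cost is a smaller sample and results that apply only to the retained subgroup.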

Which of the following is used to reduce confounding variables?

Answer and explanation: randomization. Randomization is a technique that reduces the effect of confounding, since confounders cannot be assumed to be constant across subjects.

Which technique is used to control for known confounding variables?

Randomization of experiments: randomization is a technique used in experimental design to control for confounding variables that cannot (or should not) be held constant.
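A minimal sketch of random assignment, assuming two arms of equal size, might look like the following. The function name `randomize`, the arm labels, and the `seed` parameter are illustrative choices, not part of any particular protocol.

```python
import random


def randomize(subject_ids, arms=("treatment", "control"), seed=None):
    """Shuffle the subjects, then deal them out to the arms in turn."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {sid: arms[k % len(arms)] for k, sid in enumerate(shuffled)}


assignment = randomize(range(1, 21), seed=0)
```

Because the arm is decided by the shuffle rather than by any property of the subject, confounders are balanced across arms on average, whether or not they were measured.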

Which of the following is used to reduce the effects of confounding variables in experiments?

Randomization reduces the effects of confounding variables.