Can we measure what we cannot observe?

It is difficult to think of any data-driven discipline that does not encounter data that may exist but were not detected. Applied to environmental data, the concept is often referred to as the presence of undetectables or nondetects, whether at the low end or the high end of a calibration curve. In astrophysics, where such data have been studied for about a century, the anomaly is called Malmquist bias.

Zack Florence
Dec 11, 2023

Background and concepts

Analytically monitoring environmental pollutants to assess our air, water, soil, foods and many other media is ever more important. Our global community has expanded the boundaries of cities and consistently added newly synthesized chemicals to our environment. As well, while our lives usually occupy a fixed time and space, we are challenged to identify and measure pollutants that may be of local, regional or distant origin, i.e. long-distance dispersal. Regulatory agencies have created monitoring networks and set safety guidelines.

When pollutants fall beyond the range of an analytical method, whether above it (the right tail of the distribution) or below it (the left tail), a “detection limit (DL)” has been exceeded; some literature refers to such observations as “censored data”. At the lower bound, the quantity of a pollutant is assumed to lie between 0 and the DL; data exceeding the upper bound are ≥ DL, upward to what might be considered the maximum possible. Whether designing a monitoring network or a manufacturing process, the factors introducing variability will include the analytical equipment available, the analyst who must calibrate that equipment, the detection limit(s), and the lower and upper limits of quantitation. So here is the essential question: at what quantity do the data become inaccurate and the uncertainties increase (below)?

Source: https://www.researchgate.net/publication/221925958_Strategies_and_Challenges_in_Measuring_Protein_Abundance_Using_Stable_Isotope_Labeling_and_Tandem_Mass_Spectrometry/figures?lo=1

Left-tailed data: “… a detection limit is a threshold below which measured values are not considered significantly different from a ‘blank’ (distilled water or a defined solute: author’s note) signal, at a specified level of probability” (Helsel, pp. 9–22). Right-tailed data are more common in medical and industrial settings. Right-tailed limits of detection, for example, could be time to death, time to observing a significant effect caused by a drug, strength testing of construction cement, or non-exceedance bounds on municipal water-quality analytes, e.g. coliform counts, heavy metals, and more.
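To make the left-tail definition concrete, here is a minimal sketch of one common convention for estimating a detection limit from replicate blank measurements: the mean blank signal plus three standard deviations. The blank readings below are hypothetical, and the exact convention (and multiplier) varies by method and agency.

```python
# Estimate a detection limit as mean(blank) + 3*SD(blank); the replicate
# blank readings below are illustrative values, not real laboratory data.
import numpy as np

blanks = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12])  # blank signals
DL = blanks.mean() + 3 * blanks.std(ddof=1)                     # sample SD
print(f"estimated detection limit: {DL:.3f}")
```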

The references at the end of this essay are all good sources of theory and applications.

Sampling designs

Gathering and analysing data are expensive; how many samples to take, where and when to sample, and maintaining a chain of custody are a few of the factors built into a monitoring design.

  1. The source(s) of contamination and the layout of the physical landscape (media: air, soil, water, sewage) are important. The project manager must decide whether the layout should be a grid (square or hexagonal, possibly circular), a transect or some other configuration, and the size of the area to be sampled. “Hotspots”, within and among locations, may be statistically important. Identifying nondetects can help define the boundaries of contamination as concentrations decrease or increase. Sample sizes and the distributions of the data will determine the types of statistical methods used, e.g. parametric or nonparametric.
  2. Sampling air quality (the airshed) will depend a lot on sampling at the point of emissions, the directional ordinates associated with prevailing winds, and the distance of drift. An example below shows the decline in NO2 (nitrogen dioxide) concentrations with distance from vehicle emissions on a motorway (freeway) in Shanghai. Accuracy and precision depend upon the season and the distance: more nondetects may be observed as distance from the source increases and across seasons, winter having the highest concentrations and summer the lowest. Concentrations are expected to decline with distance following a power law (see the fitting sketch after this list).
Reference: Zou et al. (2006); see References.

  3. Lakes, rivers and streams (the watershed) may have point sources that need sampling at the effluent source; lakes can be subject to runoff from terrestrial drainage, where agricultural chemicals and livestock wastes enter the aquatic food chain. Municipal drinking water from rivers, streams and reservoirs will require sampling based upon surveying the landscape, which can be quite variable. In the author’s experience there are often multiple municipalities on one water source, so each city will require readings above its intake and where water enters the distribution system following treatment at the municipal water (sewage) treatment plant(s); regulations also require sampling downstream of the city. In Canada, the provinces and the federal government are responsible for regulating water quality where rivers and streams cross provincial boundaries; that means a monitoring station on each side of the border.

  4. Groundwater (aquifers) offers its own sampling and regulatory challenges: depth to the aquifer, its geographic area, subsurface flow, whether it is static or dynamic and, very important, the sources of potential contamination. An aquifer may cross more than one jurisdiction. The map below shows the Ogallala Aquifer (which the author depended upon as a young person), which sustains agriculture, energy production, humans and livestock across several states of the central Great Plains of the United States. Contaminants may stem from agricultural chemicals, livestock, oil and gas operations and municipal sewage systems, to mention only some of the most obvious sources.

Source: https://en.wikipedia.org/wiki/Ogallala_Aquifer
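To make item 2 concrete, here is a minimal sketch of fitting a power-law decay of concentration with distance. The synthetic readings, the simple C(d) = a·d^(−b) form, and the detection limit are illustrative assumptions; they are not the Shanghai data or the shifted dispersion model of Zou et al. (2006).

```python
# Fit C(d) = a * d**(-b) to concentration-vs-distance data and flag the
# distances where predictions fall below an assumed detection limit.
import numpy as np
from scipy.optimize import curve_fit

def power_decay(d, a, b):
    return a * d ** (-b)

distance = np.array([10, 25, 50, 100, 200, 400])       # metres from the motorway
conc = np.array([80.0, 52.0, 34.0, 21.0, 13.0, 8.5])   # NO2, arbitrary units

(a_hat, b_hat), _ = curve_fit(power_decay, distance, conc, p0=[100.0, 0.5])
print(f"fitted decay exponent b = {b_hat:.2f}")

DL = 10.0                                              # hypothetical detection limit
predicted = power_decay(distance, a_hat, b_hat)
print("nondetects likely beyond:", distance[predicted < DL], "m")
```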

Theory and methods for analysing environmental data

Refer to Gilbert (1987) and Helsel (2005); see also Shoari and Dube (2018) for a helpful review of the literature and methods.

There are “messy data”, and then there are environmental monitoring data (author’s opinion)! Environmental assessment data must be treated with the same care given any data in which the public (our biosphere) may be at risk when observed values are too high or too low: examples are at this link describing safety guidelines and target values for air quality health across Canada; the US Environmental Protection Agency (EPA) is responsible for guidelines among the 50 states (and territories).

Most often a normal (Gaussian) distribution is assumed: the common objective will be to estimate a mean concentration, and its precision, relative to a regulatory guideline. Can the effect of a pollutant that violates the regulatory limit be detected? Skewed data will bias the parametric estimates. Data not meeting that assumption, or not fitting some other parametric distribution, are today more often analyzed using nonparametric, i.e. distribution-free, methods.
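As a minimal sketch of that choice, the snippet below checks normality and then tests a hypothetical sample against a hypothetical guideline of 5 µg/L, falling back to a distribution-free test when normality is rejected; the data, the guideline and the 0.05 cutoff are all illustrative assumptions.

```python
# Choose a parametric or nonparametric one-sample test against a guideline,
# depending on a normality check; all values here are illustrative.
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 2.8, 6.9, 3.1, 2.5, 8.2, 3.0, 2.7, 4.1])  # skewed sample
guideline = 5.0

if stats.shapiro(x).pvalue > 0.05:               # normality plausible: t-test
    result = stats.ttest_1samp(x, guideline, alternative="greater")
else:                                            # distribution-free fallback
    result = stats.wilcoxon(x - guideline, alternative="greater")
print(f"p-value: {result.pvalue:.3f}")
```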

Four things to consider when analysing environmental data having nondetects (you might think of others); a sketch of the first two follows the list.

  1. Substitution, e.g. insert the midpoint between 0 and the DL (easiest but introduces bias).
  2. Maximum likelihood estimation: best applied when N≥ 25–50.
  3. Nonparametric methods: few assumptions about the data distribution and can better accommodate small data sets.
  4. Good advice: DO NOT FIX YOUR DATA, FIX YOUR APPROACH (Shoari and Dube 2018).
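Here is a minimal sketch of approaches 1 and 2 on synthetic left-censored data; the lognormal model, the DL and the sample size are illustrative assumptions, not a recommendation for any particular data set.

```python
# Compare substitution (DL/2) with maximum likelihood for a lognormal
# sample left-censored at the detection limit; all values are synthetic.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(42)
true_mu, true_sigma, DL = 1.0, 0.8, 1.5          # log-scale parameters, DL
x = rng.lognormal(true_mu, true_sigma, size=100)
detected = x >= DL                               # below-DL values are nondetects

# Approach 1: substitute DL/2 for each nondetect (easy but biased).
x_sub = np.where(detected, x, DL / 2)
print(f"substitution mean: {x_sub.mean():.2f}")

# Approach 2: MLE. Detects contribute the lognormal density; nondetects
# contribute the probability mass below the DL, P(X < DL).
def neg_loglik(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    ll_det = np.sum(stats.norm.logpdf(np.log(x[detected]), mu, sigma)
                    - np.log(x[detected]))
    ll_cens = (~detected).sum() * stats.norm.logcdf(np.log(DL), mu, sigma)
    return -(ll_det + ll_cens)

mu_hat, sigma_hat = optimize.minimize(neg_loglik, x0=[0.0, 1.0],
                                      method="Nelder-Mead").x
print(f"MLE mean:  {np.exp(mu_hat + sigma_hat**2 / 2):.2f}")  # lognormal mean
print(f"true mean: {np.exp(true_mu + true_sigma**2 / 2):.2f}")
```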

Sampling for other than estimates of central tendency and variability could include tests for trends, quantiles, or percent nondetects. Undetected data contain lots of information, e.g. detection is impossible because a pollutant concentration is very low, or a tumour is not identified due to its small size and/or anomalous density. Values that are too high can point to an upset in a chemical or manufacturing process requiring rapid correction. Our highest risk is that of a “false negative”, i.e. declaring a value to be within our guidelines when, in fact, it is not. Such a mistake could jeopardise health and public safety, or the health of natural habitats.
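As a minimal sketch of two of those summaries, the snippet below runs a Kendall tau trend test on hypothetical annual means and computes the percent nondetects against an assumed DL; the numbers are illustrative only.

```python
# A Kendall tau trend test plus percent nondetects on hypothetical data.
import numpy as np
from scipy.stats import kendalltau

year = np.arange(2010, 2020)
conc = np.array([4.2, 3.9, 4.5, 5.1, 4.8, 5.6, 5.3, 6.0, 6.2, 6.5])  # annual means

tau, p = kendalltau(year, conc)
print(f"trend: tau = {tau:.2f}, p = {p:.3f}")

DL = 4.5                                          # assumed detection limit
print(f"percent nondetects: {100 * np.mean(conc < DL):.0f}%")
```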

Managing risks: guard against decision errors, particularly “false negatives”

What if an analyte exceeds the DL but it was not detected? What if the home Covid-19 test you used indicated no virus, but in truth, you are infected and contagious?

Refer to the decision matrix below, and consider these assumptions when data are in the lower end of the calibration curve. The null hypothesis is Ho: y ≤ DL vs. the alternative Ha: y > DL. A Type 1 error would be concluding that y > DL when in fact y ≤ DL.

Our overriding goal is to protect especially against Type 1 and Type 2 errors, shown below; the power of the test (1 minus the Type 2 error rate) is often set at 0.80 or higher.

Prepared by the author.

The power of a project is largely dependent upon the variance in the data and the size of the effect we wish to detect. Power calculations are best applied when a larger number of samples is involved (e.g., over a network), or more sophisticated sampling simulations may be used.
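A back-of-the-envelope sketch of that trade-off: for a one-sided z-test against a guideline, the sample size needed for a given power follows from the Type 1 and Type 2 risks, the standard deviation, and the smallest exceedance worth detecting. The sigma and delta below are illustrative assumptions.

```python
# n = ((z_alpha + z_beta) * sigma / delta)**2 for a one-sided z-test.
from math import ceil
from scipy.stats import norm

alpha, power = 0.05, 0.80        # Type 1 risk and desired power (1 - Type 2 risk)
sigma = 2.0                      # assumed standard deviation of the measurements
delta = 1.0                      # smallest exceedance of the guideline to detect

z_alpha, z_beta = norm.ppf(1 - alpha), norm.ppf(power)
n = ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"samples needed: {ceil(n)}")   # about 25 under these assumptions
```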

Malmquist Bias

In 2022 I wrote to an astrophysicist, Prof. Scott Hughes, with this question: do space scientists have similar sampling issues as Earth-bound scientists analyzing environmental data?

This was his reply.

“You are absolutely correct that there will be objects that were missed previously but will be seen now! In astronomy, this phenomenon of data being skewed by the “non detects” goes under the name of Malmquist Bias, and has been studied for almost 100 years. (There’s a quite decent Wikipedia article on it: https://en.wikipedia.org/wiki/Malmquist_bias) In the case of things we look at with telescopes, it tells us that data of the most distant objects in the universe will be dominated by the things that happen to be the brightest. As we improve technology, things that were previously below threshold will now be above threshold, and indeed lots of previously undetected “stuff” will pop out.” (personal communication via email).

Luminosity and distance are the sources of Malmquist Bias

The image shown below provides a quick study of the concept of Malmquist Bias. Stars with luminosity below the average for visible stars will not be seen and represent censored data. The advanced capabilities of the James Webb Space Telescope (JWST) will make it possible to observe stars that were once below the detection limit.

An example of Malmquist Bias: in a volume of space filled with stars, the stars have a range of luminosities, with some average luminosity shown by the dashed blue line. However, the more distant, less luminous stars will not be seen. The lowest luminosity that can be seen at a given distance is depicted by the red curve; any star below that curve will not be seen: the bias (author’s note). The average luminosity of only the stars that will be seen is higher, as shown by the dashed red line.
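A toy Monte Carlo makes the caption’s point numerically: with a flux threshold playing the role of the telescope’s detection limit, the mean luminosity of the detected stars is biased well above the true mean. The densities, the threshold and the units are all illustrative assumptions.

```python
# Simulate Malmquist bias: stars uniform in a sphere, lognormal luminosities,
# detected only if their inverse-square flux clears a threshold.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r = 100 * rng.random(n) ** (1 / 3)               # distances, uniform in volume
L = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # luminosities, arbitrary units

flux = L / (4 * np.pi * r**2)                    # inverse-square dimming
seen = flux > 1e-4                               # the "telescope's" threshold

print(f"true mean luminosity:   {L.mean():.2f}")
print(f"mean of detected stars: {L[seen].mean():.2f}")  # biased high
print(f"fraction detected:      {seen.mean():.1%}")
```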

Comparison of space telescopes

While the image below was published in 2021, before the James Webb Space Telescope (JWST) was commissioned in 2022, it is an excellent example of why, as in environmental monitoring laboratories, detection limits vary amongst these space observatories. Note that the JWST is designed to operate within the IR (infrared) category. Imagine the variation in distance capabilities: within our solar system, our galaxy and far beyond.

Give it some thought: you may now be able to think of examples of nondetectable data that were not included in this article. I hope you now have a better appreciation for the challenges of making informed decisions when data do not conform to common assumptions.

References

Gilbert, Richard O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold. p. 320.

Helsel, Dennis R. (2005). Nondetects and Data Analysis. John Wiley and Sons, Inc. p. 250.

Shoari, N. and J-S. Dube (2018). Environmental Toxicology and Chemistry 37(3): 643–656.

Zou, Xiaodong, et al. (2006). Shifted power-law relationship between NO2 concentration and the distance from a highway: A new dispersion model based on the wind profile model. Atmospheric Environment 40: 8068–8073.

Acknowledgements

I am grateful to Prof. Scott Hughes for his helpful comments and for introducing me to “Malmquist Bias”.

Thank you for reading my article. If you liked it, give me a “clap”. Leave comments if you wish.

Zack Florence: My knowledge is a work in progress. For more about me click on this: https://sites.google.com/view/zackflorencebiosketch/
