Statistical Methods for Big Data

My research uses large datasets that allow us to see patterns of biodiversity at the continental scale. Much of my work aims to improve our understanding of these datasets and to make better use of them in answering ecological questions.

Expanding the utility of jSDMs with conditional prediction

We developed the conditional prediction framework, which can assist management and conservation strategies by improving predictions of species abundances and providing information about the relationships between abundances of co-occurring species. Using the residual covariance matrix from a fitted joint Species Distribution Model (jSDM), conditional prediction produces a species coefficient matrix. This matrix allows co-occurring species to be treated as a second set of predictors to supplement covariates in the model and improve prediction of one or more focal species. Conditioning improves predictions by utilizing unmeasured variation in the environment that is captured in the residual covariance matrix.
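As a rough illustration of the idea, here is a minimal sketch that assumes the residuals are multivariate normal, so the species coefficient matrix takes the standard conditional-Gaussian form (covariance between focal and conditioning species, multiplied by the inverse covariance among conditioning species). The function name `conditional_predict` and its arguments are hypothetical and are not taken from any particular jSDM package; the actual framework may differ in detail.

```python
import numpy as np

def conditional_predict(mu, sigma, y_obs, focal, cond):
    """Conditional mean of focal species given observed co-occurring species.

    mu    : (S,) covariate-only predictions for all S species
    sigma : (S, S) residual covariance matrix from a fitted model
    y_obs : observed values of the conditioning species
    focal : indices of the species to predict
    cond  : indices of the species being conditioned on
    (All names here are illustrative assumptions.)
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s_fc = sigma[np.ix_(focal, cond)]   # focal-by-conditioning covariance
    s_cc = sigma[np.ix_(cond, cond)]    # covariance among conditioning species
    # Species coefficient matrix A = s_fc @ inv(s_cc), computed via a solve
    # (s_cc is symmetric, so solve(s_cc, s_fc.T).T gives the same matrix).
    coef = np.linalg.solve(s_cc, s_fc.T).T
    # Departure of the observed species from their covariate-only predictions
    resid = np.asarray(y_obs, dtype=float) - mu[cond]
    return mu[focal] + coef @ resid, coef

# Two species with residual covariance 0.5; species 1 is observed at 2.0
# when covariates alone predict 1.0, which shifts the species 0 prediction.
pred, coef = conditional_predict(
    mu=[2.0, 1.0],
    sigma=[[1.0, 0.5], [0.5, 1.0]],
    y_obs=[2.0],
    focal=[0],
    cond=[1],
)
```

In this toy case the coefficient is 0.5, so observing one unit more of species 1 than expected raises the prediction for species 0 from 2.0 to 2.5. If the residual covariance were zero, the coefficient would be zero and the prediction would be unchanged.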

We used this framework to understand which species are at risk of nest parasitism from Brown-headed Cowbirds!


Using simulated data, we found that conditioning predictions on co-observed species improves predictions. Predictions benefit more from conditioning when the species are measured on a continuous scale (e.g., biomass) than when the species are measured on a discrete scale (e.g., counts). Residual covariance between species (species occurring together more or less often than expected based on covariates in the model) is required for conditioning to improve predictions.
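The role of residual covariance can be seen in a small toy simulation. This is not the simulation from the manuscript: it is a sketch with two continuous species, one made-up covariate, hypothetical effect sizes, and known (not estimated) parameters, intended only to show that conditioning lowers prediction error when residual covariance is present and does nothing when it is absent.

```python
import numpy as np

def simulate_rmse(rho, n=5000, seed=0):
    """Return (covariate-only RMSE, conditional RMSE) for species 0.

    rho is the residual correlation between two continuous species.
    All values below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                       # one hypothetical covariate
    beta = np.array([1.0, -0.5])                 # hypothetical covariate effects
    sigma = np.array([[1.0, rho], [rho, 1.0]])   # residual covariance
    mu = np.outer(x, beta)                       # (n, 2) covariate-only predictions
    y = mu + rng.multivariate_normal(np.zeros(2), sigma, size=n)

    # Condition species 0 on the observed departure of species 1
    coef = sigma[0, 1] / sigma[1, 1]
    cond_pred = mu[:, 0] + coef * (y[:, 1] - mu[:, 1])

    def rmse(pred):
        return float(np.sqrt(np.mean((y[:, 0] - pred) ** 2)))

    return rmse(mu[:, 0]), rmse(cond_pred)

marginal, conditional = simulate_rmse(rho=0.8)   # strong residual covariance
marginal0, conditional0 = simulate_rmse(rho=0.0) # no residual covariance
```

With `rho=0.8` the conditional error is clearly lower than the covariate-only error, while with `rho=0.0` the two are identical because the coefficient on the co-observed species is zero. The sketch covers only the continuous case; it does not attempt to reproduce the comparison with discrete counts.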

This manuscript is currently in review; for more information, please email me.

Towards data integration: BBS and eBird

Two of the largest and most widely used bird abundance datasets are the Breeding Bird Survey (BBS), a structured annual survey conducted by the USGS, and eBird, a citizen science project run by the Cornell Lab of Ornithology that collects data from around the world. These two datasets differ in almost every way: eBird observations can be submitted by anyone, anywhere, anytime, whereas BBS observations are made annually by experienced volunteers at specific locations.

I analyzed reporting rates in both datasets to compare how they capture abundance of many bird species. The model accounts for differences in time and space, habitat, and effort, so we can identify differences in counts that come from protocol and observer behavior. In other words, if a BBS count and an eBird checklist were submitted at the same time and place, how different would their counts be?

More species are reported in eBird than in BBS, but counts of each species tend to be lower in eBird than BBS. In particular, common and unpopular species are reported at low rates in eBird.

The figure on the left shows how the observer effect (i.e., reporting rate in eBird compared to BBS) varies across species traits in three different regions in the US. A positive correlation means that species with that particular trait are reported more in eBird than in BBS.

This work was published in Ecological Applications:

Scher, C.L. and Clark, J.S., 2023. Species traits and observer behaviors that bias data assimilation and how to accommodate them. Ecological Applications, 33(3), p.e2815.

Peer-reviewed publications
