Data Projects

Developing a multi-input model for bird identification using iNaturalist citizen science data

In this individual course project, I developed a joint architecture, combining a CNN with a linear predictive network, to predict the species of a bird from both an image and its metadata (date and location). I also tested whether incorporating higher-level taxonomic information into the model (the genus of each bird as well as its species) would improve performance. I used Grad-CAM to visualize which image regions drove the model's predictions: the pairs of images to the left show that the model attended more closely to relevant features of the birds once the higher-level taxonomic information was incorporated (rightmost images) than it did without that information (leftmost images). My results showed that adding image metadata and adding taxonomic information both improved model performance over an image-only input.
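One common way to give a classifier genus-level information is to train an auxiliary genus output alongside the species output and sum the two losses. The sketch below illustrates that idea in plain Python on toy scores. It is an assumption about how such a setup could look, not the project's actual code: the function names, the `genus_weight` hyperparameter and the example logits are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

def joint_loss(species_logits, genus_logits, species_target, genus_target, genus_weight=0.5):
    """Species loss plus a weighted genus loss (genus_weight is an assumed hyperparameter)."""
    species_term = cross_entropy(species_logits, species_target)
    genus_term = cross_entropy(genus_logits, genus_target)
    return species_term + genus_weight * genus_term

# Toy example: scores over 3 species and 2 genera for a single image.
species_logits = [2.0, 0.5, -1.0]
genus_logits = [1.5, -0.5]

loss_species_only = joint_loss(species_logits, genus_logits, 0, 0, genus_weight=0.0)
loss_joint = joint_loss(species_logits, genus_logits, 0, 0, genus_weight=0.5)
print(loss_species_only, loss_joint)
```

With `genus_weight=0.0` the genus term drops out and the loss reduces to ordinary species classification, which makes the two training setups easy to compare.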

Spatial Correlation of Time Series: Connecting River Rouge Power Plant Emissions with Local Air Quality 

As part of a larger team project examining US power plant emissions and air quality outcomes, I performed a case study of River Rouge, a generation facility in Michigan that was uniquely surrounded by a high number of nearby EPA air quality sensors. For each nearby sensor, I used an auto.arima function to select the best-fitting model for the daily averaged SO2 air quality data. I calculated the mean squared predictive error (MSPE) of that model using rolling cross-validation over the last 5 days of 2019, then compared it with the MSPE of a model of the same order that included the River Rouge plant's SO2 emissions as an external regressor. For most nearby sensors, adding the plant's SO2 emissions as an external regressor reduced the MSPE, while sensors more than 10 miles away showed no improvement. However, distance is clearly not the only factor determining the reduction in MSPE: among the close-by sensors there is no evident linear trend, and the magnitude of the MSPE improvement varies considerably.
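The comparison described above can be sketched as a rolling one-step-ahead evaluation. To keep the example self-contained in plain Python, it replaces the auto.arima-selected model with a fixed AR(1) regression fitted by least squares, and it runs on synthetic data rather than the EPA readings. The helper names, the use of same-day emissions as the regressor and the data-generating process are all assumptions made for illustration.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def features(y, x, t, use_exog):
    """Predictors for day t: intercept, yesterday's reading, optionally today's emissions."""
    row = [1.0, y[t - 1]]
    if use_exog:
        row.append(x[t])
    return row

def rolling_mspe(y, x, horizon=5, use_exog=False):
    """Refit on all days before each of the last `horizon` days, forecast one step ahead,
    and average the squared forecast errors."""
    errors = []
    for t in range(len(y) - horizon, len(y)):
        rows = [features(y, x, s, use_exog) for s in range(1, t)]
        beta = least_squares(rows, y[1:t])
        pred = sum(b * f for b, f in zip(beta, features(y, x, t, use_exog)))
        errors.append((y[t] - pred) ** 2)
    return sum(errors) / len(errors)

# Synthetic stand-in data: sensor readings y that partly depend on plant emissions x.
rng = random.Random(0)
n = 60
x = [rng.uniform(0.0, 3.0) for _ in range(n)]
y = [1.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t] + rng.gauss(0.0, 0.05))

mspe_base = rolling_mspe(y, x, horizon=5, use_exog=False)
mspe_exog = rolling_mspe(y, x, horizon=5, use_exog=True)
print(mspe_base, mspe_exog)
```

Because the synthetic readings are built to depend on the emissions series, the model with the regressor forecasts much better here. On real sensor data the size and sign of the change is the quantity of interest.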

Link to full analysis writeup

Clustering on multiple country-level variables to investigate worldwide renewable energy capacity trends

As part of another team project, I led clustering and neural network prediction analyses on a dataset compiled by our team, covering 20 years of country-level data on disasters, renewable energy capacity, demographics and GDP, climate, and electric power consumption and imports. To explore the relationships within this data, I ran an unsupervised k-means clustering analysis on countries' average GDP, population and disaster impacts between 2000 and 2020, using an elbow analysis to choose the number of clusters. The chart above shows the resulting clusters plotted against per capita renewable energy capacity. Through this analysis we identified "cohorts" of similar countries based on the GDP, population and disaster impact variables, and then identified high and low performers within each cluster in terms of per capita renewable energy capacity. The low performers could represent areas where renewable energy capacity could be expanded, based on the levels seen in similar countries. For example, one resulting cluster was characterized by high disaster impacts and low GDP per person. Within that cluster, Brazil and Vietnam lead in per capita renewable energy capacity, while Bangladesh and Afghanistan have the lowest levels and could be considered for targeted investments.
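The elbow analysis mentioned above amounts to running k-means for several values of k and looking for the point where the within-cluster sum of squares stops dropping sharply. The following is a minimal plain-Python sketch on toy two-dimensional points, not the project's data or code. The farthest-first initialization, the second-difference rule for picking the elbow and all names are assumptions chosen to keep the example deterministic. Real country-level features would also typically be standardized first.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centers(points, k):
    """Farthest-first initialization: deterministic, spreads the starting centers apart."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns the final centers and the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it in place if the cluster is empty.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia

# Toy data: three tight groups of four points each.
group_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

# Within-cluster sum of squares for k = 1..5.
inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}

# Pick the elbow as the k with the largest second difference in inertia.
elbow = max(range(2, 5), key=lambda k: inertias[k - 1] - 2 * inertias[k] + inertias[k + 1])
print(inertias, elbow)
```

In practice the elbow is often read off a plot of inertia against k. The second-difference rule used here is just one simple way to automate that judgment.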

Creating an interactive data visualization website to increase public knowledge around the violence faced by migrants from Central America to the United States

Together with two other graduate students, I developed interactive data visualizations using D3 and Svelte, based on a dataset provided by the World Food Programme. The visualizations are embedded within a "scrolly-telling" narrative structure, with the goal of contextualizing the data points alongside journalistic interviews with migrants.

Link to website