New Opportunities in Crowd-Sourced Monitoring and Non-government Data Mining for Developing Urban Air Quality Models in the US

Ambient air pollution is among the top 10 health risk factors in the US. With increasing concerns about adverse health effects of ambient air pollution among stakeholders including environmental scientists, health professionals, urban planners and community residents, improving air quality is a crucial goal for developing healthy communities. The US Environmental Protection Agency (EPA) aims to reduce air pollution by regulating emissions and continuously monitoring air pollution levels. Local communities also benefit from crowd-sourced monitoring to measure air pollution, particularly with the help of rapidly developing low-cost sampling technologies. The shift from relying only on government-based regulatory monitoring to crowd-sourced efforts has provided new opportunities for air quality data. In addition, the fast-growing data sciences (e.g., data mining) allow for leveraging open data from different sources to improve air pollution exposure assessment. My dissertation investigates how new data sources of air quality (e.g., community-based monitoring, low-cost sensor platforms) and model predictor variables (e.g., non-government open data), combined with emerging modeling approaches (e.g., machine learning [ML]), could be used to improve air quality models (i.e., land use regression [LUR]) at local, regional, and national levels for refined exposure assessment.

LUR models are commonly used for predicting air pollution concentrations at locations without monitoring data based on neighboring land use and geographic variables. I explore the use of crowd-sourced low-cost monitoring data, new/open datasets from government and non-government sponsored platforms, and emerging modeling techniques to develop LUR models in the US. I focus on testing whether: (1) air quality data from community-based monitoring can feasibly be used to develop LUR models, (2) air quality data from non-government crowd-sourced low-cost sensor platforms could supplement regulatory monitors for LUR development, and (3) new/open data extracted from non-government sponsored platforms could serve as alternative datasets to traditional predictor variable sources (e.g., land use and geographic features) in LUR models.
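
As a rough illustration of the LUR idea described above, the following minimal sketch fits an ordinary least-squares regression of measured concentrations on land use predictors at monitored sites, then predicts the concentration at an unmonitored site. The predictor names (road length, built-area fraction), the synthetic numbers, and the helper names `fit_lur` and `predict_lur` are assumptions made for illustration; they are not the dissertation's variables, data, or code.

```python
import numpy as np

def fit_lur(X, y):
    """Fit an ordinary least-squares model; returns [intercept, slope_1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_lur(coef, X):
    """Predict concentrations at sites described by predictor matrix X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

# Synthetic (made-up) monitoring sites; all values are hypothetical.
rng = np.random.default_rng(0)
n_sites = 50
road_length_km = rng.uniform(0, 10, n_sites)   # hypothetical predictor
built_fraction = rng.uniform(0, 1, n_sites)    # hypothetical predictor
X = np.column_stack([road_length_km, built_fraction])

# Hypothetical "true" relationship used only to generate the toy data.
true_coef = np.array([5.0, 0.8, 3.0])
y = true_coef[0] + X @ true_coef[1:] + rng.normal(0, 0.1, n_sites)

coef = fit_lur(X, y)

# Predict at a location without a monitor, using only its land use values.
new_site = np.array([[4.0, 0.5]])
prediction = predict_lur(coef, new_site)
print(coef, prediction)
```

In practice, LUR predictors are typically computed within buffers around each site and screened by a variable selection procedure; this sketch skips both steps.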

In Chapter 3, I developed LUR models using community-based sampling (n = 50) for 60 volatile organic compounds (VOC) in the city of Minneapolis, US. I assessed whether adding area source-related features improves LUR model performance and compared model performance using variables featuring area sources from government vs. non-government sponsored platforms. I developed three sets of models: (1) base-case models with land use and transportation variables, (2) base-case models adding area source variables from local business permit data (government sponsored platform), and (3) base-case models adding Google point of interest (POI) data for area sources. Models with Google POI data performed the best; for example, the total VOC (TVOC) model had better goodness-of-fit (adj-R2: 0.56; Root Mean Square Error [RMSE]: 0.32 µg/m3) as compared to the permit data model (0.42; 0.37) and the base-case model (0.26; 0.41). This work suggests that VOC LUR models can be developed using community-based samples and adding Google POI could improve model performance as compared to using local business permit data.
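
For readers unfamiliar with the goodness-of-fit statistics quoted above, here is a minimal sketch of how adjusted R2 and RMSE are commonly computed from observed and predicted values. The function names and the toy numbers are hypothetical, and the abstract does not state exactly how the reported statistics were calculated.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def adjusted_r2(y_true, y_pred, n_predictors):
    """R2 penalized for the number of predictors (intercept not counted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1))

# Toy values (made up) for a model with one predictor.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
toy_rmse = rmse(observed, predicted)
toy_adj_r2 = adjusted_r2(observed, predicted, n_predictors=1)
print(toy_rmse, toy_adj_r2)
```

Because adjusted R2 penalizes extra predictors, it is a fairer basis than plain R2 for comparing models that differ in the number of variables, such as a base-case model and one with added area source variables.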

In Chapter 4, I evaluated a national LUR model using annual average PM2.5 concentrations from low-cost sensors (i.e., PurpleAir platform) in 6 US urban areas (n = 149) and tested the feasibility of using low-cost sensor data for developing LUR models. I compared LUR models using only the PurpleAir sensors vs. hybrid LUR models (combining both the EPA regulatory monitors and the PurpleAir sensors). I found that the low-cost sensor network could serve as a promising alternative to fill the gaps of existing regulatory networks. For example, the national regulatory monitor-based LUR (i.e., CACES LUR developed as part of the Center for Air, Climate, and Energy Solutions) may fail to capture locations with high PM2.5 concentrations and the within-city spatial variability. Developing LUR models using the PurpleAir sensors was reasonable (PurpleAir sensors only: 10-fold CV R2 = 0.66, MAE = 2.01 µg/m3; PurpleAir and regulatory monitors: R2 = 0.85, MAE = 1.02 µg/m3). I also observed that incorporating PurpleAir sensor data into LUR models could help capture within-city variability, and that areas of disagreement with the regulatory monitors merit further investigation. This work suggests that the use of crowd-sourced low-cost sensor networks for LUR models could potentially help exposure assessment and inform environmental and health policies, particularly for places (e.g., developing countries) where regulatory monitoring networks are limited.
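
The 10-fold cross-validation (CV) statistics quoted above can be understood through a minimal sketch like the following, which holds out each tenth of the sites in turn, refits a simple least-squares model on the rest, and scores the pooled held-out predictions with R2 and MAE. The data are synthetic, the function name `kfold_cv_metrics` is hypothetical, and the dissertation's actual models and CV procedure may differ in detail.

```python
import numpy as np

def kfold_cv_metrics(X, y, k=10, seed=0):
    """Random k-fold CV for an OLS model; returns (R2, MAE) on held-out predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    y_hat = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out of the fit
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        y_hat[test_idx] = A_test @ coef  # predict only for held-out sites
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(y - y_hat))
    return float(r2), float(mae)

# Synthetic (made-up) sites with two hypothetical predictors.
rng = np.random.default_rng(1)
n_sites = 100
X = np.column_stack([rng.uniform(0, 5, n_sites), rng.uniform(0, 1, n_sites)])
y = 8.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, n_sites)

cv_r2, cv_mae = kfold_cv_metrics(X, y, k=10)
print(cv_r2, cv_mae)
```

Because every prediction comes from a model that never saw that site, CV statistics give a more honest picture of performance at unmonitored locations than in-sample fit statistics do.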

In Chapter 5, I developed national LUR models to predict annual average concentrations of 6 criteria pollutants (NO2, PM2.5, O3, CO, SO2 and PM10) in the US to compare models using new data (Google POI, Google Street View [GSV] and Local Climate Zone [LCZ]) vs. traditional geographic variables (e.g., road lengths, area of built land) based on different modeling approaches (partial least squares [PLS], stepwise regression and machine learning [ML] with and without kriging effect). Model performance was similar for both variable scenarios (e.g., random 10-fold CV R2 of ML-kriging models for NO2, new vs. traditional: 0.89 vs. 0.91); however, adding the new variables to the traditional LUR models did not necessarily improve model performance. Models with kriging effect outperformed those without (e.g., CV R2 for PM2.5 using the new variables, ML-kriging vs. ML: 0.83 vs. 0.67). The importance of the new variables to LUR models highlights their potential to substitute for traditional variables, thus enabling LUR models for areas with limited or no data (e.g., developing countries) and across cities.

The dissertation presents the integration of new/open data from non-government sponsored platforms and crowd-sourced low-cost sensor networks in LUR models based on different modeling approaches for predicting ambient air pollution. The analyses provide evidence that using new data sources of both air quality and predictor variables could serve as promising strategies to improve LUR models for tracking exposures more accurately. The results could inform environmental scientists, health policy makers, as well as urban planners interested in promoting healthy communities.

Doctor of Philosophy

According to the US Centers for Disease Control and Prevention (CDC), a healthy community aims at preventing disease, reducing health gaps, and creating more accessible options for a wider population. Outdoor air pollution has been shown to cause a wide range of diseases (e.g., cardiovascular diseases, respiratory diseases, diabetes and adverse birth outcomes), ranking among the top 10 health risks in the US. Thus, improving understanding of ambient air quality is one of the common goals among environmental scientists, urban planners, health professionals, and local residents for achieving healthy communities.

To understand air pollution exposures in different areas, the US Environmental Protection Agency (EPA) operates regulatory monitors that measure outdoor air pollution across the country. For locations without these regulatory monitors, land use regression (LUR) models (one type of air quality model) are commonly employed to make predictions. Usually, information such as the number of people, the location of bus stops, and the types of roads is shared online on government websites. These datasets are often used as significant predictor variables for developing LUR models. Questions remain on whether new air quality data and alternative land use data from non-government sources could improve air quality modeling. In recent years, local communities have been actively involved in air pollution monitoring using rapidly developing low-cost sensors and sampling campaigns with the help of local residents. In the meantime, advances in data sciences make open data much easier to acquire and use, particularly from non-government sponsored platforms. My dissertation aims to explore the use of new data sources, including community-based low-cost monitoring data and open datasets from non-government websites, in LUR models based on emerging modeling techniques (e.g., machine learning) to predict air pollution levels in the US.

I first built LUR models for volatile organic compounds (VOC: organic chemicals with a high vapor pressure at room temperature [e.g., benzene]) based on community-based sampling data in the City of Minneapolis, US. I added information on the number of neighboring gas stations, dry cleaners, paint booths, and auto shops from both the local government and Google into the model and compared the model performance for both data sources (Chapter 3). Then, I used PM2.5 data from a non-government website (PurpleAir low-cost sensors) for 6 US cities to evaluate an existing air quality model that used air quality data from government websites. I further developed LUR models using the PurpleAir PM2.5 data to see whether this non-government source of low-cost sensor data could be as reasonable as the government data for LUR model development. Finally, I extracted new/open data from non-government sponsored platforms (e.g., Google products and local climate zone [LCZ: a map that describes the development patterns of land, such as high-rise vs. low-rise or trees vs. sands]) in the US to investigate whether these data sources can be used as alternatives to the land use and geographic data often used in national LUR model development.

I found that: (1) adding information (e.g., number of neighboring gas stations) from non-government sponsored sources (e.g., Google) could improve air quality model performance for VOCs; (2) combining non-government low-cost PM2.5 sensor data with government regulatory monitoring data to develop LUR models could improve model performance and offer more insight into air pollution exposure; and (3) new/open data from non-government sponsored platforms could be used to replace the land use and geographic data previously obtained from government websites for air quality models. These findings suggest that new sources of air quality data and street-level land use characteristics could serve as alternative data sources for developing better air quality models to promote healthy communities.

Hazardous air pollutants

volunteer-based monitoring

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/106545
Date	15 May 2020
Creators	Lu, Tianjun
Contributors	Public Administration/Public Affairs, Hankey, Steven C., Zhang, Wenwen, Marr, Linsey C., Sforza, Peter M.
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Dissertation
Format	ETD, application/pdf, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/
