Data Analysis

What was found using the datasets?

NYC Dataset

The NYC dataset comes from NYC Open Data, which hosts many public datasets about New York City. I chose this dataset because it let me graph exactly what I wanted: each borough and its recycling rates.

The primary analysis for this dataset comes from a column called "Capture Rate-Total ((Total Recycling - Leaves (Recycling)) / (Max Paper + Max MGP))x100." As the name spells out, it is the total recycling tonnage minus recycled leaves, divided by the maximum amount of paper plus the maximum amount of MGP (metal, glass, and plastic) that could have been recycled, multiplied by 100 to give a percentage.
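As a rough illustration of the formula in that column name, here is a small sketch. The function name, its parameter names, and the tonnages are all made up for this example and are not taken from the dataset.

```python
def capture_rate_total(total_recycling, leaves_recycling, max_paper, max_mgp):
    """Capture rate as a percentage, following the formula in the column name."""
    recyclable_max = max_paper + max_mgp
    if recyclable_max == 0:
        raise ValueError("max_paper + max_mgp must be non-zero")
    return (total_recycling - leaves_recycling) / recyclable_max * 100


# Made-up tonnages, for illustration only (not real dataset values).
rate = capture_rate_total(
    total_recycling=520.0,
    leaves_recycling=20.0,
    max_paper=600.0,
    max_mgp=400.0,
)
print(rate)
```

With these made-up numbers, 500 tons out of a possible 1,000 tons are captured.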

For this dataset, I ran my code on the CSV export from NYC Open Data. I read the file in with pandas and plotted the results with each borough on the x-axis.
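The loading step might look something like the sketch below. It is not my exact code: the rows are made up, an in-memory string stands in for the CSV export, the short name "Capture Rate" is only an example of a rename, and averaging per zone is an assumption about how the values could be summarized.

```python
import io

import pandas as pd

CAPTURE_COL = (
    "Capture Rate-Total ((Total Recycling - Leaves (Recycling)) "
    "/ (Max Paper + Max MGP))x100"
)

# Stand-in for the CSV export; every row here is made up for illustration.
csv_text = io.StringIO(
    f'Zone,"{CAPTURE_COL}"\n'
    "Bronx,40.0\n"
    "Bronx,44.0\n"
    "Manhattan,50.0\n"
    "Manhattan,54.0\n"
    "Queens,48.0\n"
)

nyc_data = pd.read_csv(csv_text)

# Shorten the long column name so it is easier to work with.
nyc_data = nyc_data.rename(columns={CAPTURE_COL: "Capture Rate"})

# Assumption: summarize each zone by its mean capture rate.
by_zone = nyc_data.groupby("Zone")["Capture Rate"].mean()
print(by_zone)
```

With a real export, the `io.StringIO` object would be replaced by the path to the downloaded CSV file.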

NJ Dataset

Texas Dataset

The Texas dataset comes from the Texas Open Data services that the state makes available to the public. This was the most difficult dataset to find: I knew I wanted Austin in my project for comparison, but the Open Data website had very few datasets on waste and recycling. The source I found lists processing facilities and their total processed municipal solid waste on page 81. These facilities are also in the Austin area, since Travis County and the surrounding counties make up the region around Austin, the pro-recycling capital. My main focus in this dataset was the "2020 Tons" column, which provided the numbers I needed to complete my comparison and analysis for this project.

Citations

Sources:

https://data.cityofnewyork.us/Environment/Recycling-Diversion-and-Capture-Rates/gaq9-z3hz

https://www.nj.gov/dep/dshw/recycling/stat_links/2018finalreport.pdf

https://www.tceq.texas.gov/downloads/permitting/waste-permits/waste-planning/docs/187-21.pdf

Code Sources:

https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-pandas-dataframe - helped with specifying Trenton and Austin sums

https://www.geeksforgeeks.org/plotting-multiple-bar-charts-using-matplotlib-in-python/ - basic plot formatting

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html - renaming NYC Dataset name

The New Jersey dataset was found on New Jersey Open Data in the Environmental section. It is a bit older, from 2018, but it still covers every town and region in New Jersey and their waste tonnage. Going in, I knew my focus would be Trenton, since a capital city makes for a better comparison with NYC.

To focus on Trenton, I also had to look at Mercer County, which is listed in the "CO" (County) column of this dataset. The PDF gives waste tonnage totals for every municipality listed, so I manually extracted the totals into a separate CSV file. My code uses these totals from the "nj_data" dataframe to output the total waste tonnage for Trenton and for Mercer County.
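A sketch of how the Trenton and Mercer County totals could be pulled out of "nj_data" is shown here. Only the "CO" column name comes from the dataset. The "Municipality" and "Tons" column names and every number below are hypothetical placeholders.

```python
import pandas as pd

# Made-up rows standing in for the manually built CSV.
# "Municipality" and "Tons" are hypothetical column names.
nj_data = pd.DataFrame(
    {
        "CO": ["Mercer", "Mercer", "Mercer", "Essex"],
        "Municipality": ["Trenton", "Hamilton", "Princeton", "Newark"],
        "Tons": [100.0, 80.0, 30.0, 250.0],
    }
)

# Pick the rows matching one column's value, then sum another column.
trenton_total = nj_data.loc[nj_data["Municipality"] == "Trenton", "Tons"].sum()
mercer_total = nj_data.loc[nj_data["CO"] == "Mercer", "Tons"].sum()

print("Trenton:", trenton_total)
print("Mercer County:", mercer_total)
```

The same pattern of filtering on one column and summing another would apply to any other city or county in the file.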

Techniques Used

The main implementation for my project was reading CSV files into pandas DataFrames to compute statistics and plot charts of the numbers in these datasets.

I used lists and for loops to get specific data out of these dataframes, the primary example being the boroughs of NYC from the NYC dataset. The for loop checks every value in the "Zone" column, and if a borough is not already in the list, it adds the borough, continuing until the column runs out of values.
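The loop described above can be sketched roughly as follows. The sample "Zone" values are made up, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up "Zone" values standing in for the NYC dataset.
nyc_data = pd.DataFrame(
    {"Zone": ["Bronx", "Brooklyn", "Bronx", "Queens", "Brooklyn"]}
)

boroughs = []
for zone in nyc_data["Zone"]:
    # Only add a borough the first time it appears.
    if zone not in boroughs:
        boroughs.append(zone)

print(boroughs)
```

pandas also offers `nyc_data["Zone"].unique()`, which returns the same values in order of first appearance, so it could replace the loop.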

matplotlib.pyplot was used to draw bar charts, which make the data easier to understand. I also used numpy to space the bars evenly in the final graph, based on how many values appear on the x-axis.
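A minimal sketch of that charting step is below. The capture rates are made up, the output file name is a placeholder, and using `np.arange` for the bar positions is one common way to get even spacing (assumed here, not necessarily identical to my final code).

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np

labels = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
rates = [42.0, 45.0, 52.0, 48.0, 50.0]  # made-up values, for illustration only

# One evenly spaced position per label: 0, 1, 2, ...
x = np.arange(len(labels))

fig, ax = plt.subplots()
bars = ax.bar(x, rates, width=0.6)

# Put the borough names under the evenly spaced bars.
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_xlabel("Borough")
ax.set_ylabel("Capture rate (%)")
ax.set_title("Capture rate by borough (made-up data)")

fig.tight_layout()
fig.savefig("capture_rates.png")  # placeholder file name
```

Because the positions come from the number of labels, adding or removing a borough keeps the spacing even without further changes.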