Friday, April 7, 2017

Post 4: Data Normalization, Geocoding, and Error Assessment


Goals and Objectives
As a continuation of the Sand-mines project begun at the start of the semester, this exercise served as a first step toward constructing a suitability and risk model for frac sand mining in western Wisconsin. To that end, the data on sand mines needed to be normalized, the mine addresses needed to be geocoded, and the results needed to be compared to known locations for these mines in order to measure error. Each student was responsible for completing this process for 19 of the mines. Geocoding is the process of converting a description of a location, such as a street address, into a point on the map by matching it against a reference database of known addresses. By completing this process, an accurate map of mine locations could be constructed for later analysis.
Methods
The original mines data was first opened in Excel. From this file, the 19 mines that were a personal responsibility to normalize and geocode were extracted and placed in their own Excel table. For each mine, a field was added to the table for each portion of the complete address entry: PLSS, Street Address, Street (name), Street Type, City, State, and Zip Code. These fields were then populated with the corresponding data taken from the original address field (Figure 1).

Figure 1: A table showing both the unnormalized address entries (Address) and normalized address entries (PLSS, Street Address, Street, Street Type, City, State, Zip Code) for each of the nineteen mines, as organized by the mines' unique IDs. 
This process is known as data normalization. It is done for two primary reasons. First, for ArcGIS to read the address data as proper fields, each address must be broken down into these components, because the program cannot split a whole address string automatically. Second, not every entry in the address field is organized in the same way. Some list the parts of the address in a different order, while others are missing portions of it (e.g., PLSS, zip code, or street name). This reflects how the data was initially recorded by the Wisconsin DNR. It is common for data not to be normalized when it is first received or retrieved from an organization.
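The normalization in this exercise was done by hand in Excel. Purely as an illustration, the same splitting could be sketched in Python. The sketch below assumes a "street, city, state zip" layout, and the function name, field names, street-type list, and sample addresses are all hypothetical rather than taken from the DNR table. Entries that do not fit the assumed layout are left blank for manual review, much as the irregular entries here required.

```python
import re

# Hypothetical, deliberately short list of street-type suffixes.
STREET_TYPES = {"rd", "road", "st", "street", "ave", "avenue",
                "hwy", "highway", "ln", "lane", "dr", "drive"}

FIELD_NAMES = ["street_address", "street", "street_type",
               "city", "state", "zip"]


def normalize_address(raw):
    """Split 'street, city, ST zip' into separate fields.

    Any part that cannot be identified is left as an empty string.
    """
    fields = dict.fromkeys(FIELD_NAMES, "")
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) != 3:
        return fields  # unexpected layout: leave blank for manual review

    street_part, city, state_zip = parts
    fields["street_address"] = street_part
    fields["city"] = city

    # The last part is a two-letter state, optionally followed by a zip code.
    match = re.fullmatch(r"([A-Za-z]{2})(?:\s+(\d{5}))?", state_zip)
    if match:
        fields["state"] = match.group(1).upper()
        fields["zip"] = match.group(2) or ""

    tokens = street_part.split()
    if tokens and tokens[0].isdigit():
        tokens = tokens[1:]  # drop the house number
    if tokens and tokens[-1].lower().rstrip(".") in STREET_TYPES:
        fields["street_type"] = tokens[-1].rstrip(".")
        tokens = tokens[:-1]
    fields["street"] = " ".join(tokens)
    return fields


# Made-up addresses, not entries from the mines table.
complete = normalize_address("123 Main St, Exampleville, WI 54001")
no_zip = normalize_address("456 Oak Road, Exampleville, wi")
plss_only = normalize_address("Section 12 T25N R8W")
```

An entry holding only a PLSS description comes back with every field empty, which is the signal that it needs to be located another way.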
Figure 2: The completion message for the geocoding process. From the message, it can be determined that fourteen mines were matched to a known address, one was matched to two equally likely addresses, and four could not be matched to any known address in the database.
The data was then added to an ArcMap document, along with an Imagery basemap. After logging into the University Enterprise Account, the Geocoding toolbar was activated, and the "Geocode Addresses" option was selected. The World Geocode Service was chosen as the address locator, the address input fields were matched to the fields in the data table, and OK was selected in the window to start the geocoding. When the geocoding was completed, a message appeared displaying the number of matched, tied, and unmatched addresses from the list of mines (Figure 2). From this, it was determined that fourteen of the mine addresses matched a known location, one was matched to two equally likely candidates, and four could not be matched to any address in the database. These last four would need to be matched manually.
The Interactive Rematch inspector window was then opened. With this, the match for each mine was inspected to see how close it was to the mine's actual location. As it turned out, all but one of the matched addresses had not been placed at the location of the mine. Instead, they had been geocoded to the center of the town listed in each address. To compensate for this, these addresses were manually matched in the Interactive Rematch window to what was believed to be the actual corresponding mine location. This was accomplished by looking up the known address in a Google Maps window, by using the ArcMap imagery, and, if those failed, by finding the location using the PLSS address in conjunction with the Wisconsin PLSS sections and PLSS townships shapefiles. The PLSS address identified the subdivision of land (both township and section) in which the mine was located. This was especially critical for the addresses that came up as unmatched in the geocoding process, as these had only a PLSS address listed. These steps and tools were used until every address was matched with what was believed to be its corresponding mine. Afterwards, the data was exported as a point shapefile so it could be analysed.

The completed geocoded mine locations shapefile was added to a new data frame. The true mine locations shapefile was also added to this data frame. Using the Select tool and a query on the unique mine ID field, only the nineteen mines that were a personal responsibility were selected out of the true mine locations shapefile. This allowed the personally geocoded locations to be compared to what were considered the actual locations. In addition, a merge was completed on all the other students' shapefiles of their assigned, geocoded mines. These were made available by each student once their geocoding was complete. The list of mines geocoded by each student had some overlap with others in the class, so the results could be compared against one another. Unfortunately, several students had failed to properly name their mine unique ID field (Mine_Uniqu). To correct for this, a field map was used during the merge to fix these naming errors. In addition, two fields, each in the attribute table of one of the shapefiles, needed to be altered, as they were populated with values that prevented the merge (e.g., words used to represent a null value in a long integer field). Once the merge was completed, the same query and Select tool used on the true mine locations shapefile were applied to the merged shapefile, using the mine unique IDs (Mine_Uniqu) of the personally completed nineteen mines to find the results of other students who had also geocoded these mines. This created a point shapefile containing only these corresponding nineteen mines from the other students' geocoding results.
With shapefiles of the personally geocoded locations, the true mine locations, and the class geocoded locations for the assigned nineteen mines finally ready, they still required formatting before analysis. Each one was reprojected into the NAD 1983 State Plane Wisconsin Central FIPS 4802 projected coordinate system. This was required because the shapefiles were originally stored in a geographic coordinate system, which uses degrees as its unit of measurement for distance and location. By reprojecting them into a projected coordinate system and changing the data frame to match, distances between mine locations could instead be measured in linear meters.
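To illustrate why degrees make a poor unit for measuring distance, the short sketch below estimates how many meters one degree covers on the ground. It is not part of the ArcMap workflow and assumes a simple spherical Earth with a mean radius of 6,371,000 m, so the values are approximations only; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean radius of a spherical Earth


def meters_per_degree_lat():
    """Approximate ground distance covered by one degree of latitude."""
    return math.pi * EARTH_RADIUS_M / 180


def meters_per_degree_lon(lat_deg):
    """Approximate ground distance covered by one degree of longitude
    at the given latitude; it shrinks toward the poles."""
    return meters_per_degree_lat() * math.cos(math.radians(lat_deg))


# At a mid-latitude such as 45 degrees north, a degree of longitude is
# noticeably shorter on the ground than a degree of latitude.
lat_m = meters_per_degree_lat()
lon_m_at_45 = meters_per_degree_lon(45.0)
print(f"1 degree of latitude  ~ {lat_m:,.0f} m")
print(f"1 degree of longitude ~ {lon_m_at_45:,.0f} m at 45 N")
```

Because the two directions scale differently, a distance computed directly from degree values would be distorted, whereas a projected coordinate system gives the same linear unit in both directions.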

The Near tool was used to measure the distance from each personally geocoded mine to the closest "actual" mine location. This was usually the corresponding actual location, whose Mine Unique ID matched that of the geocoded mine. However, this was not the case for one mine, whose nearest actual location belonged to a different mine. In this case, the Measure tool was used to measure the distance between the geocoded location and its corresponding actual location. This data was then added to an Excel table, and several statistical measures (minimum, maximum, mean, median, standard deviation) were calculated. This served as the distance error data between the geocoded locations and the actual locations. In addition, the Near tool was used in a similar way to gather data on the distance between the geocoded mine locations and the corresponding locations geocoded by peers. Only the closest corresponding location from another student was recorded for each mine, rather than all corresponding distances, as this would provide a sample likely to be indicative of the error between the personal locations and those of others. It would also help to point out any locations in the true locations shapefile that may actually be incorrect. In addition, many students had not completed the geocoding process in the allotted time. As a result, the sample data that could be gathered from others was limited. Taking one corresponding location for each of the nineteen mines accounted for about half of the points made available by other students. Several mines had only one corresponding location geocoded by another student. In the case of Mine 328, no other student had geocoded the mine's location. Once the distance between each geocoded mine and its closest corresponding geocoded location was collected, it was likewise recorded in a data table as distance error values, and the same statistical measures were calculated.
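The distance measurement described above was done with the ArcMap Near and Measure tools and summarized in Excel. As a rough sketch of the same idea, the Python below finds the nearest reference point for each geocoded point, flags any mine whose nearest reference point belongs to a different mine ID, and records the distance to the corresponding ID in every case. The coordinates are made up for illustration (in meters, as if already projected) and are not the exercise data; the function names are hypothetical.

```python
import math
import statistics


def planar_distance(a, b):
    """Straight-line distance between two projected (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_id(point, candidates):
    """ID of the candidate location closest to the given point."""
    return min(candidates,
               key=lambda mine_id: planar_distance(point, candidates[mine_id]))


def distance_errors(geocoded, reference):
    """Distance from each geocoded point to the reference point with the
    same mine ID, plus the IDs whose nearest reference point is a
    different mine (the case that needed the Measure tool)."""
    errors = {}
    mismatched = []
    for mine_id, point in geocoded.items():
        if mine_id not in reference:
            continue  # no corresponding location to compare against
        if nearest_id(point, reference) != mine_id:
            mismatched.append(mine_id)
        errors[mine_id] = planar_distance(point, reference[mine_id])
    return errors, mismatched


# Made-up projected coordinates in meters, keyed by a made-up mine ID.
geocoded = {1: (0.0, 0.0), 2: (1000.0, 0.0), 3: (1100.0, 0.0)}
reference = {1: (300.0, 400.0), 2: (1000.0, 150.0), 3: (5000.0, 0.0)}

errors, mismatched = distance_errors(geocoded, reference)
values = list(errors.values())
summary = {
    "minimum": min(values),
    "maximum": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "standard deviation": statistics.stdev(values),  # sample std. deviation
}
```

Matching on the ID rather than trusting the nearest point keeps a misplaced mine from being paired with a neighboring mine and reported with a misleadingly small error.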
Then, the geocoded mine locations shapefile, the true mine locations shapefile, and the class geocoded locations shapefile for the corresponding nineteen mines were used to construct a map that conveys the distances between the points more clearly and efficiently.
Results
Figure 3: A data table showing the distance error between each geocoded mine location and both the nearest geocoded location determined by a classmate and the location determined by the true or given dataset.
As seen in the results, the greatest error between the actual location and the geocoded location is 27465 meters, the minimum distance error is 148 meters, the mean error is 3846 meters, the median error is 650 meters, and the standard deviation is 8049 meters (Figure 3). At first glance this appears to be a huge amount of error. But comparing the median to the mean shows that most of the individual error values fall well below the average error, as the median is far less than the mean. Indeed, when looking at both the distance error values and their corresponding locations on a map (Figure 4), most of the error is relatively minor. The large error values can generally be attributed to mistakes made in identifying the specific mine, while the small error values exist because the geocoded location was placed at the mine's roadside entrance, while the actual locations mark the center of the mine. An example of this minor error is shown by Mine 295 and its corresponding geocoded locations (Figure 4). In addition, the minimum error between the geocoded mines and the closest mine location provided by a peer was 1 meter, the maximum distance error was 56972 meters, the median distance error was 63 meters, the mean distance error was 5740 meters, and the standard deviation was 14662 meters. Once again, the few high error values are over-represented in the mean, while both the median error and the map of geocoded locations show relatively minor error in most of the locations. The few high errors likely arise because the sample was limited to fairly few points, and either the personally geocoded location or the other student's geocoded location for these mines was incorrect, the latter sometimes being the only sample point available.
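The effect of a few large values on the mean can be shown with a small sketch. The numbers below are invented for illustration and are not the values from Figure 3; the function name is hypothetical. The sketch reports the mean, the median, and how many values fall below the mean.

```python
import statistics


def skew_summary(errors):
    """Return the mean, the median, and the count of values below the mean."""
    mean = statistics.mean(errors)
    median = statistics.median(errors)
    below_mean = sum(1 for value in errors if value < mean)
    return mean, median, below_mean


# Invented distance errors in meters: six modest values and one blunder.
sample_errors = [150, 300, 500, 650, 900, 1200, 25000]
mean, median, below_mean = skew_summary(sample_errors)
print(f"mean = {mean}, median = {median}, "
      f"{below_mean} of {len(sample_errors)} values fall below the mean")
```

In this invented sample a single large value pulls the mean well above every other value, while the median stays close to the typical error, which is the same pattern described for the mine data.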
Figure 4: A map depicting the personally geocoded mine locations, the actual mine locations, and the corresponding geocoded mine locations provided by other students (left), an additional map depicting the geocoded mine locations of Mine 295 over an imagery basemap (center-right), and a reference map depicting all the geocoded mine locations in relation to the whole of Wisconsin (bottom right).
Discussion
Of the distance errors collected, the largest ones, which resulted in geocoded locations being placed at entirely separate mines, are a form of gross error. They are the result of a mistake or blunder made either while selecting the mine location in the geocoding process or, in one case, while recording the address of the mine. In the case of Mine 274, the geocoded location was likely placed at the wrong mine, as its distance error is roughly 27,000 meters from the actual location, and it also has the highest distance error from its corresponding location placed by another student. In Mine 247's case, however, the fault likely lies with whoever originally created the data table. Its distance error from the reported actual location is 24,000 meters, while its distance error from the nearest other student's point is only 9 meters. After reviewing its PLSS address, it is clear that the location where Mine 247 is placed by the "actual" data is not the address recorded in the data table. Indeed, the location placed during the geocoding process, both personally and by other students, falls within the correct PLSS section. Because of this, it can be assumed that addresses may have been mixed up when this data was first tabulated. Both of these errors can be classified as gross, operational errors, arising from a large mistake made either when creating the original data table or when analyzing it. Additionally, these errors can be described as operational attribute data input errors. In other words, they are errors in the input of attribute data: in the Address field for Mine 247 when the data was first tabulated, and in the input of the x and y location during the geocoding process for Mine 274 and others like it.

The small amounts of error, as with Mine 295, are not the result of mistakes and blunders made by the operator or data analyst. Instead, this error is a combination of systematic and random error. For these locations, all points were correctly placed on the proper mine. However, their specific locations within each mine differ. This is a result of the actual locations having been generalized by computer to the exact center of the mine, while each student, under instruction, chose what they believed to be the entrance of the mine as its location. This consistent bias is what is known as systematic error. Random error, the most minor of all, arises because a person can only be so precise when manually placing points during geocoding, and because the imagery used has a maximum resolution, so a point cannot be placed at the exact entrance of each mine. The bias in placing points is a type of geographic error resulting from data attribute error, a source of data automation and compilation error. However, rather than being a source of operational error (a mistake), it is inherent error, which is minor error that is expected and unavoidable during the process. Finally, the limit on accuracy imposed by human and resolution limitations is inherent image analysis error, or the error that occurs based on the quality of the image and the precision of its analysis.
But what does this all mean for the geocoded mine location data? Points with relatively minor error resulting from inherent random or systematic error are not wrong. This data can be considered correct. However, it is important to remember the source of the error in order to minimize it later on or to eliminate bias in future data collection. The data that needs to be thrown out consists of the points with large amounts of gross operational error. These points are usually drastically off from their real-world locations. It is critical to avoid using these points in further analysis, as they may lead to a false conclusion that is far from what the data should support.
Sources
Hupy, C. (2017) Exercise 6: Data Normalization, Geocoding, and Error Assessment: Sand Mining Suitability Project. Eau Claire, WI.

Mine location data, PLSS townships shapefile, and PLSS Sections shapefile provided by the Wisconsin Department of Natural Resources (2017)
