The Data Science Behind an Address
What is an address? Most people will respond that an address is where they live, where they work, or where they go in their spare time. An address is what you write on an envelope you’re mailing, or where your online orders are shipped. An address is both a geographic place and an identity.
Address data has different meanings to different organizations, depending on how they incorporate it into their workflows. I’ve been building address data for a long time, and whenever I think of an address, my brain immediately associates it with a static, physical location. For many organizations, however, the concept of an address has expanded greatly in the last decade to include not only physical street locations but also consumer digital addresses, such as emails and social handles. Moving forward, the data science behind address data needs to bridge the gap between physical and digital addresses.
A lot goes into building address data and continuously maintaining it to keep up with real-world change. Street address components often change due to new infrastructure, subdivision of land parcels, and redrawn postcode boundaries. Streets and other address entities are also frequently renamed to honor influential people, such as Martin Luther King Jr. In each of these cases, the address changes even though the location persists.
In addition, the same location can have multiple vanity, alias, or abbreviated addresses associated with it. The following three addresses are all valid and represent the same location, despite having different city names:
- 3001 Abby Way, Loveland, OH 45140 (primary city name)
- 3001 Abby Way, Montgomery, OH 45140 (secondary city name)
- 3001 Abby Way, Murdock, OH 45140 (vanity city name)
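One way to handle alias city names in code is to map every accepted variant to a single comparison key. The sketch below is a minimal illustration built on the three addresses above; the `CITY_ALIASES` table, the `location_key` function, and the key layout are hypothetical names chosen for this example, not part of any real product or reference dataset.

```python
# Hypothetical alias table: each accepted city name for a postcode maps
# to the primary city name. The entries come from the example above.
CITY_ALIASES = {
    ("45140", "LOVELAND"): "LOVELAND",
    ("45140", "MONTGOMERY"): "LOVELAND",
    ("45140", "MURDOCK"): "LOVELAND",
}


def location_key(street: str, city: str, state: str, postcode: str) -> tuple:
    """Return a comparison key that is identical for all alias forms."""
    city_norm = city.strip().upper()
    # Fall back to the city as given when no alias is known.
    primary_city = CITY_ALIASES.get((postcode, city_norm), city_norm)
    street_norm = " ".join(street.upper().split())
    return (street_norm, primary_city, state.strip().upper(), postcode)


addresses = [
    ("3001 Abby Way", "Loveland", "OH", "45140"),
    ("3001 Abby Way", "Montgomery", "OH", "45140"),
    ("3001 Abby Way", "Murdock", "OH", "45140"),
]

# All three alias forms collapse to one key.
keys = {location_key(*a) for a in addresses}
```

Keying the alias table on both postcode and city name keeps a vanity name in one postcode from being rewritten when the same name is a primary city somewhere else.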
Addresses also have relationships to each other. Apartment buildings and office complexes often have one main address for the entire structure, while each unit possesses its own related address. In the world of address data science, these address relationships are known as genealogy, with parent addresses representing the main structure and child addresses denoting those within it.
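A parent/child relationship like this is often modeled by storing a reference to the parent on each child record. The following is a minimal sketch under that assumption; the `AddressRecord` class, its field names, and the sample addresses are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddressRecord:
    address_id: str            # hypothetical identifier
    line: str
    parent_id: Optional[str]   # None for a parent (building-level) address


def children_by_parent(records):
    """Group child address IDs under the ID of their parent address."""
    tree = defaultdict(list)
    for rec in records:
        if rec.parent_id is not None:
            tree[rec.parent_id].append(rec.address_id)
    return dict(tree)


records = [
    AddressRecord("A1", "100 Main St", None),
    AddressRecord("A2", "100 Main St Apt 1", "A1"),
    AddressRecord("A3", "100 Main St Apt 2", "A1"),
    AddressRecord("B1", "200 Oak Ave", None),
]
genealogy = children_by_parent(records)
```

Here `genealogy` lists the two apartment records under the building `A1`, while `B1` has no children and therefore does not appear as a key.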
Curating high quality address data
As the saying goes, bad data in means bad data out. For address analytics to work properly, underlying reference data must account for these complexities. When I build address data, I look for completeness, correctness, currency, and coverage.
Address formats and data elements vary tremendously across the globe. For an address to be both complete and correct, it needs all of the required data elements for its particular geography. An incomplete address can greatly degrade performance and skew the results of an analysis.
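A completeness check can be sketched as a lookup of required elements per country. The field lists below are assumptions made for this example; real requirements vary by geography and should come from an authoritative postal reference.

```python
# Hypothetical required-field lists, keyed by country code.
REQUIRED_FIELDS = {
    "US": ("house_number", "street", "city", "state", "postcode"),
    "CA": ("house_number", "street", "city", "province", "postcode"),
}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    country = record.get("country")
    if country not in REQUIRED_FIELDS:
        # Without a known country we cannot tell which elements are required.
        return ["country"]
    return [
        field
        for field in REQUIRED_FIELDS[country]
        if not str(record.get(field) or "").strip()
    ]


record = {
    "country": "US",
    "house_number": "3001",
    "street": "Abby Way",
    "city": "Loveland",
    "state": "OH",
    "postcode": "",
}
gaps = missing_fields(record)
```

For this record, `gaps` contains only `"postcode"`, which flags the address as incomplete before it reaches any downstream analysis.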
Separate from that is the currency (freshness) of the address data itself. Due to real-world changes such as E-911 readdressing, annexations, or even countrywide addressing changes, address data that was current yesterday could be outdated today. Such a record may still look complete and correct from a formatting perspective, yet it is no longer current or usable. Good systems for maintaining the currency of address data are critical because currency issues are stealthy and rarely obvious to the end user.
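Because a stale record can look perfectly well formed, currency usually has to be tracked with metadata rather than inferred from the address text. The sketch below assumes each record carries a hypothetical `last_verified` date and flags anything older than a chosen threshold; the one-year default is an arbitrary illustration, not a recommendation.

```python
from datetime import date, timedelta


def stale_records(records, as_of: date, max_age_days: int = 365):
    """Return IDs of records whose last verification is older than the limit."""
    cutoff = as_of - timedelta(days=max_age_days)
    return [r["id"] for r in records if r["last_verified"] < cutoff]


records = [
    {"id": "A1", "last_verified": date(2024, 1, 15)},
    {"id": "A2", "last_verified": date(2021, 6, 1)},
]
stale = stale_records(records, as_of=date(2024, 6, 1))
```

In this example only `A2` is flagged, even though nothing about its formatting would reveal that it is out of date.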
Lastly, the coverage of the address data is another aspect to consider. Entire countries or regions could be missing, ultimately affecting the results of analytics. Because the real-world universe of addresses is huge, a spatially representative, statistically significant sample can be used as a proxy for reality to confirm that geographic coverage meets expectations.
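The sampling idea can be sketched as a stratified draw per region followed by a match-rate calculation. Everything here is an assumption for illustration: the function names, the tiny address lists, and the fixed `per_region` size, which in practice would be chosen to give a statistically meaningful sample.

```python
import random
from collections import defaultdict


def stratified_sample(addresses_by_region, per_region, seed=0):
    """Draw up to `per_region` known real-world addresses from each region."""
    rng = random.Random(seed)
    sample = []
    for region, keys in sorted(addresses_by_region.items()):
        k = min(per_region, len(keys))
        sample.extend((region, key) for key in rng.sample(keys, k))
    return sample


def coverage_by_region(sample, reference_keys):
    """Share of sampled addresses found in the reference data, per region."""
    found = defaultdict(int)
    total = defaultdict(int)
    for region, key in sample:
        total[region] += 1
        if key in reference_keys:
            found[region] += 1
    return {region: found[region] / total[region] for region in total}


# Toy ground truth (lists, because random.sample needs a sequence).
ground_truth = {
    "OH": ["1 A ST", "2 A ST", "3 A ST", "4 A ST"],
    "ON": ["1 B RD", "2 B RD"],
}
# Toy reference dataset that is missing one Ontario address.
reference = {"1 A ST", "2 A ST", "3 A ST", "4 A ST", "1 B RD"}

sample = stratified_sample(ground_truth, per_region=2)
rates = coverage_by_region(sample, reference)
```

Sampling within each region, rather than across the whole dataset, keeps a sparsely populated region from being drowned out by a dense one.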
Operationalizing an address
After mastering these complexities, data scientists operationalize addresses to derive business insights and influence strategic decisions. This can take many forms. Geocoding assigns a latitude and longitude coordinate pair to an address, allowing data scientists to analyze its exact geographic location. Address validation tools ingest addresses and return a standardized, verified version of the address. Both geocoding and address validation incorporate the address data complexities listed above to ensure analytics of the highest quality.
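To make the two operations concrete, here is a toy sketch of standardization followed by a coordinate lookup. It is not how any production geocoder or validator works: the `SUFFIXES` subset, the `REFERENCE` table, the sample address, and its placeholder coordinates are all assumptions made for this example.

```python
from typing import Optional, Tuple

# A small, illustrative subset of street-suffix abbreviations.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}

# Hypothetical reference table: standardized address -> (latitude, longitude).
# The coordinates are placeholders, not real surveyed values.
REFERENCE = {
    "100 MAIN ST, SPRINGFIELD, OH": (40.0, -83.0),
}


def standardize(raw: str) -> str:
    """Uppercase, trim whitespace, and abbreviate a trailing street suffix."""
    parts = [p.strip() for p in raw.upper().split(",")]
    words = parts[0].split()
    if words:
        last = words[-1].rstrip(".")
        words[-1] = SUFFIXES.get(last, last)
    parts[0] = " ".join(words)
    return ", ".join(parts)


def geocode(raw: str) -> Optional[Tuple[float, float]]:
    """Return coordinates for a known address, or None if it is not found."""
    return REFERENCE.get(standardize(raw))


validated = standardize("100 main street,  Springfield, oh")
coords = geocode("100 main street,  Springfield, oh")
```

Standardizing before the lookup is what lets differently typed versions of the same address resolve to one reference record.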
Precisely’s Address Fabric dataset also uses address data science to enable analytics, but instead of an application like geocoding or address validation, it provides the most comprehensive address list for the United States and Canada in a flat file. Each record is pre-geocoded to provide the most accurate latitude and longitude coordinate location for that address, and appended with a unique and persistent identifier, the PreciselyID. The PreciselyID persists when address elements change, reveals address relationships, and represents one single identifier for locations with multiple alias addresses. This facilitates more efficient data management, exchange, and enrichment. Address Fabric is unique because it combines the geographic coordinates used by a geocoder with the address elements that power address validation systems, all in one flat file dataset that can be used in any database or analytics environment.
The PreciselyID opens up a world of possibilities for data enrichment. As a unique and persistent identifier, the PreciselyID enables organizations to append contextual information to an address. Appended data could describe the demographic profile of the area, the physical characteristics of the property, or the social handles associated with that location. The PreciselyID can bridge the gap between physical and digital addresses and unlock deeper insights for more detailed analytics.
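The value of a persistent identifier for enrichment can be shown with a simple keyed join. This is only a sketch: the `LOC-001` identifier, the field names, and the sample records are invented for illustration and do not reflect the real PreciselyID format or the Address Fabric schema.

```python
# Hypothetical address and attribute tables, both keyed by a persistent ID.
addresses = {
    "LOC-001": {"line": "100 Main St", "city": "Springfield"},
}
property_attributes = {
    "LOC-001": {"building_type": "single_family"},
}


def enrich(addresses: dict, *attribute_tables: dict) -> dict:
    """Merge attribute tables into address records that share the same ID."""
    enriched = {}
    for loc_id, record in addresses.items():
        merged = dict(record)
        for table in attribute_tables:
            merged.update(table.get(loc_id, {}))
        enriched[loc_id] = merged
    return enriched


before = enrich(addresses, property_attributes)

# The street is renamed, but the ID is unchanged, so the same
# enrichment still attaches to the location.
addresses["LOC-001"]["line"] = "100 Martin Luther King Jr Blvd"
after = enrich(addresses, property_attributes)
```

Joining on the identifier rather than on the address text means the enrichment survives renames and alias forms without any fuzzy matching.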
An address is many things, and more than just what you use to send a piece of mail. To learn more about Address Fabric and download a free sample, visit the Precisely Data Experience.