The art and science of data unification
Definition: In anatomy, articulation is defined as the process of assembly or the act of joining, in particular of bones or movable anatomical segments. In thought, it is more creative and expressive, but it implies logical, concise, and unambiguous expression. These are the very same considerations we give to data articulation.
Data Articulation takes information in the form of records (rows and columns) from disparate sources and combines them in a rational way. The process is sometimes known as “merge-purge”, but it is more than that. Articulation covers attribution, "hierarchalization" (my made-up word for establishing logical parental relationships as part of normalization), and deduplication, and it is one important aspect of what I call “data rationalization”.
Attribution is a forensic process of finding the causes of apparent outcomes or events (like which campaigns caused a conversion). This concept is very important to marketing and business analysis, and it is a necessary part of root-cause analysis: if outcome X is a function of A, B, and C, attribution determines which of those inputs actually drove X. Attribution is enforced with hierarchies; a toy sketch follows.
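To make the idea concrete, here is a minimal sketch of one common attribution rule, last-touch attribution. The data, field names, and the choice of last-touch are assumptions for illustration only; real attribution models vary widely.

```python
# A toy last-touch attribution sketch: credit each conversion to the most
# recent campaign touch that happened before it. All data is hypothetical.
touches = [  # (timestamp, campaign)
    (1, "email_blast"),
    (3, "search_ad"),
    (5, "retargeting"),
]
conversion_time = 4

# The last touch at or before the conversion gets the credit.
eligible = [(t, c) for t, c in touches if t <= conversion_time]
credited = max(eligible)[1] if eligible else None
print(credited)  # -> search_ad
```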
Hierarchalization is a made-up word for creating logical data hierarchies, such as joining a list of contacts (where an account name is shown on each contact row) to an account registry in which accounts are defined as organizations. The contact list and the account registry do not share discrete keys or unique identifiers (UIDs), so they must be joined with natural key-making processes… articulation.
Deduplication is also known as merge-purge: combining the attributes of all data replicants while keeping only one representative record. The surviving record is often seen as the parent of the contributing records that were archived as “duplicates”. More importantly, however, the survivor record represents the combined “DNA” of its contributing duplicates.
The use case: the most prominent use case for data articulation in business is the need to “know” everything potentially useful about your marketplace participants: your product/service consumers, prospects (and resellers or partners for more complex business models). The assumption is that relevant data about these marketplace actors is spread across many systems: spreadsheets, ERP/financial and billing systems, CRMs, ecommerce, POS systems, data warehouses, third-party SaaS tools, web analytics, social channels, and operations/SCM or fulfillment applications. Because no business has a single, rational database for all information that could be used to “know” the customer, you can’t rely on any one application or database to derive a “360° view” of the customer. You need something to represent the complete data object across all systems.
This is the case for a customer data platform (CDP). But to arrive at a functioning system that tells the complete story of the customer, you need to articulate… both as an initial effort at launch, and in an ongoing way to maintain the consolidated view regardless of how data is updated or extended across applications and channels.
There are other use cases to be sure, but the CDP is the elephant in the room.
The Process: Articulation involves:
Preparation:
1. Finding representative data sources
2. Evaluating the contributions that can come from these data sources
3. Reviewing the data structures (schemas) from contributing systems
Execution:
4. Building a common/singular data model (see the sketch after this list)
5. Matching records
6. Merging records
7. Building hierarchical keys
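As an illustration of step 4, here is a minimal sketch of what a common/singular data model might look like. The field names and the choice of a contact-centric model are assumptions for the example, not a prescribed schema.

```python
# A minimal sketch of a common/singular data model that disparate source
# records are mapped into. All field names here are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UnifiedContact:
    """One canonical record per real-world marketplace participant."""
    contact_id: str                                  # articulated unique identifier
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    secondary_phone: Optional[str] = None            # supplemental value kept on merge
    account_id: Optional[str] = None                 # parent key from hierarchalization
    source_ids: dict = field(default_factory=dict)   # e.g. {"crm": "C-42", "erp": "E-9"}
```

Each contributing system maps its own rows into this one shape before matching and merging begin.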
Matching occurs by applying one or more of the following approaches (these can be used in evaluation passes or batches); a code sketch follows the list:
Precise UID: if there are any unique IDs shared between systems, this is a theoretically perfect (deductive) matching method requiring an “exact-match” technique.
Match code: building a match-key from natural key identifiers by taking data snippets and creating a key that makes matches likely. Match codes are inductive, not deductive, and can be evaluated using “exact-match” or fuzzy algorithms.
Fuzzy matching: algorithms like Soundex, or more advanced variations of that kind of predictive evaluation, yield probable matches at your chosen degrees of confidence. Fuzzy matching is, by definition, inductive.
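Here is a minimal sketch of the second and third approaches. The snippet lengths, the fields used to build the match code, and the use of a simple similarity ratio are all assumptions for illustration; production matchers are considerably more elaborate.

```python
# A minimal sketch of match-code and fuzzy matching on contact-like records.
# Field choices and snippet lengths are illustrative assumptions.
import difflib
import re

def match_code(name: str, zip_code: str) -> str:
    """Build a natural match-key from data snippets: name prefix + zip prefix."""
    normalized = re.sub(r"[^a-z]", "", name.lower())
    return f"{normalized[:6]}|{zip_code[:5]}"

def fuzzy_score(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; higher means a more probable match."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match codes let name variants collide on an exact-match comparison...
print(match_code("Acme Corp.", "10001") == match_code("ACME Corporation", "10001"))  # True

# ...while fuzzy scores grade candidate pairs by degree of confidence.
print(round(fuzzy_score("Jon Smith", "John Smith"), 2))  # 0.95
```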
A note about matching for real-world marketplace participants: the idea is that there is really only one participant who is unique in the world, yet many data sets with representative data you need to “know your customer”. You want to end up with one record (or data object) that describes that target business, person, or household. Matching in any way except using authoritative, trusted deductive IDs is inductive, and therefore subject to errors. You have the possibility of matching data incorrectly, and if you go to the next step, the merge, you end up corrupting the data somewhat. The analysts face the possibility of false positives that can taint the data. The bet is that the final data set is better (far better) than the disparate data, so you tolerate some possible error.
But the opposite can also happen. If you are not fault tolerant, you will look for ways to increase the confidence levels of inductive matching and thereby miss records that do or should match. You can’t “see” the articulated data because you are afraid to be wrong. In these cases, your resulting records will be more porous in exchange for more certainty. In the end, the person applying the articulation processes is the one doing the adjudication.
The person who makes the decision must be given authority commensurate with their responsibility.
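To illustrate the trade-off just described, here is a toy sketch adjudicating the same candidate pairs at two hypothetical confidence thresholds: the loose threshold admits a false positive, while the strict one misses a plausible true match. The pairs and thresholds are invented for the example.

```python
# A toy illustration of the confidence trade-off: the same candidate pairs
# accepted or rejected at two hypothetical thresholds.
import difflib

pairs = [
    ("John Smith", "Jon Smith"),   # very likely the same person
    ("Ann Lee", "Anne Leigh"),     # plausibly the same person
    ("Acme Corp", "Apex Corp"),    # likely different organizations
]

for a, b in pairs:
    score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    for threshold in (0.70, 0.90):
        verdict = "match" if score >= threshold else "no match"
        print(f"{a!r} vs {b!r} @ {threshold:.2f}: {score:.2f} -> {verdict}")
```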
Three field-merge functions (a combined sketch follows the list):
1. Fill/transfer null values in the model with the best values from all complements.
E.g., one address line is poorly typed and non-standardized, while a duplicate has a well-formed, validated address. Replace the original value in the primary with the better value from the duplicate.
2. Add fields to the primary record to create one record with the full data model: singular, non-conflicting, trusted attributes from all counterparts.
E.g., the primary record has a phone number but no email. One of the duplicates has an email but no phone for the same/matching contact. The surviving record has both fields.
3. Add fields from record complements that contain supplemental values to the primary record.
E.g., both the primary and duplicate have phone numbers, but they are different numbers, and you don’t want to be forced to choose one or the other. You need to keep one as the primary phone and the other as the secondary phone.
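Here is a minimal combined sketch of the three functions, using plain dictionaries as records. For simplicity it fills empty fields rather than scoring value quality (so the “best value” choice in function 1 is reduced to null-filling), and the field names are assumptions.

```python
# A minimal sketch of the three field-merge functions over dict records.
# Value-quality scoring is omitted; empty/None fields stand in for "worse" values.
def merge_records(primary: dict, duplicate: dict) -> dict:
    survivor = dict(primary)
    for fld, dup_value in duplicate.items():
        if not dup_value:
            continue  # the duplicate has nothing to contribute here
        current = survivor.get(fld)
        if not current:
            # Functions 1 & 2: fill a null/missing field with the available value
            survivor[fld] = dup_value
        elif current != dup_value:
            # Function 3: keep a conflicting value as a supplemental (secondary) field
            survivor.setdefault(f"secondary_{fld}", dup_value)
    return survivor

primary = {"name": "Jane Doe", "phone": "555-0100", "email": None}
duplicate = {"name": "Jane Doe", "phone": "555-0199", "email": "jane@example.com"}
print(merge_records(primary, duplicate))
# {'name': 'Jane Doe', 'phone': '555-0100', 'email': 'jane@example.com',
#  'secondary_phone': '555-0199'}
```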
Note: hierarchalization is intended to create parent-child relationships (either in one table or several). The objective is to create unique parent record keys and establish an association between parent and child. The process uses function #1 above, where the field being filled is on the child record, receiving the parent record ID in a foreign-key position.
E.g., if one data set has accounts with account numbers, and another data set has contacts with account names, you can match them and then transfer the parent ID to the child as a foreign key.
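A minimal sketch of that example follows. The normalization rule, the record shapes, and the list of legal suffixes to strip are assumptions chosen for illustration.

```python
# A minimal hierarchalization sketch: match contacts to an account registry
# by a natural key, then write the parent account_id into each child record.
accounts = [
    {"account_id": "A-001", "account_name": "Acme Corporation"},
    {"account_id": "A-002", "account_name": "Globex Inc"},
]
contacts = [
    {"contact": "Jane Doe", "account_name": "ACME Corporation"},
    {"contact": "John Roe", "account_name": "Globex Inc."},
]

def normalize(name: str) -> str:
    """A crude natural key: lowercase, drop punctuation and legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    words = [w for w in cleaned.split() if w not in {"inc", "corp", "corporation", "llc"}]
    return " ".join(words)

registry = {normalize(a["account_name"]): a["account_id"] for a in accounts}
for contact in contacts:
    # Fill the child's foreign-key field from the matched parent (function #1)
    contact["account_id"] = registry.get(normalize(contact["account_name"]))

print(contacts)
# [{'contact': 'Jane Doe', 'account_name': 'ACME Corporation', 'account_id': 'A-001'},
#  {'contact': 'John Roe', 'account_name': 'Globex Inc.', 'account_id': 'A-002'}]
```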
Merging duplicative records involves removing or archiving the non-primary contributors, whereas hierarchalization involves keeping all records in a normalized, logical data set. In the end, you are left with one canonical record set whose articulation gives you a clear and complete understanding of the marketplace participant.
Ascendant Group Inc. has data science experts who can help you bring your business data from distributed information sources into a singular perspective. Call us for help with your next project.