SAP Data Warehouse Cloud (DWC) is a data warehouse that specifically targets business users. DWC does so by focusing on superior ease of use, super-simple data collaboration in DWC spaces, seamless integration of data from both SAP and non-SAP sources and, of course, tight integration with SAP Analytics Cloud. With upcoming SAP Data Marketplace, it will be even simpler to find relevant, accurate and detailed data that can be blended with the inhouse data from customers’ SAP systems.
DWC thus easily enables entirely new use cases like comparing a company’s own salaries with those typical in the job market for comparable roles, comparing own business performance with external economic indicators, enriching one’s business partner data with external ratings or economic performance and many others.
It is fully clear though, that the challenge of data integration does not end once the data is brought from SAP Data Marketplace or custom source systesms into a user’s DWC space. The act of physically joining one’s own and these newly acquired data sources is a challenge in its own right: unless all involved data is superbly clean, any straightforward join between two datasets that belong together “in principle” will regularly suffers some or even many exceptions – be it special characters, typos, duplicates, fuzziness in the data, leading zeros, spelling differences or many more. For these exceptions, decisions need to made by subject matter experts in the business (since they know the data) without the overhead of complex SQL code or long transformation in ETL logic. In our experience, dealing with these exceptions often causes 90% of the effort and data integration costs.
This is exactly where the Intelligent Lookup operator in DWC’s Data Builder comes in and provides users with a dedicated, own data matching environment
- Interactive Environment for User Mappings
Users efficiently create mappings & mapping rules in an intuitive no-code UI.
Results are shown immediately. Data is taken as it is.
- Mapping Automation through Rules
Rules create mappings automatically.
Users can add or rework mappings where required – down to record level
- Mappings need to be created only once
All automated & manual mappings are persisted in a “pairing table”
No further user intervention is required ever again on rows where these mappings are applicable
- Full integration with rest of SAP Data Warehouse Cloud
Harmonization result is a standard entity within DWC’s Data Builder that all modelling entities and consuming tools can leverage
Let’s walk through a simple example end-to-end. Imagine you are a marketeer analyzing sales coming from purchases in the company’s webshop. Your sales department shared the raw data that might look like this
As a first goal, you’d like to analyze the data by continent & also plot it on a map. To that end, you surfed the web and found city information for 100k cities in a Google BigQuery dataset that contains all those columns that you need, esp. continent information and latitude/longitude
So far, so good. Now what is your next step in bringing the two datasets together? You put on your data modelling hat, build a simple view that joins the two datasets on city name, region name (=subdivision in city dataset) and country name, mark the view as Analytical Dataset and expose it for consumption to SAP Analytics Cloud (SAC).
You then proceed to build you story in SAC. But, gosh, when looking at your data, you realize you are in unforseen trouble! The join essentially fails in 80% of your cases to identify the correct record in the cities table: continent = NULL for almost all records!!
Intelligent Lookup solves the problem with an entirely new approach by iteratively building up a match table in the background that keeps track of what entry in the one table matches to what record in the other table.
Adding input & lookup entities
As you see in the video, Intelligent Lookup is a standard operator in DWC that creates an artefact that can subsequently be used anywhere else in DWC, just like a normal Graphical or SQL View. After creating a new Intelligent Lookup, users see a template canvas that they need to fill with life.
Firstly, they identify which is their input dataset. One decision they need to take is the granularity at which matches are made or the foreign key in the input dataset that uniquely identifies a record in the lookup dataset. We also call this column the “pairing column”. I’ll be writing an own blog about this soon. Don’t worry if this feels opaque for now – it’s a rather straightforward concept when explained well.
Next you define the lookup dataset, i.e. the dataset from which you want to blend in data to each and every record of your input dataset. Here, we use the above-mentioned Google BigQuery dataset. We subsequently choose which of its columns shall be added to the input dataset. In the case at hand, we are interested in looking up the continent, latitude and longitude of each and every city and thus pick those as “return columns”
Define your first rule – then deploy & run
Next step of the process is now to define the first matching rule. This looks like a join definition, but you’ll see that it actually is a lot more than this and quite a bit more flexible: Once the two entities are defined, you jump to the rule properties, choose the match strategy (exact or fuzzy) and choose the participating columns.
In our case, we want to exactly compare city to city_name, region to subdivision and finally country to country_name. This is actually exactly what our join operator did above, but you’ll soon see where the difference is.
We now need to deploy & run the operator. In the background, Intelligent Lookup deploys the match table (aka “pairing table”), the respective code snippets that do the match logic as well as the final output artefact that effectively joins over input table, pairing table and lookup table.
After deployment, we run the operator and inspect its results. Not fully unexpectedly, the result is exactly what the join did: ~80% “unmatched” and ~20% “matched”. With matched, this means that the rule could identify which LocationId of the input dataset belongs to which geo_id of the lookup dataset by exactly comparing those 3 fields. As with the join, that comparison proved to be unsuccessful for ~80% of the records
Adding rule #2 to work through all records that did not pass rule #1
The trick with Intelligent Lookup is though that you can take those 80% unmatched records and pipe them into yet another rule of type exact match. This time we only compare city to city_name & country to country_name, thus leaving aside region / subdivision.
This rule is less strict, but is able to find a matching city record for 46% of the input records!! This means that for all those 7646 records, comparing just city & country names was good enough to uniquely make out the city (and thus its continent, latitude & longitude) in the lookup dataset.
The price for a less strict rule rule though is that it sometimes cannot finally make a decision because there is indeed more than one match. For example, those 102 sales transactions that belong to Walldorf/Germany cannot uniquely be assigned by this rule to any given record of the lookup datasets, because there are actually two cities of name Walldorf in Germany: one in Baden-Württemberg and one in Thüringen.
When you check the details in the screenshot, you will realize why the join (and the first rule) failed on this record: input region is Baden-Württemberg while lookup subdivision is Baden-Wurttemberg Region. These are obviously semantically the same, but dumb computers cannot see this – only smart humans can. This is why Intelligent Lookup puts back the control over match decisions back into the hands of the subject matter experts in the LoB by giving them the possibility to resolve these simple albeit crucial situations by themselves without the need for IT involvement.
As you see in the GIF, with just 2 decisions & clicks, the user matches 102 sales transactions for Walldorf/Germany as well as 1614 sales transactions for Berlin/Germany!
Add a third rule that does fuzzy matching on everything still unmatched
But the story doesn’t end there. Intelligent Lookup plays out its core strength by also supporting fuzzy comparisons in match rules. We now pipe those remaining 17% of unmatched cases straight into a fuzzy rule that jus checks on the city name. What this rule does is that it fuzzily compares the city name of the input records with each every city of the lookup records.
Due to HANA, even with 109.000 cities and ~3000 open input records, this is done in a couple of seconds and the result is actually pretty stunning: For 1222 records, we simply need to review and confirm the results – but note that we do this in groups of transactions that share the same LocationId (= your pairing column value).
Examples you see below are e.g. cases like Zurich ~ Zürich (damned be those German umlauts!), Furtwangen ~ Furtwangen im Schwarzwald or Leibnitz ~ Leibnitz, Styria. Users might need to reject certain records where fuzzy yielded unexpected results, but for most cases, this is just what we were looking for
As with rule #3, we again have multiple matches, i.e. match candidates that the rule proposes to us. For the example in the picture, Frankfurt/Germany, the fuzzy rule found 11 match candidates to work through. The same is true for many other places (e.g. 55 sales transactions for Saint Moritz/Switzerland fuzzily matching to Mori/Japan, Mori/Italy, Mor/Hungary as well as St. Moritz/Switzerland).
Reducing match candidates by adding rule #4
Users could now work through those match candidates manually as shown above, or they could let the system do this by letting a rule sieve through these lists of candidates. By chaining a rule to the multiple matches, users can effectively limit those match candidates even more and thus only work through the most relevant ones. In our case, for example by enforcing that the country on both sides actually need to be the same, we get rid of all those Frankforts in various US states.
As you see in the screen shot, rule #4 helped reduce the number of match candidates for Frankfurt/Germany down from 11 to 2, thus making it simple & efficient for business users to take the final call between those proposals.
Inspecting the output
In less than 15′ we now went from 13.000 unmatched records down to just a few dozen. We still need to work through some match candidates if we so choose, or we leave it at that.
In the output view configuration, we can now decide how we continue from here:
- We can either include or exclude unmatched elements, i.e. any record that is neither matched or in review (you will be able to exclude items in review very soon). If you choose to include unmatched records, obviously the value of their lookup columns (that we chose to be continent, latitude & longitude) is set to NULL.
- If you so please, you can assign a default value to them, allowing for a better reporting. In the example in the screenshot we set continent to Unknown and left latitude/longitude at their default NULL,
- You can optionally add own boolean indicator column that contains True for all unmatched records and False otherwise. This then allows users in upper views or in reporting to distinguish between default values and “real” values – something that e.g. could otherwise be hard for default dates or default numbers. For strings this should be less of an issue normally.
Building a view based on an Intelligent Lookup
The Intelligent Lookup is a first class citizen among DWC’s Graphical Views, SQL Views, Business Entities, Data Flows etc. This means that it can be used as data source for any of those just as if it was a normal view.
This is really neat, because it thus replaces the join that yielded insufficient results by a cleaner, fully reusable entity.
Consuming results from Intelligent Lookup in SAP Analytics Cloud
Now let’s rebuild the story from above in SAP Analytics Cloud. As you see in the screenshot, we see that the picture has entirely changed.We went from ~14000 unmatched records in category NULL down to 111 records in category Unknown within not much more than 15′ of work!
I let you make up your own minds on this, but we feel that with Intelligent Lookup, DWC is offering a transparent, iterative, efficient process which is a MASSIVE step forward for giving LoB users the tools in hand that they need to solve the data interaction challenges that cross their ways.
Intelligent Lookup is an entirely new and ground-breaking way of solving one of the most painful problems in data integration: bringing data together that semantically fits, but doesn’t quite fit on a technical (i.e. join key) level.
By allowing users to design their data knowledge into data matching rules that either automatically match the records or allow users to take a final match decision, Intelligent Lookup strikes a very clever balance between automation and end-user-centric decision making.
And don’t forget: if new data is loaded into the transaction table, all these matches will automatically be reused without any further user intervention!! For any record that has a LocationId for which a match to the lookup table already has been made (e.g. like we did above for the case of Walldorf/Germany above, essentially matching its foreign key LocationId 74 in the input entity to the table key geo_id GEO00072463 of the lookup entity), this match will now be used automatically. No further running of the rule required!
This means that we just created a re-usable entity whose value grows over time, that benefits from each and every decision by subject matter experts and don’t ever asks them again for the same match making.
We’ll be publishing a lot more information about Intelligent Lookup, because to be honest, we are really proud about it. Stay tuned!