
How to make SAP Data Services better

I have been married to SAP Data Services (BODS) since 1999. While employed by Acta, later BusinessObjects, and finally SAP, I used BODS as a consultant, delivered trainings, created training courses and certifications, worked as a developer on it, was one of the product managers, and had a bit of an architectural role as well. Some features I am proud of, some got implemented despite my arguments against them, and for one or two you might want to curse me.

Meanwhile, the world has moved on, and it would be a good time to carry BODS into this new world as well.

Installation

My first change would be the installation of BODS. The biggest mistake we made was the BusinessObjects integration. The argument back then was that the same people using BO would also be using BODS, but I never bought into that. In BO you maintain the permissions for thousands of report users; the BODS users are just two or three, with different types of permissions. So what is the value of the integration? What are the commonalities?

Anyway, by now the question should be settled: the BO Enterprise Server is no longer the center of all future development at SAP, is it? Furthermore, the integration was never finished; for example, central repo users are still not maintained in BOE but in the BODS admin web pages.

Removing BOE is pretty simple. The Designer application connects to the repository database using the database credentials, just as it did before the BOE integration.

Two features only the BOE integration provides are read-only repos and quick repo switching. These would need to be supported via a direct database login as well: one database login that can use different repos, with full permissions on some and read-only permissions on others. Such functionality exists already; it is called database security. The user logs in with his personal database login and is presented with a list of all database schemas he can see that contain an AL_LANG table. These are obviously all the repositories this user has access to. If he has read and write permissions on this table, he can use the repo to design new objects; with select permissions only, it is a read-only repo.
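The idea above can be sketched in a few lines of Python. This is a minimal illustration, not actual product code: it assumes the list of schemas containing an AL_LANG table and their granted privileges has already been fetched from the database catalog (e.g. a privileges view), and the schema names are made up.

```python
# Sketch: derive the repo list and access level from database privileges.
# Input rows mimic what a query against the database catalog (e.g. a
# table_privileges view filtered to tables named AL_LANG) might return;
# schema names and privilege strings are illustrative.

def classify_repos(privilege_rows):
    """Map each schema containing an AL_LANG table to 'full' or 'read-only'.

    privilege_rows: iterable of (schema, privilege) tuples granted on AL_LANG.
    """
    repos = {}
    for schema, privilege in privilege_rows:
        repos.setdefault(schema, set()).add(privilege.upper())
    return {
        # Full write access on AL_LANG means a designable repo;
        # SELECT-only means a read-only repo.
        schema: "full" if {"SELECT", "INSERT", "UPDATE", "DELETE"} <= grants
        else "read-only"
        for schema, grants in repos.items()
    }

rows = [
    ("DEV_REPO", "SELECT"), ("DEV_REPO", "INSERT"),
    ("DEV_REPO", "UPDATE"), ("DEV_REPO", "DELETE"),
    ("PROD_REPO", "SELECT"),
]
print(classify_repos(rows))
# {'DEV_REPO': 'full', 'PROD_REPO': 'read-only'}
```

The point is that the repo list and the read-only flag fall out of the database's own security model; no separate BOE user administration is needed.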

Jobserver installed as container image

The jobserver and the webserver I would wrap into a downloadable container image. Then installing a new jobserver is as simple as executing a “docker run” command. Upgrades, jobserver groups, scaling, multiple jobservers on the same host computer… none of that would be an issue any longer.
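To make this concrete, here is a sketch of what such a one-command install could look like. The image name, port, and environment variables are hypothetical; no such official image exists today. The Python helper just composes the command line so the shape of the invocation is visible.

```python
# Sketch: what spinning up a containerized jobserver could look like.
# Image name, port, and environment variables are hypothetical.

def jobserver_run_command(name, repo_host, repo_db, port=3500):
    """Compose a `docker run` command line for a hypothetical jobserver image."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", f"{port}:3500",             # assumed jobserver port
        "-e", f"REPO_HOST={repo_host}",   # repository connection (assumed env vars)
        "-e", f"REPO_DB={repo_db}",
        "example/bods-jobserver:latest",  # hypothetical image
    ]

cmd = jobserver_run_command("js1", "dbhost", "DS_REPO")
print(" ".join(cmd))
```

Scaling out would then be a matter of running the same command on another host, or handing the image to any container orchestrator.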

With these two relatively simple changes to the product, the installation would be done in minutes instead of hours, would essentially never fail, and upgrades would be easy as well.

Designer UI

While the al_engine is state of the art, the Designer application looked old-fashioned even 15 years ago, not to mention how it feels today. In addition, the Designer has the following issues:

  • New transforms require code changes in lots of places. There is no good object hierarchy in the code. That is the reason why no transforms have been added in the last 10 years – too expensive.
  • In some areas the Designer screen follows the pattern: if the backend has 100 switches, the UI screen shows 100 switches. That is not user friendly.
  • Some settings the user has to make manually although they could be read from the connected systems, for example the database codepage.
  • A web-based UI would be nice if it were equally powerful as the Windows UI. If not, then better to stick with a native OS application.

The Data Services Workbench was the first attempt to revamp the underlying technology; its aim was to create a 1:1 port of the Designer in a modern Eclipse IDE. The important parts were implemented quickly, but the rest never made it into the product. Management correctly questioned the value of replacing a C++ UI with an Eclipse UI when the result is the same thing. Hence the project was stopped, and now we have both Designer and Workbench. Designer is superior in almost everything; Workbench got some goodies in the area of Data Quality transforms. Which one to use?

Together with another developer, I started a proof of concept showing how a new UI should look. We took the good parts of Designer, got rid of the annoying parts, and all customers we showed the result to were thrilled. The already efficient workflow improved by at least a factor of two with this UI. The highlights were:

  • Data-driven configuration of transforms as a secondary option. Some things can be done via data-driven UIs much more efficiently than via formulas. For example, joining two tables along their foreign key relationships by simply showing the ER diagram is easier to understand than a where/join clause spanning five lines.
  • But data-driven UIs must be just a different view of the same thing. You can still see the formulas if you like. And if you change the formula, the data-driven UI changes as well – until the point where the data-driven representation no longer makes sense; then that tab is read-only or disabled. For example, a theta join “where a.emp=b.emp and a.date between b.from and b.to” can be visualized as a line between two tables, but there is not much value in it. You cannot drag the line from one column to another and expect the formula to still make sense.
  • Automapping is available in the Designer as well, but only if the columns have the same name. Yet the situation where you have the output of a transform and an existing target table and need to align the two via a Query transform is a task everybody performs thousands of times. We implemented an automapping that calculated the probability of a mapping based on column name (source FROM_DATE to target T_FROM_DATE would get a high rank), data type (varchar(100) -> varchar(100) gets a high ranking, varchar(10) -> varchar(100) less, and int -> date a very low one), existing mapping formulas, and other available information. Then you could set a threshold: everything above it was checked as “map those”, and the user could add other columns manually.

But that was the time when Cloud started, and the project was killed. “Only web frontends are good frontends” was the mantra; nobody can be bothered to download and install an application. Granted, a web frontend would be better if user efficiency were equal, development were equally expensive, and many users were working with it. But none of these statements was true, and frankly, even today you install IDEs and use web frontends only occasionally.

The biggest mistake in that area, however, was to question the user requirements. The ETL engine has so many features and options not for the fun of it but because they are needed. We can agree that not every customer uses every transform. But drawing the conclusion that, because all customers use the Query transform, the Query transform is the only transform needed is … brave.

You get a data integration tool that allows building 80% of the data flows the user needs. Such a tool is worthless. The customer cannot say: well, at least we loaded the order header table; the line items we cannot load. Too bad, but 80% of the tables is enough, isn’t it?

No, the philosophy must be the one Data Services is built on: normal things can be done easily, complex things are possible, and for things the tool has no out-of-the-box solution for, there are workarounds.

This 80%-of-the-tables-are-enough mistake was repeated over and over: Data Services Workbench. Cloud Platform Integration – Data Services (CPI-DS). Agile Data Preparation (ADP).

What do Data Services users use today? The same old Designer.

Engine

Once the Designer issue is fixed, it would be time to look into missing transformations and functions, contradictions within the product (e.g. a dataflow with a global variable can be valid and invalid at the same time), and small enhancements of some transformations (the History Preserving transform comes to mind).

A larger topic would be Change Data Capture (CDC). Instead of using database-specific objects to read changes, or not providing a CDC solution at all, CDC should be standardized. In other words, building delta dataflows should be easy. In fact, most of the time there should be only a single dataflow, and the optimizer knows what to do for the initial load and for the delta load.
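A minimal sketch of the single-dataflow idea: the same logical dataflow is executed in two modes, and the "optimizer" decides whether to read the full source table or only the rows changed since the last run. The table, the change-tracking column, and the timestamp-based delta mechanism are all assumptions for illustration.

```python
# Sketch: one dataflow, two execution modes. A real optimizer would pick
# the mechanism (timestamp column, log-based CDC, ...) per source; here a
# change-timestamp column is assumed and all names are illustrative.

def build_source_query(table, change_column, mode, last_run_ts=None):
    """Return the source SQL for the same logical dataflow in either mode."""
    base = f"SELECT * FROM {table}"
    if mode == "initial":
        return base                     # full extraction
    if mode == "delta":
        if last_run_ts is None:
            raise ValueError("delta load requires the last run timestamp")
        # Read only rows changed since the previous run.
        return f"{base} WHERE {change_column} > '{last_run_ts}'"
    raise ValueError(f"unknown mode: {mode}")

print(build_source_query("ORDERS", "CHANGE_TS", "initial"))
print(build_source_query("ORDERS", "CHANGE_TS", "delta", "2024-01-01 00:00:00"))
```

The developer designs the dataflow once; only the run mode differs between the initial and every subsequent load.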

But more about those topics later, in case I get lively feedback in the comments of this post.