Data federation is the process of aggregating data from different sources into a virtual database. This allows data to be combined resulting in more insightful analytics and business intelligence. The key advantage with data federation is that the data isn’t duplicated into another table but rather queried on the fly from the original source into a virtual table that is used for further analytics.
However, federating massive amounts of data can become time consuming. The query time for federation is not only long-delayed due to the source’s execution of the query, but also from the data transfer (latency) time from the source to SAP Data Warehouse Cloud. Here, I will state some features SAP Data Warehouse Cloud has to monitor performance, as well as go through some steps you can take to increase your federation query performance.
For this blog, I have used the sample database as provided by AWS Redshift, and uploaded the data as shown through AWS Redshift’s Getting Started Guide. To learn how to federate data from Redshift to SAP Data Warehouse Cloud, please refer to my former colleague’s blog, Data Federation Between SAP Data Warehouse Cloud and Amazon Redshift.
SAP Data Warehouse Cloud Performance Features
SAP Data Warehouse Cloud provides several features and tools that monitor queries and streamline performance. These features are summarized below.
- System Monitor. Monitor the performance of your system. You can see what tasks failed and dive deeper into the logs and view the statement that failed specifically.
- Restrict Remote Table Data Loads. Remove unnecessary columns and create filters. This will reduce the volume of data that is loaded in the remote table.
- Use the Data Builder to remove the unnecessary columns, define a central filter to load only the data that is needed, add new columns available in the source, or reinsert previously excluded columns.
- Cloud Connector for SAP HANA on premise.
- Run in Analytical modemay improve view performance, particularly if a union is performed.
- Remote Table Monitor. Create statistics for your remote table to improve query performance.
- Remote Query Monitor. Track queries sent to your source and analyze them.
- Activate the OLAP hint to view performance through a graphical view or a sql view.
Steps to increase query performance specific to Redshift federation
1. Geographic Locations
First and foremost, it is important to note that the geographic locations of your SAP Data Warehouse Cloud instances and hyperscaler data sources matter. If the regions are the same, the latency (or data transfer) time is reduced, thereby increasing your query performance.
2. Select only the columns you need when writing your query.
When performing the following query:
SELECT "salesid", "listid", "sellerid", "buyerid", "eventid", "dateid", "qtysold", "pricepaid", "commission", "saletime" FROM "sales_redshiftblog"
As compared to this query:
SELECT "listid", "qtysold", "commission", "dateid" FROM "sales_redshiftblog"
The top query takes 1 second to complete in SAP Data Warehouse Cloud, whereas the second one takes 755 ms to complete.
In Redshift, the top query took 995 ms, and the second query took 464 ms.
As such, it is important to only query what you need, as time and space is wasted when you query columns that do not provide value to your use case.
3. Sortkeys and Distkeys
When using sortkeys and distkeys to define your table, it appears that queries run faster when these keys are in the list of columns to select.
For example, when running
SELECT "salesid", "sellerid", "buyerid", "qtysold", "pricepaid", "commission", "saletime" FROM "sales_redshiftblog"
The query takes 6 seconds to complete in SAP Data Warehouse Cloud and 6 seconds in Redshift.
However, when running
SELECT "listid","qtysold","commission","dateid" FROM "sales_redshiftblog"
The query takes 755 ms to complete in SAP Data Warehouse Cloud and 464 ms in Redshift.
This is because the distkey defines how the data will be distributed across the slices for the computing nodes and the sortkey defines how the rows are ordered within the table. By using both, the data is first distributed across the computing nodes based on the distkey provided (for the sales table here, that would be listid). Then each of the slice’s data are sorted based on the sortkey (which is dateid in this case). When we run queries that don’t utilize these keys, we are not actually taking advantage of the increase in performance that these keys provide. As such, if you notice that your queries are running slower after setting a distkey and/or sortkey, then you may need to revisit what columns should be assigned to these keys.
4. Use bool, int, float, or date columns in the where clause.
Below, I have run 2 different queries. One with the WHERE clause on eventname, and another with the WHERE clause on catid. The query with the WHERE clause on eventname took 4 seconds to return 44 rows, whereas the query with the WHERE clause on catid took 4 seconds to return 1,020 rows. Since the query time was the same between the two, but the number of rows was higher with catid, it shows that where clauses do perform better on int columns over string columns.
select "eventid","venueid","catid","dateid","eventname","starttime" from "event_redshiftblog" where eventname = 'Grease'
select "eventid","venueid","catid","dateid","eventname","starttime" from "event_redshiftblog" where catid = 6
5. Order by and join on int64 columns rather than string columns.
Joining on strings slows down the query as the strings are compared. By using int64, the comparison is faster, thereby increasing your query performance.
When running an order by query on sellerid (shown below), the query takes 783 ms.
select sellerid, username, (firstname ||' '|| lastname), city, sum(qtysold) from "sales_redshiftblog" JOIN "users_redshiftblog" ON "sales_redshiftblog".sellerid = "users_redshiftblog".userid group by sellerid, username, (firstname ||' '|| lastname), city order by sellerid desc
However, when running an order by query on city (shown below), the query takes 999 ms.
select sellerid, username, (firstname ||' '|| lastname), city, sum(qtysold) from "sales_redshiftblog" JOIN "users_redshiftblog" ON "sales_redshiftblog".sellerid = "users_redshiftblog".userid group by sellerid, username, (firstname ||' '|| lastname), city order by city desc
One thing to note is that the query pushed down to Redshift doesn’t include the order by condition. This means the query that is pushed to Redshift is the same and the order by clause takes place in SAP Data Warehouse Cloud. The first picture below refers to the order by sellerid query. The second picture refers to the order by city query. As can be seen by the SQL, the queries are the same and the order by clause is not included.
6. Use sort keys in where clause
Below, I ran two different queries. One with a where condition on the sortkey for the table (dateid), and another on some other column in the table (eventid). The query times alongside the number of rows returned indicate that using sortkeys in your where clause are beneficial in increasing your query performance. In summary sort keys define how the data in each disk is sorted. As such, when you utilize the way you have sorted such data, you are allowing the query plan to skip entire blocks of data that don’t fall within the range that is specified in the where clause. (For a refresher on the importance of sortkeys, please refer to AWS Redshift’s documentation here).
select listid, numtickets, priceperticket, totalprice from "listing_redshiftblog" where eventid = 1811
select listid, numtickets, priceperticket, totalprice from "listing_redshiftblog" where dateid = 1827
7. SAP Data Warehouse Cloud provides two options for replication – snapshots and real-time replication. Some sources do not support real-time replication. To see information on if other source connectors supports real-time replication, please look here.
- Snapshot – copies the full set of data from the source object into SAP Data Warehouse Cloud.
- Improves performance when accessing data, but data is now stored in Data Warehouse Cloud and takes up memory.
- Replicated data is stored in a replica table
- To update this table, you can create a schedule. A schedule will run snapshot replication asynchronously and recurrently in the background according to the settings defined in the schedule.
- Real-time replication – copies the full set of data from your source recurrently and asynchronously in the background. This is good for copying data changes in real-time into SAP Data Warehouse Cloud.
8. Turn on in-memory storage for your table to move data from being stored on the disk to the memory.
At the same time, we shouldn’t forget the general guidelines for using Redshift. Always design your tables in the most efficient way for your queries (source). Remember, Redshift distributes the rows of the table to each of the node slices (that make up the cluster) according to the table’s distribution style. This means that data is moved when a query runs. So, by accurately defining your data distribution style, you are minimizing data movement during query runtime and effectively reducing a substantial portion in cost for your query plan. In addition, data redistribution also increases your network traffic and can affect other database operations, making it even more important to correctly define your table’s distribution style to lessen these impacts.
Also remember, Redshift stores data in sorted order according to a table’s sorted keys. You can define such keys when you create a table. When you use sorted keys on a table, the query planner uses it to construct plans that utilize the way that the data is sorted. An example given by Redshift to take advantage of sorted keys is to sort your data by date if you anticipate the majority of your queries being based on date ranges. Then when you query this table, if you query a subset of date ranges, you are allowing the planner and processor to skip over the blocks that don’t contain the data specified in the range.
Check out these additional Redshift resources for additional guidelines:
- Ways to analyze and improve your queries on Redshift’s side
- Redshift has created their own documentation on the best practices for designing queries. To view these practices, please refer to this link.
- Query tuning in Redshift
- Factors affecting query performance
In summary, this blog has focused on several ways you can analyze and increase your query performance when federating data from Redshift to SAP Data Warehouse Cloud. We discussed the features provided in SAP Data Warehouse Cloud that help you further analyze your queries in it, as well as 8 different ways to structure your queries to increase your federation query performance. While some of these observations and statements may be obvious to some, this blog aims to highlight the importance of query structures and additional features that SAP Data Warehouse Cloud provides when federating data.
I hope this blog was helpful to you! If it was, please click the like button and leave a comment below. And if you have any feedback about the blog, please feel free to add your feedback in the comments below as well.
Thank you 🙂