8 ways to increase your query performance in SAP Data Warehouse Cloud when federating from BigQuery

Background

As many of us know by now, SAP Data Warehouse Cloud gives customers and users the ability to federate data from different sources in real time into virtual tables. These virtual tables query your remote source on the fly, letting you access the remote data for further analytics or to combine it with other sources. However, querying massive amounts of data can result in slow query times. This blog therefore focuses on key points to consider to increase query performance when federating data from BigQuery. These points cover geographic location, column selection, joins, partitions, WHERE clauses, reduction of data, and more.

It should be noted that BigQuery has its own connection support in SAP Data Warehouse Cloud for federation, making the connection creation seamless.

For this blog, I will be using the combined train and test datasets from the advertisements dataset provided by Kaggle here and here. For the revenue column in the test data, I inserted the predictions given here.

SAP Data Warehouse Cloud Performance Features

SAP Data Warehouse Cloud provides several features and tools to help you monitor your queries and increase performance. These features are listed below.

  • System Monitor. Monitor the performance of your system. You can see which tasks failed and dive deeper into the logs to view the specific statement that failed.
  • Restrict Remote Table Data Loads. Remove unnecessary columns and create filters to reduce the volume of data that is loaded into the remote table.
    • Use the Data Builder to remove unnecessary columns, define a central filter to load only the data that is needed, add new columns available in the source, or reinsert previously excluded columns.
  • Cloud Connector. Improves performance for federation and replication from SAP HANA on-premise.
  • Run in Analytical Mode. May improve view performance, particularly if a union is performed.
  • Remote Table Monitor. Create statistics for your remote tables to improve query performance.
  • Remote Query Monitor. Track queries sent to your source and analyze them.
  • OLAP hint. Activate it to improve view performance.

Additional steps to increase query performance specific to Big Query’s federation

  1. First and foremost, it is important to note that the geographic locations of your SAP Data Warehouse Cloud instance and your hyperscaler data source matter. If the two are in the same region, latency (data transfer time) is reduced, thereby increasing your query performance.
  2. BigQuery is a columnar storage database, so there are performance benefits if one queries only the columns they need. Because the database is organized by fields (columns), this reduces the number of disk reads and minimizes the amount of extra data that must be held in memory: only the relevant columns are processed in the query.
  3. As stated in BigQuery's documentation, order by and join on INT64 columns rather than STRING columns. Joining on strings slows down the query because variable-length strings must be compared; INT64 comparisons are faster, thereby increasing your query performance.

SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad

SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue

If we run these queries in BigQuery, we see (pictures below) that the slot time and bytes shuffled are reduced when we change from ordering by ad to ordering by revenue. This is because in BigQuery an INT64 takes up 8 bytes of space, whereas a STRING takes up space depending on its length. With fewer bytes to scan, the query runs faster.


Figure 1: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad in BigQuery


Figure 2: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue in BigQuery
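The same reasoning applies to joins. As a sketch (assuming a hypothetical INT64 surrogate key ad_id has been added to both tables; it does not exist in the original dataset), joining on the integer key avoids comparing variable-length strings:

```sql
-- Sketch only: ad_id is a hypothetical INT64 surrogate key
-- present in both tables.
SELECT r.adgroup, r.ad, d.product_focus, r.revenue
FROM `project_name.bq_performance.ad_revenues` AS r
JOIN `project_name.bq_performance.advertisement_data` AS d
  ON r.ad_id = d.ad_id;  -- compares fixed-size 8-byte integers
-- vs. ON r.ad = d.ad, which compares variable-length strings
```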

  4. Use partitions and clusters when creating your table. This can reduce data scans. If you're not sure what a partitioned table is, please read about it here.

Please note that these queries were run using the SQLScript language in the SQL editor.

output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;

This query took 0 seconds in BigQuery and 1 second in SAP Data Warehouse Cloud.


Figure 3: Execution time in BigQuery and SAP Data Warehouse Cloud when using a non-partitioned table

output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues_partitioned" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;

This query took 0 seconds in BigQuery and 846 ms in SAP Data Warehouse Cloud.


Figure 4: Execution time in BigQuery and SAP Data Warehouse Cloud when using a partitioned table

Here, the first query was performed on the non-partitioned table and the second query was performed on the partitioned and clustered table. The table was partitioned on clicks, with the range set from 0 to 1772 and the interval as 100, and clustered on adgroup and ad.
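For reference, a table partitioned and clustered as described above could be created in BigQuery with DDL along these lines (a sketch; the project and dataset names follow the ones used in this blog):

```sql
-- Integer-range partitioning on clicks (0–1772 in steps of 100),
-- clustered on adgroup and ad.
CREATE TABLE `project_name.bq_performance.ad_revenues_partitioned`
PARTITION BY RANGE_BUCKET(clicks, GENERATE_ARRAY(0, 1772, 100))
CLUSTER BY adgroup, ad
AS
SELECT * FROM `project_name.bq_performance.ad_revenues`;
```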

If we run this query in BigQuery, we can further analyze how the elapsed time and bytes shuffled are reduced. The first picture below shows the query details for ad_revenues and the second picture shows the query details for ad_revenues_partitioned. The elapsed time was reduced by around 70 ms and the bytes shuffled by over 200 KB.

select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues` GROUP BY adgroup, ad, revenue


Figure 5: Detailed information on execution of query on a non-partitioned table in BigQuery

select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues_partitioned` GROUP BY adgroup, ad, revenue


Figure 6: Detailed information on execution of query on a partitioned table in BigQuery

However, the real benefit of partitioning and clustering is shown by the following queries:

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;

This query took 0 seconds in BigQuery and 862 ms in SAP Data Warehouse Cloud.


Figure 7: Execution time in BigQuery and SAP Data Warehouse Cloud of query on non-partitioned table

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues_partitioned" WHERE "clicks" > 1000 LIMIT 200000;
return :output;

This query took 0 seconds in BigQuery and 779 ms in SAP Data Warehouse Cloud.


Figure 8: Execution time in BigQuery and SAP Data Warehouse Cloud of query on partitioned table

Due to the partitioning, only the records that actually needed to be read were read, thereby reducing the query time.

Again, if we analyze this query further in BigQuery, we see that the elapsed time, slot time, and records read are all reduced.

SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000


Figure 9: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000 in BigQuery

SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000


Figure 10: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000 in BigQuery

  1. According to Big Query’s documentation, it is a best practice to use bool, int, float, or date columns in the where clause. This is because operations on these data types are faster than operations on String or Byte columns.

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;

This query took 0 seconds in BigQuery and 862 ms in SAP Data Warehouse Cloud.


Figure 11: Execution time in BigQuery and SAP Data Warehouse Cloud of query with WHERE clause on int column

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "adgroup" LIKE 'adgroup 1' LIMIT 2000000;
return :output;

This query took 1 second in BigQuery and 23 seconds in SAP Data Warehouse Cloud.


Figure 12: Execution time in BigQuery and SAP Data Warehouse Cloud of query with WHERE clause on string column

Again, if we analyze on the BigQuery side, we see that the elapsed time was smaller when we used clicks in the WHERE clause, as opposed to adgroup. It is important to note that if you would like to filter on adgroup, you could create a column holding a numeric representation of the adgroups, or create a view in BigQuery that derives it automatically, to make the query run faster.
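As a sketch of the view approach (the view name is mine, and I assume the adgroup values end in a number, as in 'adgroup 1'), the trailing number can be extracted into an INT64 column and used in the WHERE clause instead of the string:

```sql
-- Sketch: expose a numeric adgroup ID derived from the string column.
CREATE VIEW `project_name.bq_performance.ad_revenues_numeric` AS
SELECT
  *,
  -- Assumes adgroup values look like 'adgroup 1'; extracts the trailing number.
  CAST(REGEXP_EXTRACT(adgroup, r'(\d+)$') AS INT64) AS adgroup_id
FROM `project_name.bq_performance.ad_revenues`;

-- Filter on the INT64 column rather than the STRING column:
SELECT adgroup, ad, clicks
FROM `project_name.bq_performance.ad_revenues_numeric`
WHERE adgroup_id = 1;
```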

SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000

This query took 161 ms in BigQuery.


Figure 13: Detailed information on execution of query with WHERE clause on int column in BigQuery

SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE adgroup LIKE 'adgroup 1'

This query took 1 second in BigQuery.


Figure 14: Detailed information on execution of query with WHERE clause on string column in BigQuery

  6. Reduce the data (if you can) before using a JOIN.

In Option 1's query here, we are telling SAP Data Warehouse Cloud to fetch all of the rows from BigQuery and then join each row with the advertisement_data table. This slows down the query considerably.

SELECT ad1."ad", ad1."product_focus", sum("revenue")
FROM "advertisement_data" as ad1
JOIN "ad_revenues" as ar on ar."ad"=ad1."ad"
GROUP BY ad1."ad",ad1."product_focus"
ORDER BY sum("revenue") DESC

In Option 2's query here, we are instead telling SAP Data Warehouse Cloud to fetch a smaller set of data (only the data we really need) before performing the join. As such, the query is significantly faster: only 75 rows are federated over instead of more than 4 million.

SELECT ad1."ad", ad1."product_focus", sumrevenues
FROM
(
(
SELECT "ad", sum("revenue") as sumrevenues
FROM "ad_revenues"
GROUP BY "ad"
ORDER BY sumrevenues DESC
) as r
JOIN "advertisement_data" AS ad1
ON ad1."ad" = r."ad"
)
ORDER BY sumrevenues DESC

You can view the time difference in the SAP Data Warehouse Cloud queries here:

Option 1's query to BigQuery took 1 minute to complete.


Figure 15: Execution time of query in SAP Data Warehouse Cloud with JOIN that does not reduce columns

Option 2's query to BigQuery took 529 ms to complete.


Figure 16: Execution time in SAP Data Warehouse Cloud of query with JOIN that reduces columns

  7. SAP Data Warehouse Cloud provides two options for replication – snapshots and real-time replication. Some sources do not support real-time replication (as is the case for BigQuery's connector). To see whether other source connectors support real-time replication, please look here.
    • Snapshot – copies the full set of data from the source object into SAP Data Warehouse Cloud.
      • Improves performance when accessing data, but the data is now stored in SAP Data Warehouse Cloud and takes up memory.
      • Replicated data is stored in a replica table.

Loading a new snapshot of the ad_revenues_partitioned table allows us to query the snapshot table instead of the remote table in BigQuery, which also increases query performance.

It took about 4 minutes to create this snapshot.


Figure 17: How to load a snapshot

To update this table, you can create a schedule, which runs snapshot replication asynchronously and recurrently in the background according to the settings you define.


Figure 18: How to create a schedule

    • Real-time replication – replicates data changes from your source asynchronously in the background, which is good for copying changes into SAP Data Warehouse Cloud in real time. However, real-time replication isn't supported for BigQuery's connection.
  8. Turn on in-memory storage for your table to move its data from disk storage into memory.

Conclusion

In summary, this blog has covered several ways to analyze and increase query performance when federating data from BigQuery into SAP Data Warehouse Cloud. We discussed the features SAP Data Warehouse Cloud provides for analyzing your queries, as well as eight ways to structure your queries to increase your federation query performance. While some of these observations may be obvious to some readers, this blog aims to highlight the importance of query structure and the additional features SAP Data Warehouse Cloud provides when federating data.

Please let me know if this blog was helpful to you!

Thank you 🙂