The SAP HANA Cloud, data lake was released in early 2020. However, the technology has been in use for quite some time. If you are familiar with SAP IQ, then you already know quite a bit about the SAP HANA Cloud, data lake as it is based on SAP IQ.
While most of the concepts of SAP IQ transfer to the SAP HANA Cloud, data lake, not all do. This blog will focus on the indexing strategies that are available in the SAP HANA Cloud, data lake (SAP IQ) and when to use them.
As SAP IQ is the underlying technology for SAP HANA Cloud, data lake, SAP IQ will be used when describing the technical features.
When creating a table, a default index is placed on every column. This index is referred to as a Fast Projection (FP) index and has built in compression. NBit is the compression scheme employed. It uses <n> bits to index the dictionary where the data is stored and is enabled on every column except LOB and BIT column types.
Besides compression, the FP index also employs a feature called zone maps that can improve performance when executing queries.
A zone map, also known as a storage index, data skipping, or constraint exclusion, optimizes scan processing by ignoring complete pages of values if the server can guarantee that there is no value of interest to the predicate(s) being searched on that page. A zone map is a data structure containing the min/max values of each page of column data, and is stored as part of the FP index.
A zone map is most effective if no other index (besides an FP index) exists on a column or the values in a column are arranged in a sequential order (for example, by transaction date). A zone map is less effective for tables with few rows, or for columns that have a uniform distribution of values across all pages. Zone maps require the values in the predicate(s) be fixed; not derived from a function, and supports four classes of predicates: equality (=), inequality (<, >, !=), between, and not between.
In addition to the basic FP indexes, SAP IQ supports these indexes:
- High Group (HG) – used for join columns and searchable columns whose data has more than 2,000 distinct values
- Word (WD) – Use for whole word searching of character based data
- Text (TEXT)– Use for full text searching of character based data
- Date (DATE) – Use for fields that have a DATE data type
- Time (TIME) – Use for fields that have a TIME data type
- Datetime (DTTM) – Use for fields that have a DATETIME/TIMESTAMP data type
- Compare (CMP) – Indexes the relationship (<, >, =) between any two distinct columns with identical data types, precision, and scale
The following indexes are included only for the reference of customers that have migrated from SAP IQ. Both of these indexes have been deprecated and should be replaced by the HG index as it offers better performance.
- Low Fast (LF) – used for join columns and searchable columns whose data has fewer than 2,000 distinct values
- High Non Group (HNG) – used for range searches (except for date, time, and datetime data types) and for single column aggregations
In most relational database engines, the optimizer chooses the best single index for a given table in the query. SAP IQ takes the optimization process to the next level by looking not only at each column in the query, but at all joins and search arguments.
For example, take this query:
select a, b, c from table_a, table_b where table_a.date_field = table_b.date_field and table_a.date_field between ‘2019-01-01’ and ‘2019-02-01’ and month( table_a.date_field ) = 1
The optimizer could choose up to three different indexes for the date_field column on table_a. First, an index for the join to table_b would be needed. Second, an index for the range search in the BETWEEN statement would be needed. Lastly, the date function MONTH() can also use an index for the comparison to 1 (January).
Given the above index types and their use, the optimizer would prefer to use the HG index on the date field for the join. It would also prefer to use the DATE index to search on the month datepart as well as the BETWEEN range search.
In short, the optimizer will look at each join, where clause, aggregation, etc. to determine the best index (including zone maps) for each operation. Not just one index per table and not just one index per column. But rather, one index for each use of each column in the query!
The short answer is an emphatic NO! As mentioned previously, typical relational database engines pick the best single index to access data in a table. Second, when adding indexes to most database engines the load times tend to increase due to how data is processed for the additional indexes. Third, most database engines store data and indexes in different structures so that as more indexes are added the table size swells. All of these lead to trying to minimize the total number of indexes on a table in most engines.
SAP IQ is not prone to these limitations. The optimizer will pick as many indexes as needed to process the known set of queries or expected use of the data. Indexes in SAP IQ are not as large as the data being indexed. Each of the indexes employs various levels of vectors, bitmaps, data representations, and compression to dramatically decrease the size. Lastly, when loading data, SAP IQ can consume all CPU resources on the machine that is processing the load.
These features lead to an engine that can handle as many indexes as necessary for all query (canned or ad hoc) activity. It is quite common to have 30-60% of all columns indexed with the advanced indexes (in some cases, multiple indexes per column).
Now that we have some basics about indexes, the big question is when to use them. Here are some simple guidelines that will help cover every situation. The general idea is to ask ourselves questions about how the data will be used to search (or join). The answers to the questions will dictate whether or not indexes are needed and which one(s) are best.
As the technology has matured, the way that indexes are used and how they are thought of has changed. Historically, we would simply index any field in the database that would be used for searching or aggregation. With modern systems growing into the hundreds of terabytes and petabyte scale, this leads to significantly larger systems. A 10% growth in database size due to additional indexes is significant when dealing with very large systems. Additionally, the optimizer and default storage structures became significantly smarter and allowed for data to be unindexed and still perform as if it were.
As a result, the idea of when to use indexes became significantly easier and more refined. I prefer to focus on only those columns that I feel will be used more than 50% to 75% of the time, preferably only those columns used to search or join in more than 75% of the queries. I do this to reduce the overhead for infrequently used columns and indexes. Additionally, with the NBit and zone map features, not having an additional index on a column is not as detrimental as the default storage does offer level of indexing for all columns.
Secondly, if the column has a data type of DATE, TIME, TIMESTAMP, or DATETIME, we need to add a DATE, TIME, or DTTM index in addition to an HG index. In my opinion, all date/time/datetime fields should have an HG index and one of the date indexes as systems quite frequently use date fields for searching, filtering, and aggregations.
If we simply stopped here, well over 90% of all use cases would be covered.
The remaining indexes have specific use cases: WD, TEXT, and CMP indexes. To cover these indexes and their use, consider these questions:
- Will whole word searching (word containment) be done on the character data?
- NO: Ignore the column
- YES: Use WD (WORD) index. No HG index is needed if word searching is the only use of this column.
- Will text searching be done on the character data?
- NO: Ignore the column
- YES: Use a TEXT index. No HG index is needed if text searching is the only use of this column.
- Will this column be compared to another column in the same table?
- NO: Ignore the column
- YES: Use a CMP (compare) index that includes both columns
To provide a bit more on the rationale of the indexing scheme described, we have to consider that there are two ways of classifying the indexes in SAP IQ: those that have detailed statistics and those that have minimal statistics. The HG, TEXT, and WD indexes contain detailed statistics. The CMP, DATE, TIME, and DTTM indexes contain minimal statistics.
The optimizer is greatly helped by having detailed statistics to make any decisions. Consequently, the HG index should be on any column that is used a majority of the time (50-75%). If we consider the previous example of a column that is a DATE data type, using a DATE index is appropriate. However, the DATE index does not have the detailed statistics for the optimizer, so adding an HG index is strongly encouraged. In short, any time a column is used in the WHERE clause, we want that column to have an HG index on it to provide detailed statistics for the optimizer.
Of course, the caveat to this would be the WD and TEXT indexes as they contain the statistics necessary for the optimizer. Additionally, these fields are generally wide text fields with a significant percentage of unique values. Having an HG index on those fields would be expensive in terms of space and time to load the data. If using a WD or TEXT index, do not use an HG index unless the field unless it is used in other types of searching.
The process to choose indexes may seem daunting especially if the data model has a significant number of tables and columns. The process outlined above assumes that very little is known about the application and its use of the data. This is, generally, not the case as most DBAs and data modelers do know more about the data and its use.
Should you know more about the model and data, the process can be simplified. I prefer to look at columns that are known to be used in WHERE clauses of most queries and reports. Indexes on join and search criteria are the single biggest factor in performance, so focusing solely on those will allow IQ to be indexed for nearly every situation.
Certainly some columns and indexes will be missed with this approach. However, since the focus was on columns that are heavily used, the impact of missing a few minor, or less frequently used, columns would be relatively low.
The underlying SAP IQ technology used the SAP HANA Cloud, data lake is quite forgiving when it comes to indexes. We don’t need to be completely accurate to begin with. We can simply focus in on those columns we know will be used a majority of the time. This approach allows for minimal effort to achieve a near maximum outcome of a finely tuned data lake.