Index - Btree (When Should You use it)

1 - About

When you should we use a btree index.

Rule of thumb

  • if you're going to access a very small percentage of the rows in the table via the index
  • if you are going to process many rows of a table and the index can be used instead of table
If you use the index to access the table, then you will need to ensure that you are processing a small percentage of the total table. 10% selectivity is the minimum selectivity necessary for a b-tree index to be helpful

In general
B*Tree index would be placed :

  • on columns that we use frequently in the predicate of a query,
  • and we would expect :
    • some small fraction of the data from the table to be returned
    • or else the end user demands immediate feedback

3 - Oracle Example

3.1 - Explain Plan

You will read the index to get a row in the table. Here, you want to access a very small percentage of the rows in the table

[email protected]>set autotrace traceonly explain
[email protected]>select owner, status from t where owner=user;

Execution Plan
Plan hash value: 470836197

| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT            |       |     7 |   154 |     7   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |     7 |   154 |     7   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IDX |     7 |       |     3   (0)| 00:00:01 |

With a query plan like this, you should be accessing a very small percentage of this table.

The issue to look at here is the :

  • followed by the TABLE ACCESS BY INDEX ROWID.

This means that Oracle :

  • will read the index
  • and then, for the index entries, it will perform a database block read (logical or physical read) to get the row data.

3.2 - Bad example of a use of a btree index

When we read through an index to access the table, we will perform lots of scattered, random I/O. By scattered, I mean that the index will tell us to read :

  • block 1
  • then block 1,000
  • block 205
  • block 321
  • block 1
  • block 1032
  • and so on

It won't ask us to read :

  • block 1
  • then block 2
  • block 3

in a consecutive manner and the single block I/O can be very slow.

This is not the most efficient method if you are going to have to access a large percentage of the rows. It will be generally take longer to access them via B*tree than by just full scanning table if we access a too high a percentage of the rows around :

  • For a thin table (i.e., a table with a few or small columns), 2 to 3 percent or less of the rows
  • For a fat table, 20 to 25 percent of the table

For a thin table :

  • 100,000 rows in the table
  • 20 percent of the rows to read (so 20,000 rows)
  • a row is 80 byte in size
  • database with a 8kb block size

We have then :

  • 100 rows per block (8 kb / 80 byte)
  • and a table with 1,000 blocks (100 000 / 100)

From here, the math is very easy. We are going to read 20,000 rows via the index. This will mean, quite likely 20,000 TABLE ACCESS BY ROW ID operations.

We will process 20,000 table blocks to execute this query. However, there are only 1,000 blocks in the entire table !

(Even with a 800 bytes per row and 10 rows per block, we now have 10,000 blocks in table).

In this case, a full table scan will be much more efficient than using an index, as it has to touch each block once.

Any query that used this index to access the data would not be very efficient until it accesses :

  • on average less than 5 percent of the data for the 800 byte column (then we have about 5,000 blocks)
  • and even less for the 80 byte column (about ),5 percent or less)

3.3 - The Physical Organization Component in this example

How the data is organized physically on disk deeply impacts these calculations, as it materially affects how expensive (or inexpensive) index access will be.

If the table is naturally clustered in order by the primary key and you issue the query

select * from primary_key between :x and :y

an index range scan may be useful even if it accesses a large percentage of rows, simply because the database blocks that we need to read and reread will most likely cached, since the data is co-located.

Table CPU Time Logical I/O
Co-located 0.59 seconds 14,495
Disorganized 0.85 seconds 106,815
Co-located % 70% 13%

The query against a disorganized table bears out the simple math we did earlier. Here is the perfect illustration of why rules of thumb are so hard to provide :

  • in one case, using index work great
  • in the other case it doesn't

So when you have to respond “Why is it running differently between the test and production machine ?” Because they are not identical.

3.4 - To answer a query

The index contains enough information to answer the entire query - we will not have to go to the table at all. The index will be used as a thinner version of the table.

You can process 100 percent (or any percentage, in fact) of the rows via the index. You might use an index just to create a “thinner”version of the table. The following query demonstrates this concept.

[email protected]>select count(*) from t where owner = user;

Execution Plan
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT  |       |     1 |    17 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |    17 |            |          |
|*  2 |   INDEX RANGE SCAN| T_IDX |     7 |   119 |     3   (0)| 00:00:01 |

Here only the index is used to answer the query. The underlying table is never accessed, we simply scanned the index structure itself.

3.5 - How to find low selectivity indexes

Run this query to get an idea of how your indexes are set up.

  NUM_ROWS > 0

4 - Reference

Data Science
Data Analysis
Data Science
Linear Algebra Mathematics

Powered by ComboStrap