Elmasri/Navathe (3rd Edition): Chapter 18
Connolly and Begg (3rd Edition): Chapter 20
A query is processed in four general steps: decomposition, optimization, code generation, and execution.

The examples that follow are based on the following schema:

The left-hand tree is evaluated in steps as follows:
The right-hand tree is evaluated in steps as follows:
1. Original query tree
2. Use Rule 1 to break up cascading selections
3. Commute selection with cross product
4. Combine cross product and selection to form joins
Note: The following notes are based upon materials presented in the Connolly/Begg 3rd edition. The notations differ between textbooks.
Variable definitions (Connolly/Begg notation):

- Number of records (tuples) in a table (relation): nTuples(R)
- Record (tuple) size in bytes: RecordSize(R)
- Blocking factor (records per block):
  Spanned: bFactor(R) = BlockSize / RecordSize(R)
  Unspanned: bFactor(R) = BlockSize / RecordSize(R), remembering to apply a floor function to the result
- Number of blocks required to store the table: nBlocks(R) = nTuples(R) / bFactor(R), remembering to apply a ceiling function to the result
- Number of levels of an index: nLevels_{A}(I)
- Number of blocks required for the first index level: nLfBlocks_{A}(I)
- Number of distinct values of an attribute: nDistinct_{A}(R)
- Selection cardinality (average number of records satisfying an equality condition, assuming a uniform distribution of distinct values):
  For a nonkey: SC_{A}(R) = nTuples(R) / nDistinct_{A}(R)
  For a key: SC_{A}(R) = 1
- Selection cardinality for nonequality conditions (assuming a uniform distribution of values):
  For inequality (A > c): SC_{A}(R) = nTuples(R) * ((max_{A}(R) - c) / (max_{A}(R) - min_{A}(R)))
  For inequality (A < c), according to the textbook: SC_{A}(R) = nTuples(R) * ((c - max_{A}(R)) / (max_{A}(R) - min_{A}(R)))
  Perhaps this is better: SC_{A}(R) = nTuples(R) * ((c - min_{A}(R)) / (max_{A}(R) - min_{A}(R)))
  For conjunctive conditions (A ∧ B): SC_{A∧B}(R) = SC_{A}(R) * SC_{B}(R) / nTuples(R)
  For disjunctive conditions (A ∨ B): SC_{A∨B}(R) = SC_{A}(R) + SC_{B}(R) - (SC_{A}(R) * SC_{B}(R) / nTuples(R))
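As a quick sketch of how these formulas compose, the following Python snippet (function names are my own, not the textbook's) computes the unspanned blocking factor, block count, and equality selection cardinality:

```python
import math

# Unspanned blocking factor: whole records per block (floor).
def b_factor(block_size, record_size):
    return block_size // record_size

# Blocks needed to store the relation (ceiling).
def n_blocks(n_tuples, b_factor_r):
    return math.ceil(n_tuples / b_factor_r)

# Selection cardinality for an equality condition, assuming a
# uniform distribution of the distinct values.
def sc_equality(n_tuples, n_distinct, is_key=False):
    return 1 if is_key else math.ceil(n_tuples / n_distinct)

# Hypothetical example: 10,000 records of 40 bytes in 200-byte blocks.
print(b_factor(200, 40))        # 5 records per block
print(n_blocks(10000, 5))       # 2000 blocks
print(sc_equality(10000, 125))  # 80 matching records per value
```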
Selection conditions and their cost estimates (Connolly/Begg notation):

- Nonequality condition, or equality on a nonkey (full table scan): nBlocks(R)
- Equality condition on a key in unordered data (linear search): nBlocks(R) / 2
- Equality condition on a key in ordered data (binary search): log_{2} nBlocks(R)
- Equality condition on a key using an index: nLevels_{A}(I) + 1
- Equality condition on a nonkey, or nonequality condition, retrieving multiple records through a multilevel index (assume 1/2 of the records match the condition): nLevels_{A}(I) + nBlocks(R) / 2
- Equality condition on a nonkey using a clustering index: nLevels_{A}(I) + SC_{A}(R) / bFactor(R)
- Equality condition on a nonkey using a secondary index: nLevels_{A}(I) + SC_{A}(R), because in the worst case each matching record may reside on a different block
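These access-cost estimates can be sketched as small Python helpers (names are mine; costs are in block accesses). The example values at the end anticipate the EMPLOYEE statistics used in the next section:

```python
import math

def full_scan(n_blocks):
    return n_blocks

def linear_search_key(n_blocks):
    # On average, half the blocks are read before the key is found.
    return n_blocks // 2

def binary_search_key(n_blocks):
    return math.ceil(math.log2(n_blocks))

def index_equality_key(n_levels):
    # Walk the index, then fetch the single matching block.
    return n_levels + 1

def clustering_index_equality(n_levels, sc, b_factor):
    # Matching records are stored contiguously.
    return n_levels + math.ceil(sc / b_factor)

def secondary_index_equality(n_levels, sc):
    # Worst case: every matching record sits on a different block.
    return n_levels + sc

# EMPLOYEE example: nBlocks(E) = 2000, 4-level SSN index, 2-level DNO index.
print(full_scan(2000))                  # 2000 blocks
print(linear_search_key(2000))          # 1000 blocks
print(index_equality_key(4))            # 5 blocks
print(secondary_index_equality(2, 80))  # 82 blocks
```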
Example statistics (Connolly/Begg notation):

- The EMPLOYEE table has 10,000 records with a blocking factor of 5: nTuples(E) = 10000, bFactor(E) = 5, nBlocks(E) = 2000
- Assume a secondary index on the key attribute SSN: nLevels_{SSN}(E.SSN) = 4, SC_{SSN}(E) = 1
- Assume a secondary index on the nonkey attribute DNO: nLevels_{DNO}(E.DNO) = 2, nLfBlocks_{DNO}(E.DNO) = 4
- There are 125 distinct values of DNO: nDistinct_{DNO}(E) = 125
- The selection cardinality of DNO is: SC_{DNO}(E) = nTuples(E) / nDistinct_{DNO}(E) = 80
1. Cost of a full table scan: nBlocks(E) = 2000 blocks
2. Average cost of a linear search (no index): nBlocks(E) / 2 = 1000 blocks
3. Cost when using the secondary index on SSN: nLevels_{SSN}(E.SSN) + 1 = 5 blocks
1. Cost of a full table scan: nBlocks(E) = 2000 blocks
2. Rough estimate using the secondary index: nLevels_{DNO}(E.DNO) + SC_{DNO}(E) = 82 blocks
1. Cost of a full table scan: nBlocks(E) = 2000 blocks
2. Rough estimate using the secondary index: nLevels_{DNO}(E.DNO) + SC_{DNO}(E) = 9,680 blocks

where SC_{DNO}(E) is estimated using the formula for inequality (A > c):
SC_{DNO}(E) = nTuples(E) * ((max_{DNO}(E) - c) / (max_{DNO}(E) - min_{DNO}(E))) = 10,000 * ((125 - 5) / (125 - 1)) = 9677.419, rounded up to 9,678 records
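A quick check of this arithmetic, assuming DNO ranges uniformly over 1..125 and the condition is dno > 4 (so c = 5):

```python
import math

n_tuples, c, max_dno, min_dno = 10000, 5, 125, 1

# Inequality (A > c) selection-cardinality estimate.
sc = n_tuples * (max_dno - c) / (max_dno - min_dno)
print(sc)                 # ~9677.4
print(math.ceil(sc))      # 9678 matching records
print(2 + math.ceil(sc))  # 9680 blocks: nLevels + SC
```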
This needs to be done in two steps: σ_{salary > 30000}(σ_{dno > 4}(E)), or
TEMP ← σ_{dno > 4}(E)
σ_{salary > 30000}(TEMP)
1. Cost of two full table scans: nBlocks(E) + nBlocks(TEMP) <= 4000 blocks
2. Rough estimate using the secondary index followed by a full table scan of TEMP: (nLevels_{DNO}(E.DNO) + SC_{DNO}(E)) + nBlocks(TEMP) <= 2082 blocks
Variable definitions (Connolly/Begg notation):

- Number of blocks for each relation: nBlocks(R), nBlocks(S)
- Number of tuples for each relation: nTuples(R), nTuples(S)
- Blocking factor of the joined relation T (block size divided by the sum of the record sizes of R and S): bFactor(T)
- Size of the Cartesian product of R and S: nTuples(R) * nTuples(S) tuples, nBlocks(R) * nBlocks(S) blocks
Join cardinalities (Connolly/Begg), joining R and S to produce T = R ⋈_{R.A = S.B} S:

- If A is a key of R, then nTuples(T) <= nTuples(S)
- If B is a key of S, then nTuples(T) <= nTuples(R)
- Otherwise, a common estimate (assuming uniformly distributed values) is nTuples(T) = (nTuples(R) * nTuples(S)) / max(nDistinct_{A}(R), nDistinct_{B}(S))
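The key-based bounds can be sketched as a small helper (my own function name); the non-key fallback here simply returns the cross-product size as a loose upper bound:

```python
# Upper bound on nTuples(T) for T = R join S on R.A = S.B.
def join_cardinality_bound(n_tuples_r, n_tuples_s, a_is_key_of_r, b_is_key_of_s):
    if a_is_key_of_r and b_is_key_of_s:
        # Each tuple matches at most one partner on either side.
        return min(n_tuples_r, n_tuples_s)
    if a_is_key_of_r:
        # Each S tuple joins with at most one R tuple.
        return n_tuples_s
    if b_is_key_of_s:
        # Each R tuple joins with at most one S tuple.
        return n_tuples_r
    # No key information: fall back to the cross-product size.
    return n_tuples_r * n_tuples_s

# DNUMBER is a key of DEPARTMENT, so E join D has at most nTuples(E) rows:
print(join_cardinality_bound(10000, 125, False, True))  # 10000
```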
Join cost estimates (Connolly/Begg notation):

- Nested-loop join: nBlocks(R) + (nBlocks(R) * nBlocks(S)) + [nTuples(T) / bFactor(T)]
- Indexed join (assuming an index on S.B): nBlocks(R) + [nTuples(R) * (nLevels_{B}(I) + SC_{B}(S))] + [nTuples(T) / bFactor(T)]
- Sort-merge join (assuming the tables are already sorted on the same attributes): nBlocks(R) + nBlocks(S) + [nTuples(T) / bFactor(T)]
- Cost to sort relation R: nBlocks(R) * [log_{2} nBlocks(R)]; cost to sort relation S: nBlocks(S) * [log_{2} nBlocks(S)]
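These join-cost formulas can be sketched directly in Python (costs in block accesses, including the cost of writing the result T; function names are my own):

```python
import math

def write_result(nt_t, bf_t):
    # Blocks needed to write the joined result T.
    return math.ceil(nt_t / bf_t)

def nested_loop(nb_r, nb_s, nt_t, bf_t):
    # R is the outer relation.
    return nb_r + nb_r * nb_s + write_result(nt_t, bf_t)

def indexed_join(nb_r, nt_r, n_levels_b, sc_b, nt_t, bf_t):
    # The index on S.B is probed once per R tuple.
    return nb_r + nt_r * (n_levels_b + sc_b) + write_result(nt_t, bf_t)

def sort_merge(nb_r, nb_s, nt_t, bf_t):
    # Assumes both tables are already sorted on the join attributes.
    return nb_r + nb_s + write_result(nt_t, bf_t)

def sort_cost(nb):
    return nb * math.ceil(math.log2(nb))
```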
Example statistics (Connolly/Begg notation):

- The EMPLOYEE table has 10,000 records with a blocking factor of 5: nTuples(E) = 10000, bFactor(E) = 5, nBlocks(E) = 2000
- Assume a secondary index on the key attribute SSN: nLevels_{SSN}(E.SSN) = 4, SC_{SSN}(E) = 1
- Assume a secondary index on the nonkey attribute DNO: nLevels_{DNO}(E.DNO) = 2, nLfBlocks_{DNO}(E.DNO) = 4
- There are 125 distinct values of DNO: nDistinct_{DNO}(E) = 125
- The selection cardinality of DNO is: SC_{DNO}(E) = nTuples(E) / nDistinct_{DNO}(E) = 80
- The DEPARTMENT table has 125 records with a blocking factor of 10: nTuples(D) = 125, bFactor(D) = 10, nBlocks(D) = 13
- There is a single-level primary index on DNUMBER: nLevels_{DNUMBER}(D.DNUMBER) = 1
- There is a secondary index on MGRSSN: nLevels_{MGRSSN}(D.MGRSSN) = 2, SC_{MGRSSN}(D) = 1
- Join selection cardinality estimate (DNUMBER is a key of DEPARTMENT): nTuples(T) <= nTuples(E) = 10000
- Blocking factor for the resulting joined table: bFactor(ED) = 4
Join query:

SELECT employee.*, department.*
FROM employee, department
WHERE employee.dno = department.dnumber
Join access routines, with formulas and notation:

- Nested loop with EMPLOYEE on the outside:
  nTuples(T) = 10,000; bFactor(T) = bFactor(ED) = 4
  nBlocks(E) + (nBlocks(E) * nBlocks(D)) + [nTuples(T) / bFactor(T)]
  = 2000 + (2000 * 13) + [10000 / 4] = 30,500
- Nested loop with DEPARTMENT on the outside:
  nBlocks(D) + (nBlocks(E) * nBlocks(D)) + [nTuples(T) / bFactor(T)]
  = 13 + (2000 * 13) + [10000 / 4] = 28,513
- Index structure on DNUMBER with EMPLOYEE on the outside:
  nBlocks(E) + (nTuples(E) * (nLevels_{DNUMBER}(D.DNUMBER) + 1)) + [nTuples(T) / bFactor(T)]
  = 2000 + (10000 * 2) + [10000 / 4] = 24,500
- Index structure on DNO with DEPARTMENT on the outside:
  nBlocks(D) + (nTuples(D) * (nLevels_{DNO}(E.DNO) + SC_{DNO}(E))) + [nTuples(T) / bFactor(T)]
  = 13 + (125 * (2 + 80)) + [10000 / 4] = 12,763
- Sort-merge join:
  Sort: [nBlocks(E) * log_{2} nBlocks(E)] + [nBlocks(D) * log_{2} nBlocks(D)] = (2000 * 11) + (13 * 4) = 22,052
  Join: nBlocks(E) + nBlocks(D) + [nTuples(T) / bFactor(T)] = 2000 + 13 + [10000 / 4] = 4,513
  Total cost: 22,052 + 4,513 = 26,565
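As a sanity check, the five strategy costs above can be re-derived in a few lines of Python, using the statistics listed earlier:

```python
import math

# nBlocks(E)=2000, nBlocks(D)=13, nTuples(E)=10000, nTuples(D)=125,
# nTuples(T)=10000, bFactor(T)=4.
nb_e, nb_d, nt_e, nt_d = 2000, 13, 10000, 125
write_t = math.ceil(10000 / 4)                    # 2500 blocks for the result

nl_e_outer  = nb_e + nb_e * nb_d + write_t        # nested loop, E outside
nl_d_outer  = nb_d + nb_e * nb_d + write_t        # nested loop, D outside
idx_dnumber = nb_e + nt_e * (1 + 1) + write_t     # index on DNUMBER, E outside
idx_dno     = nb_d + nt_d * (2 + 80) + write_t    # index on DNO, D outside
merge       = (nb_e * math.ceil(math.log2(nb_e))  # sort E: 2000 * 11
             + nb_d * math.ceil(math.log2(nb_d))  # sort D: 13 * 4
             + nb_e + nb_d + write_t)             # merge phase
print(nl_e_outer, nl_d_outer, idx_dnumber, idx_dno, merge)
# 30500 28513 24500 12763 26565
```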
SELECT department.*, employee.*
FROM department, employee
WHERE department.mgrssn = employee.ssn

Assume an employee may manage at most one department, so the join cardinality is nTuples(T) = 125.

nBlocks(D) + (nBlocks(D) * nBlocks(E)) + [nTuples(T) / bFactor(T)]
= 13 + (13 * 2000) + [125 / 4]
= 26,045
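The same arithmetic, checked in Python:

```python
import math

# Nested loop with DEPARTMENT on the outside; 125 result tuples at 4 per block.
cost = 13 + (13 * 2000) + math.ceil(125 / 4)
print(cost)  # 26045
```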
Note: This schema is different from the one you have for your homework assignment
SELECT EMPLOYEE.FNAME, EMPLOYEE.LNAME, PROJECT.PNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE EMPLOYEE.SSN = WORKS_ON.ESSN
  AND WORKS_ON.PNO = PROJECT.PNUMBER
  AND WORKS_ON.HOURS > 10
  AND PROJECT.DNUM = 5
ORDER BY EMPLOYEE.LNAME;
Table statistics:

- EMPLOYEE: nTuples(E) = 8; record size 100 bytes; bFactor(E) = 2; nBlocks(E) = 4; index: nLevels_{ssn}(E.SSN) = 2; SC_{ssn}(E) = 1
- PROJECT: nTuples(P) = 6; record size 55 bytes; bFactor(P) = 3; nBlocks(P) = 2; index: nLevels_{dnum}(P.DNUM) = 1; SC_{dnum}(P) = nTuples(P) / nDistinct_{dnum}(P) = 6/3 = 2
- WORKS_ON: nTuples(W) = 16; record size 35 bytes; bFactor(W) = 5; nBlocks(W) = 4; no indexes; SC_{hours}(W) = nTuples(W) / nDistinct_{hours}(W) = 16/9 = 2
The result of this step is a temporary table we will call P1 with the following characteristics. Note that the record size remains the same; only the number of records has been reduced, according to our estimate of SC_{dnum}(P).

- P1: nTuples(P1) = 2; record size 55 bytes; bFactor(P1) = 3; nBlocks(P1) = 1

Temporary table P1 has no indexes associated with it (as is the case with all temporary tables).
The result of this step is a temporary table we will call W1 with the following characteristics. Note that the record size remains the same; only the number of records has been reduced. Our rough estimate is that 1/2 of the WORKS_ON records will have HOURS > 10.

- W1: nTuples(W1) = 8; record size 35 bytes; bFactor(W1) = 5; nBlocks(W1) = 2

Temporary table W1 has no indexes associated with it.
At this point, the two temporary tables W1 and P1 would look like the following (note that the DBMS does not actually know this at the planning stage; it is shown for illustrative purposes only):


We need to know the blocking factor of the resulting table. Since we are joining these two tables, the resulting temporary table (which we will call W1P1) will have all of the columns from both tables, so we add up the sizes of their respective records:

P1 record size is 55 bytes + W1 record size is 35 bytes, so the W1P1 record size will be 90 bytes.
Thus, given a block size of 200 bytes, bFactor(W1P1) = floor(200 / 90) = 2.
We also need to know join selection cardinality for the new joined table.
Since PROJECT.PNUMBER is the key of the PROJECT table, we can make use of the following estimation:
If PNUMBER is a key of P1 then nTuples(W1P1) <= nTuples(W1)
We therefore estimate nTuples(W1P1) = nTuples(W1) = 8.
This is an upper bound, not an exact figure.
In this step, it matters which table we put on the outside loop.
Putting P1 on the outside loop (method b) gives the best performance: Cost is 7 blocks (Step 3)
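Comparing the two nested-loop orders for this step in Python (nBlocks(W1) = 2, nBlocks(P1) = 1, and an estimated 8 result tuples at 2 per block):

```python
import math

write_t = math.ceil(8 / 2)      # 4 blocks to write the result
w1_outer = 2 + (2 * 1) + write_t    # W1 on the outside: 8 blocks
p1_outer = 1 + (1 * 2) + write_t    # P1 on the outside: 7 blocks (best)
print(w1_outer, p1_outer)  # 8 7
```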
The result of this step is a temporary table we will call W1P1 with the following characteristics:

- W1P1: nTuples(W1P1) = 8 (upper-bound estimate); record size 90 bytes; bFactor(W1P1) = 2; nBlocks(W1P1) = 4
This W1P1 table would actually look like:
ESSN       PNO  HOURS  PNAME     PNUMBER  PLOCATION  DNUM
123456789  1    32.5   ProductX  1        Bellaire   5
666884444  3    40.0   ProductZ  3        Houston    5
453453453  1    20.0   ProductX  1        Bellaire   5
453453453  2    20.0   ProductY  2        Sugarland  5
We need to know the blocking factor of the resulting table. Since we are joining these two tables, the resulting temporary table (E1W1P1) will have all of the columns from both tables, so we add up the sizes of their respective records:

W1P1 record size is 90 bytes + E record size is 100 bytes, so the E1W1P1 record size will be 190 bytes.
Thus, given a block size of 200 bytes, bFactor(E1W1P1) = floor(200 / 190) = 1.
We also need to know the join selection cardinality for the resulting table. Since EMPLOYEE.SSN is the key of the EMPLOYEE table, we can make use of the following estimation:
If SSN is a key of EMPLOYEE, then nTuples(E1W1P1) <= nTuples(W1P1).
We therefore estimate nTuples(E1W1P1) = nTuples(W1P1) = 8.
This is an upper bound, not an exact figure.
Therefore, the lowest-cost method is to use a nested loop (with either table on the outside, method a or b), which gives: cost = 28 blocks (Step 4).
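Checking that both nested-loop orders cost the same here (nBlocks(E) = 4, nBlocks(W1P1) = 4, 8 result tuples at 1 per block):

```python
import math

write_t = math.ceil(8 / 1)            # 8 blocks to write the result
e_outer = 4 + (4 * 4) + write_t       # EMPLOYEE on the outside
w1p1_outer = 4 + (4 * 4) + write_t    # W1P1 on the outside
print(e_outer, w1p1_outer)  # 28 28
```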
As before, we have a new temporary table called E1W1P1 with the following characteristics:
- E1W1P1: nTuples(E1W1P1) = 8; record size 190 bytes; bFactor(E1W1P1) = 1; nBlocks(E1W1P1) = 8
The E1W1P1 table would look like this (all of the columns in EMPLOYEE would also be in there):
ESSN       PNO  HOURS  PNAME     PNUMBER  PLOCATION  DNUM  FNAME   MINIT  LNAME    SSN        ...
123456789  1    32.5   ProductX  1        Bellaire   5     John    B      Smith    123456789
666884444  3    40.0   ProductZ  3        Houston    5     Ramesh  K      Narayan  666884444
453453453  1    20.0   ProductX  1        Bellaire   5     Joyce   A      English  453453453
453453453  2    20.0   ProductY  2        Sugarland  5     Joyce   A      English  453453453
1. Select only those records where PROJECT.DNUM = 5: 2 blocks
2. Select only those records where WORKS_ON.HOURS > 10: 4 blocks
3. Join WORKS_ON and PROJECT where WORKS_ON.PNO = PROJECT.PNUMBER: 7 blocks
4. Join EMPLOYEE and WORKS_ON where EMPLOYEE.SSN = WORKS_ON.ESSN: 28 blocks
5. Sort the final temporary table on the LNAME column: 24 blocks
6. Project the FNAME, LNAME, and PNAME columns: 0 blocks

Total estimated cost for the query plan (using the steps in this order): 65 blocks. This represents the total cost, in blocks, given this specific order of operations: the two selections, then the two joins, then the sort and the projection.
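The plan total can be verified by summing the per-step estimates; the sort step is nBlocks(E1W1P1) * ceil(log2(nBlocks(E1W1P1))) = 8 * 3 = 24 blocks:

```python
import math

# Per-step costs: select DNUM=5, select HOURS>10, join W-P, join E-WP,
# sort the 8-block result, project (no extra block reads).
steps = [2, 4, 7, 28, 8 * math.ceil(math.log2(8)), 0]
print(sum(steps))  # 65 blocks
```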
The steps above should then be repeated for other orderings of the operations. In fact, all of the possible plans (orderings of the operations) should be tried until the ordering that produces the least total cost is found.
In practice, it is generally most efficient to perform all of the simple selections first, followed by the joins in later steps.
Now look at the optimization done using the Heuristic (rule based) approach:




SELECT /*+ hint */ colA, colB, colC, ... FROM tab1, tab2, ...
SELECT /*+ FIRST_ROWS */ fname, lname, salary, dno FROM employee WHERE salary > 39000;
SELECT /*+ RULE */ fname, lname, salary, dno FROM employee WHERE salary > 39000;
Note: This material is from the MS SQL Server 7.0 Books Online.
In MS SQL Server, query hints can be added using an OPTION clause at the end of the statement:
Some of the query options available are:
Assume the following query on the Northwind example database: