Database Management Systems II - Prof. Holowczak
Zicklin School of Business - Baruch College
City University of New York
Database Management Systems II
What You'll Learn This Week
Distributed Database Architecture
Distributed Database Architecture
|Elmasri/Navathe (3rd ed.) || Connolly and Begg (3rd ed.) ||McFadden (6th ed.)
|Chapter 24 ||Chapters 22 and 23 ||Chapter 13
- Database is distributed at the DBMS level.
- In a distributed database system (DDS),
multiple Database Management Systems run on
multiple servers (sites or nodes) connected by a
- There are four main dimensions on which
DDBMS are classified:
- Data Distribution (Data fragmentation and/or replication)
- Degree of homogeneity/heterogeneity
- Degree of autonomy
- Degree of distribution transparency
Data Distribution - Data Fragmentation and Replication
- Data may be split up among the different
nodes or it may be replicated.
- Tables may be located on different nodes
connected by a network.
- Data may be split up (or fragmented) in
- Horizontal: Rows in a table are
split up across multiple nodes.
- Vertical: Columns in a table are
split across multiple nodes.
- Both vertical and horizontal.
- Splitting up data can improve performance by
reducing contention for tables.
|1001 ||Mr. Smith ||123 Lexington ||Smithville ||KY ||91232
|1002 ||Mrs. Jones ||12 Davis Ave. ||Smithville ||KY ||91232
|1003 ||Mr. Axe ||443 Grinder Ln.||Broadville ||GA ||81992
|1004 ||Mr. Builder ||661 Parker Rd.||Streetville ||GA ||81990
Customer ID Name Address City State Zip
1001 Mr. Smith 123 Lexington Smithville KY 91232
1002 Mrs. Jones 12 Davis Ave. Smithville KY 91232
Customer ID Name Address City State Zip
1003 Mr. Axe 443 Grinder Ln. Broadville GA 81992
1004 Mr. Builder 661 Parker Rd. Streetville GA 81990
1001 Mr. Smith
1002 Mrs. Jones
1003 Mr. Axe
1004 Mr. Builder
CustID Address City State Zip
1001 123 Lexington Smithville KY 91232
1002 12 Davis Ave. Smithville KY 91232
1003 443 Grinder Ln. Broadville GA 81992
1004 661 Parker Rd. Streetville GA 81990
- Data may also be replicated across
- Improve performance by moving a copy of
data closer to the users.
- Improve reliability - if one node fails,
others can continue processing the
Degree of heterogeneity/homogeneity
- Distributed databases may be made up of
homogeneous (similar) nodes or some mix
of heterogeneous (different) nodes.
- Assumption to make is that replacing existing systems is
not feasible (economically, technically, politically).
Degree of Autonomy
- Each site or node may have some degree of
that is, it can accept transactions
locally while still participating as a
node in the distributed database system.
- In the above figure of a Federated
DDBMS, each department maintains its
own local applications and databases.
Transactions are submitted directly from
the local application to the local DBMS
(the solid arrows) indicating local
- However, other applications such as the
EIS, can run federated transactions that
access data from each of the component
systems (the dashed lines).
Degree of Distribution Transparency
- The third dimension concerns the degree of
distribution transparency (schema
- Through a process called schema
integration, we can provide a unified
view of the DDBMS such that it appears as
a single schema to all applications (and
users). This would be a high degree of
- If, however, applications and users must
specify the specific locations of all
data items, then this would be considered
a low degree of schema integration.
- There are many problems encountered with
providing such an integrated schema (see
notes above on heterogeneity) including
the problem of naming.
- In general, distributed database systems can
more flexibility, higher performance and
greater levels of independence over
- However, distributed database systems are also
much harder to design and develop, control and
administrate. Security is also more difficult
- For more on how Oracle8 manages distributed
databases, look into the
Oracle8i Server Distributed Database Systems Release 8i
DDBMS Query Processing and Optimization
- Recall that in a DDBMS data can be fragmented
or replicated across several sites.
- Whereas in a centralized DBMS environment we
are mostly concerned with Access Costs, in a
DDBMS environment, we are primarily concerned
with Network costs - the cost to ship
data from one node to another to satisfy a
query (specifically in joins). Cost is given
in bytes to be transferred.
- This query will return 10,000 records (every
employee belongs to one department). Each
record will be 40 bytes long (FNAME + LNAME +
DNAME = 15 + 15 + 10 = 40).
Thus the result set will be 400,000 bytes.
- Assume cost to transfer query text between nodes can be safely ignored.
- Three alternatives:
- Copy all EMPLOYEE and DEPARTMENT records
to node 3. Perform the join and display
Total Cost = 1,000,000 + 3,500 =
- Copy all EMPLOYEE records (1,000,000
bytes) from node 1 to node 2. Perform the
join, then ship the results (400,000
bytes) to node 3.
Total cost = 1,000,000 + 400,000 =
- Copy all DEPARTMENT records (3,500) from
node 2 to node 1. Perform the join. Ship
the results from node 1 to node 3
Total cost = 3,500 + 400,000 = 403,500
- Another alternative is to implement a
Semijoin - ship around only the columns
required to perform the join.
- Taking the same example:
- Copy just the FNAME, LNAME and DNO
columns from node 1 to node 3 (cost = 34
bytes times 10,000 records = 340,000
- Copy just the DNUMBER and DNAME columns
from node 2 to node 3 (cost = 14 bytes
times 100 records = 1,400 bytes)
- Perform the join at node 3 and display
Total cost = 341,400
- Other options with semijoin are to transfer
just the columns to be joined around to the
sites, then go back and fetch the rest of the
columns to be projected in the results.
DDBMS Concurrency Control
- Assume we have replicated our data across
three nodes. How can we implement concurrency
control ? Where should the locking take place
- One copy of each data item (table, row, etc.)
is designated as the Distinguished
- This copy of the data item is where all locks
are applied. The concurrency controller should
look to this copy to determine if any locks
- Several variations of distinguished copy:
- All distinguished copies reside on the
same node: Primary Site
- Distinguished copies may reside on
different sites: Primary Copy
- There are problems with this approach:
- What if the primary site fails or if the
node where the primary copy resides
- One solution is to designate a
backup site that will
automatically take over.
- Another solution is to "elect" a new
- Another overall approach (aside from
distinguished copy), is to use a Voting
protocol: To lock a data item:
- Send a message to all nodes that maintain
a replica of this item.
- If a node can safely lock the item, then
vote "Yes", otherwise, vote "No".
- If a majority of participating nodes vote
"Yes" then the lock is granted.
- Send the results of the vote back out to
all participating sites.
- With the voting approach, nodes can fail and
recover while allowing transaction processing
DDBMS Transaction Processing
- We need mechanisms in place to ensure multiple
copies of data are kept consistent.
- Concurrency and Commit protocols must be
changed to account for replicated data.
- Recall in a centralized DB we had the notion
of a commit point. In distributed DB, we need
to consider committing a transaction that
changes data on multiple nodes.
- Distributed Commit Protocol such as Two Phase
Commit (2PC). Also called a
synchronous replication protocol.
- Phase 1: Send a message to all
nodes: "Can you commit Transaction X?"
All nodes that can commit this
transaction reply with "Yes".
- Phase 2: If all nodes reply with
"Yes", then send a "Commit" message to all
If any node replies "No", then the
transaction is aborted.
- 2PC is an example of a synchronous
- In Asynchronous replication, we take
snapshots of a master database and
the changes to other nodes on some periodic
- For example, see
Oracle8i Server Replication Release 8.1.7
File: ddb_index.html Date: Mon Dec 2 10:18:41 EST 2002
All materials Copyright, 1997-2002 Richard Holowczak