Lecture: Database Systems - Data Storage & Indexing Structures for Files
DATABASE SYSTEMS
Nguyen Ngoc Thien An
DATA STORAGE &
INDEXING STRUCTURES FOR FILES
Spring 2014
2
Physical Database Design
3
The process of physical database design
involves choosing the particular data
organization techniques that best suit the
given application requirements.
The techniques used to store large amounts of
structured data on disk are important for
database designers, the DBA, and
implementers of a DBMS.
Contents
4
Data Storage
Storage Hierarchy
Storage of Databases
RAID Technology
Storage Area Networks
Indexing Structures for Files
Indexes
Single-Level Ordered Indexes
Multi-Level Indexes
Dynamic Multilevel Indexes (B-tree & B+-tree)
Reading Suggestion: [1] Chapter 17, 18
Storage Hierarchy
5
Primary Storage: Registers, Cache Memories (SRAM), Main Memories (DRAM)
- Can be operated on directly by the computer’s CPU.
- Volatile.
- Faster access.
- Smaller capacity.
- More expensive.
Secondary Storage: Hard-disk Drives (HDD), Solid-state Drives (SSD)
Tertiary Storage: Optical Disks, Tape Jukeboxes, Magnetic Tapes, Tape Libraries
Secondary and tertiary storage:
- Cannot be processed directly by the CPU. Data must first be copied into primary storage and then processed by the CPU.
- Non-volatile.
- Slower access.
- Larger capacity.
- Cheaper.
Contents
7
Storage of Databases
Introduction
Magnetic Disks
Magnetic Tapes
Records & Files
Record Blocking
Typical Operations on Files
Record Organization of Files
Unordered Files
Ordered Files
Hash Files
Static Hashing
Hashing for Dynamic Files
Storage of Databases (1)
8
Most databases are large and stored permanently.
Databases usually reside on secondary and
tertiary storage because:
Databases are too large to fit entirely in main memory.
Volatility of primary storage devices.
The high cost of primary storage devices per unit of
data.
Storage of Databases (2)
9
Databases are stored physically as files of records.
Each record is a collection of data values.
The values can be interpreted as facts about entities, their
attributes, and their relationships.
Objective: locating records efficiently when they are
needed.
Portions of the database are read into and written
from buffers in main memory as needed.
File organizations:
Primary organization: the physical arrangement of file
records on the disk.
Secondary organization (or auxiliary access structure):
allows efficient access to file records based on fields other
than those used for the primary file organization (e.g.,
indexes).
Storage of Databases (3)
10
Magnetic disks
For online databases (accessed and processed
frequently).
Can be accessed directly at any time.
Magnetic tapes
For offline databases (backing up).
Lower cost than magnetic disks.
Access is quite slow: a tape must first be loaded by an automatic
loading device before the data
becomes available.
Magnetic Disks (1)
12
Magnetic disk
Shaped as a thin circular disk.
Made of magnetic material.
Data stored as magnetized areas on surfaces.
Single-sided vs. Double-sided.
Protected by a plastic or acrylic cover.
Disk pack
Contain several magnetic disks connected to a rotating spindle.
Mounted in the disk drive.
A read-write head
For each surface.
Attached to a mechanical arm.
All arms are connected to an actuator.
A disk is a random access addressable device.
Magnetic Disks (2)
13
14
Magnetic Disks (3)
15
Disks are divided into tracks on each disk surface.
Tracks: concentric circles of small width.
Cylinder: the set of tracks with the same diameter on the various
surfaces.
Data stored on one cylinder can be retrieved much faster than if it
were distributed among different cylinders.
A track is divided into smaller blocks or sectors.
The division into sectors:
Hard-coded (cannot be changed).
Several types of sector organization exist.
The division into blocks:
A track is divided into equal-sized blocks.
Set by the OS during disk formatting.
The block size B is fixed for each system.
Typical block sizes: from 512 bytes to 8192 bytes.
Transfer of data between main memory and disk takes place in units
of disk blocks.
Magnetic Disks (4)
16
Magnetic Disks (5)
17
To locate and transfer an arbitrary block, given its
address:
Position the read/write head on the correct track (seek
time).
Wait for disk rotation to move the block under the read/write head
(rotational delay).
Transfer the block (block transfer time).
The total time = seek time + rotational delay + block
transfer time.
The seek time and rotational delay are usually much larger
than the block transfer time.
Double buffering can be used to speed up the transfer of
contiguous disk blocks.
A physical disk block (hardware) address consists of:
a cylinder number.
a track number or surface number (within the cylinder).
block number (within track).
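The access-time formula on this slide can be sketched numerically. The drive parameters below (seek time, rotation speed, transfer rate) are hypothetical values chosen only for illustration, and the average rotational delay is assumed to be half a revolution.

```python
def block_access_time_ms(seek_ms, rpm, block_bytes, transfer_bytes_per_ms):
    """Estimate the time (in ms) to locate and transfer one arbitrary block."""
    rotation_ms = 60_000 / rpm               # time for one full revolution
    rotational_delay_ms = rotation_ms / 2    # assumption: half a revolution on average
    transfer_ms = block_bytes / transfer_bytes_per_ms
    return seek_ms + rotational_delay_ms + transfer_ms

# Hypothetical drive: 8 ms seek, 7200 rpm, 4096-byte block, 100,000 bytes/ms.
total = block_access_time_ms(8.0, 7200, 4096, 100_000)
transfer_only = 4096 / 100_000
```

With these assumed figures, the transfer itself is a small fraction of the total, which is consistent with the remark that seek time and rotational delay dominate.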
18
Magnetic Tapes
20
Magnetic tapes are sequential access devices.
Data is stored on reels of high-capacity magnetic
tape.
Tape access can be slow.
Usage
Not used to store online data, except for some specialized
applications.
Used for:
Back-up database.
In case the data is lost due to a disk crash.
Excessively large database files.
Archived database (database files that are seldom used or are
outdated but required for historical record keeping).
21
Records & Files
23
Data is stored in the form of records.
Records contain fields which have values of
particular types.
E.g.: amount, date, time, age
Record type: collection of field names and their
corresponding data types.
Fields may have fixed lengths or variable lengths.
Separator characters, field lengths or field type codes
may be needed so that records can be “parsed”.
A file is a sequence of records.
A file descriptor (or file header) includes information
describing the file.
A file can be made up of fixed-length records or
variable-length records.
24
Record Blocking (1)
25
Techniques of the disk block allocation for a file may be combined:
Contiguous allocation.
Linked allocation.
Clusters / File segments / Extents (linked clusters of consecutive disk
blocks).
Indexed allocation.
File records can be unspanned or spanned:
Unspanned: no record can span two blocks.
Usually used for files of fixed-length records.
Spanned: a record can be stored in more than one block.
Usually used for files of variable-length records.
Files of variable-length records require additional information stored in each
record, such as separator characters and field types.
The blocking factor bfr: the (average) number of records per
block.
The block size may be larger or smaller than the record size.
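For fixed-length records in an unspanned organization, the blocking factor can be worked out as in the sketch below. It assumes the record size R does not exceed the block size B; the function name and the sample values are illustrative.

```python
import math

def unspanned_layout(B, R, r):
    """Return (bfr, b, unused) for r fixed-length records of R bytes in B-byte blocks."""
    bfr = B // R                 # whole records per block, no record spans two blocks
    b = math.ceil(r / bfr)       # number of file blocks needed
    unused = B - bfr * R         # bytes left unused in each full block
    return bfr, b, unused

bfr, b, unused = unspanned_layout(B=1024, R=100, r=30_000)
```

The unused space per block is the price of keeping every record inside a single block; a spanned organization would use it at the cost of records crossing block boundaries.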
Record Blocking (2)
26
Typical Operations on Files
27
OPEN: Prepare the file for access and associate with it a pointer that refers to the
current file record at each point in time.
FIND: Search for the first file record satisfying a certain condition, and
make it the current file record.
FINDNEXT: Search for the next file record (from the current record)
satisfying a certain condition, and make it the current file record.
READ: Read the current file record into a program variable.
INSERT: Insert a new record into the file and make it the current file record.
DELETE: Remove the current file record from the file.
MODIFY: Change the values of some fields of the current file record.
CLOSE: Terminate access to the file.
REORGANIZE: Reorganize the file records.
For example, the records marked deleted are physically removed from the file or
a new organization of the file records is created.
READ_ORDERED: Read the file blocks in order of a specific field.
Record Organization of Files
29
Types of record organization in a file:
Unordered file.
Ordered file.
Hash file.
Unordered Files
30
Also called a heap file or a pile file.
Insertion: new records are inserted at the end of the file, which is
very efficient.
Search for a record: use a linear search.
This reads and searches half the file blocks on average, which is
quite expensive.
Deletion: rewrite the block or use a deletion marker. Either way requires
one of the following:
Periodic file reorganization to reclaim the unused space.
Reusing the space of deleted records for new insertions.
Modifying a variable-length record may require deleting the
old record and inserting a modified record.
Reading the records in order of a particular field requires
sorting the file records.
Ordered Files
31
Also called a sequential file.
The physical order of records in a file is based on the values
of an ordering field.
Search for a record on its ordering field: use a binary search.
Accesses about log2(b) file blocks on average, where b is the number of file blocks.
Reading the records in order of the ordering field: quite
efficient.
Insertion and deletion: expensive.
Insertion: improve efficiency with one of these techniques:
Keep some unused space in each block for new records.
Keep a temporary unordered file (overflow / transaction file) for new
records, which is periodically merged with the main ordered file.
Deletion: use deletion markers and periodic reorganization.
Modification: the cost depends on the search condition and the
modified field.
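A binary search over the blocks of an ordered file can be sketched as follows. The file is simulated as a Python list of blocks, each block a sorted list of ordering-field values; this in-memory layout and the function name are assumptions made for illustration.

```python
def binary_search_blocks(blocks, key):
    """Search an ordered file block by block; return (found, block_accesses)."""
    lo, hi = 0, len(blocks) - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # reading a block counts as one block access
        accesses += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return key in block, accesses
    return False, accesses

# Simulated ordered file: 3000 blocks of 10 records each.
blocks = [list(range(i * 10, i * 10 + 10)) for i in range(3000)]
found, accesses = binary_search_blocks(blocks, 12_345)
```

For 3000 blocks the loop never reads more than 12 blocks, compared with about 1500 on average for a linear search of an unordered file of the same size.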
32
Average Access Times
33
Hash Files
35
Hashing for disk files is called External Hashing.
The target address space is made of buckets.
Each bucket holds multiple records.
A bucket is either one disk block or a cluster of contiguous disk
blocks.
A record is stored in bucket i: 𝑖 = ℎ(𝐾).
h: a hash function.
K: the hash key value of a record.
Search is very efficient on the hash key.
A collision occurs when a new record hashes to a bucket
that is already full.
Good hash function: distribute records uniformly over the
address space
Minimize collisions (decrease search time) and not leave many
unused locations.
A hash file should be kept 70 - 90% full.
Static Hashing (1)
36
A fixed number of buckets M is allocated.
The file blocks are divided into M equal-sized buckets,
numbered bucket_0, bucket_1, ..., bucket_{M-1}.
Serious drawback for dynamic files:
A fixed number of buckets M is a problem if the number of
records in the file grows or shrinks.
We may have to change M and the hash function, which is quite
time-consuming for large files.
Collision resolution:
An overflow file is kept for storing records that hash to a full bucket.
Overflow records that hash to each bucket can be
linked together.
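The scheme above can be sketched in a few lines. This is a minimal in-memory model, assuming non-negative integer keys and the hash function h(K) = K mod M; the class name, the bucket capacity, and the per-bucket overflow lists (standing in for linked overflow records) are illustrative choices.

```python
class StaticHashFile:
    def __init__(self, M=4, bucket_capacity=2):
        self.M = M                                  # fixed number of buckets
        self.capacity = bucket_capacity             # records per bucket
        self.buckets = [[] for _ in range(M)]
        self.overflow = [[] for _ in range(M)]      # overflow chain of each bucket

    def _h(self, key):
        return key % self.M                         # assumed hash function

    def insert(self, key):
        i = self._h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append(key)
        else:
            self.overflow[i].append(key)            # collision: bucket i is full

    def search(self, key):
        i = self._h(key)
        return key in self.buckets[i] or key in self.overflow[i]

f = StaticHashFile(M=4, bucket_capacity=2)
for k in (0, 4, 8, 1):
    f.insert(k)
```

Keys 0, 4 and 8 all hash to bucket 0, so the third one ends up in that bucket's overflow chain. As the file grows, these chains lengthen, which is the drawback noted for dynamic files.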
Static Hashing (2)
37
Static Hashing (3)
38
Hashing for Dynamic Files
39
Hashing techniques for expanding or shrinking
a file dynamically:
Extendible hashing: store an access structure in
addition to the file.
Dynamic hashing: use an access structure
based on binary tree data structures.
Linear hashing: does not require additional access
structures.
Extendible and dynamic hashing do not
require an overflow area.
Extendible Hashing
40
𝒅: the global depth of the directory.
Directory: an array of 2^𝑑 bucket addresses.
The first (high-order) 𝒅 bits of a hash value determine a
directory entry.
The address in each directory entry determines the bucket.
𝒅′: the local depth stored with each bucket.
Specify the number of bits on which the bucket contents
are based.
Some directory entries with the same first 𝒅’ bits for their
hash values may contain the same bucket address if all
their records fit in a single bucket.
Doubling d occurs if a bucket having 𝑑’ = 𝑑 overflows.
Halving d occurs if 𝑑 > 𝑑’ for all the buckets.
Most record retrievals require 2 block accesses - one
to the directory and the other to the bucket.
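The directory mechanics can be sketched as below. This is a simplified in-memory model, not a disk implementation: it assumes non-negative integer keys, a hypothetical 16-bit hash function h, and a small bucket capacity, and it covers insertion and search only (no deletion or directory halving).

```python
HASH_BITS = 16                                   # assumed width of a hash value

def h(key):
    return key % (1 << HASH_BITS)                # hypothetical hash function

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth           # d' for this bucket
        self.keys = []

class ExtendibleHashFile:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0                    # d
        self.directory = [Bucket(0)]             # 2^d bucket addresses

    def _dir_index(self, key):
        # The first (high-order) d bits of the hash value select the entry.
        return h(key) >> (HASH_BITS - self.global_depth)

    def search(self, key):
        return key in self.directory[self._dir_index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._dir_index(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            if bucket.local_depth == self.global_depth:
                if self.global_depth == HASH_BITS:
                    raise OverflowError("cannot split further")
                # Double the directory: each entry becomes two adjacent entries.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.global_depth += 1
            self._split(bucket)

    def _split(self, bucket):
        depth = bucket.local_depth + 1
        halves = (Bucket(depth), Bucket(depth))
        for k in bucket.keys:                    # redistribute on one more bit
            halves[(h(k) >> (HASH_BITS - depth)) & 1].keys.append(k)
        for i, b in enumerate(self.directory):
            if b is bucket:
                self.directory[i] = halves[(i >> (self.global_depth - depth)) & 1]

f = ExtendibleHashFile(bucket_capacity=2)
keys = [(i * 1579) % 65536 for i in range(40)]   # 40 distinct sample keys
for k in keys:
    f.insert(k)
```

A search touches one directory entry and one bucket, matching the two block accesses mentioned above. Several directory entries may still share a bucket whose local depth is smaller than the global depth.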
41
Dynamic Hashing
42
The eventual storage of records in buckets is
similar to extendible hashing.
Directory: use a binary tree.
Internal nodes having 2 pointers
Left pointer for the 0 bit (in the hashed address).
Right pointer for the 1 bit (in the hashed address).
Leaf nodes: hold a pointer to the actual bucket.
43
Linear Hashing (1)
44
No directory is used, but overflow area(s) are maintained for collisions.
At first:
n = 0; M buckets numbered 0, 1, ..., M − 1.
Initial hash function: h_i(K) = K mod M.
When a collision leads to an overflow record in any bucket:
Split bucket n into 2 buckets: n (original) & M + n (new).
Redistribute records of the original bucket into the 2 buckets
based on h_{i+1}(K) = K mod 2M.
n = n + 1.
Buckets are split in the linear order 0, 1, 2, 3, ....
When n = M:
All the original buckets have been split, so the records in overflow
have eventually been redistributed into regular buckets.
The file now has 2M instead of M buckets.
All buckets use h_{i+1}, and n is reset to 0.
Linear Hashing (2)
45
Generally, a sequence of hashing functions is used:
h_{i+j}(K) = K mod (2^j M), where j = 0, 1, 2, ...
Retrieve a record with hash key value K:
Apply h_i(K).
If h_i(K) < n, apply h_{i+1}(K).
The file load factor can be used to trigger splits and
combinations.
The file load factor 𝑙 = 𝑟/(𝑏𝑓𝑟 ∗ 𝑁)
𝑟: the current number of file records.
𝑏𝑓𝑟: the maximum number of records that can fit in a bucket.
𝑁: the current number of file buckets.
Split when 𝑙 is larger than a certain threshold (instead of
whenever an overflow occurs).
Recombine when 𝑙 falls below a certain threshold.
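The split-on-overflow variant of linear hashing can be sketched as follows. It is an in-memory model that assumes non-negative integer keys; overflow records are simply kept in the same Python list as their bucket, and the class name and capacity are illustrative. Load-factor-triggered splitting and recombination are not shown.

```python
class LinearHashFile:
    def __init__(self, M=4, bucket_capacity=2):
        self.M = M                     # number of buckets at the start of the round
        self.n = 0                     # next bucket to be split
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(M)]

    def _address(self, key):
        b = key % self.M               # h_i(K)
        if b < self.n:                 # bucket b was already split this round
            b = key % (2 * self.M)     # h_{i+1}(K)
        return b

    def search(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:    # an overflow record triggers a split
            self._split()

    def _split(self):
        old = self.buckets[self.n]
        self.buckets[self.n] = []
        self.buckets.append([])            # new bucket gets number M + n
        for k in old:
            self.buckets[k % (2 * self.M)].append(k)
        self.n += 1
        if self.n == self.M:               # every original bucket has been split
            self.M *= 2
            self.n = 0

f = LinearHashFile(M=4, bucket_capacity=2)
for k in range(50):
    f.insert(k)
```

Note that the bucket being split is bucket n, which is not necessarily the bucket that overflowed; the overflowing bucket keeps its extra record until its own turn comes.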
RAID Technology (1)
47
Secondary storage technology must take steps
to keep up in performance and reliability with
processor technology.
Parallelizing disk access using RAID technology
(Redundant Arrays of Independent Disks).
The main goal of RAID is to even out the
widely different rates of performance
improvement of disks against those in memory
and microprocessors.
RAID Technology (2)
48
A large array of small independent disks acts as a
single higher-performance logical disk.
Data striping:
Distribute data transparently over multiple disks to
make them appear as a single large, fast disk.
Utilize parallelism to improve disk performance.
RAID Technology (3)
49
Different RAID organizations were defined based on different
combinations of the two factors of granularity of data interleaving
(striping) and pattern used to compute redundant information.
RAID level 0: no redundant data, which gives the best write performance at the
risk of data loss.
RAID level 1 uses mirrored disks.
RAID level 2 uses memory-style redundancy by using Hamming codes
(which contain parity bits for distinct overlapping subsets of components).
Level 2 includes both error detection and correction.
RAID level 3 uses a single parity disk, relying on the disk controller to
figure out which disk has failed.
RAID levels 4 and 5 use block-level data striping, with level 5 distributing
data and parity information across all disks.
RAID level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by using
just two redundant disks.
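As an illustration of single-parity redundancy, the sketch below computes a parity block as the bytewise XOR of the data blocks and rebuilds one lost block from the survivors. XOR parity is the standard scheme for single-parity RAID levels; the block contents and the function name are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical 2-byte data blocks, one per data disk.
data = [b"\x0f\xa0", b"\x33\x5c", b"\xf0\x01"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: XOR the survivors with the parity.
recovered = xor_blocks([data[0], data[2], parity])
```

The reconstruction works only because the controller knows which disk failed; parity alone detects an inconsistency but cannot say where it is.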
50
Storage Area Networks (1)
52
Organizations need to move from a static,
fixed, data-center-oriented operation to a more
flexible and dynamic infrastructure for information
processing.
This need is addressed by Storage Area Networks (SANs).
In a SAN:
Online storage peripherals are configured as nodes
on a high-speed network and can be attached and
detached from servers in a very flexible manner.
Allows storage systems to be placed at longer
distances from the servers and provide different
performance and connectivity options.
Storage Area Networks (2)
53
Advantages of SANs:
Flexible many-to-many connectivity among servers
and storage devices using Fibre Channel hubs and
switches.
Up to 10 km separation between a server and a
storage system using appropriate fiber optic cables.
Better isolation capabilities allowing non-disruptive
addition of new peripherals and servers.
SANs face the problem of combining storage
options from multiple vendors and dealing with
evolving standards of storage management
software and hardware.
54
Book Index
56
Indexes (1)
57
Index structures
Additional auxiliary files.
Provide secondary access paths to records without
affecting their physical placement in the data file.
Enable efficient search based on the indexing fields.
An index is usually specified on one field of the file
and also can be specified on multiple fields.
Multiple indexes on different fields can be
constructed on the same file.
Indexes (2)
58
The index file usually occupies considerably
fewer disk blocks than the data file because its
entries are much smaller.
Indexes can be dense or sparse.
A dense index has an index entry for every
search key value (and hence every record) in the
data file.
A sparse (or nondense) index has index entries
for only some of the search values.
Contents
60
Single-Level Ordered Indexes
Primary Indexes
Clustering Indexes
Secondary Indexes
Single-Level Ordered Indexes
61
Form of an index: a file of entries <field value,
pointer to block/record> ordered by field value.
A binary search on the index requires fewer block
accesses than a binary search on the data file.
Types of single-level ordered indexes:
Primary index: on the ordering key field of an ordered file.
Clustering index: on the ordering field (but not key field)
of an ordered file.
Secondary index: on any nonordering field of a file.
A file can have at most one physical ordering field.
It can have at most one primary index or one clustering
index, but not both.
A data file can have several secondary indexes in
addition to its primary access method.
Primary Indexes (1)
62
A primary index is specified on the ordering key field
(primary key) of an ordered data file.
Records in the data file are physically ordered on the
ordering key field.
Index file:
An ordered file of fixed-length index entries.
One index entry <K(i), P(i)> for each data block:
K(i): the key field value of the first record in block i (called the
block anchor or anchor record).
P(i): a pointer to block i.
A primary index is a sparse index.
A similar scheme can use the last record in a block.
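A lookup through a primary index can be sketched as follows: binary-search the block anchors for the last anchor not greater than the key, then read that one data block. The tiny data file and the function names are hypothetical, and list positions stand in for block pointers.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each inner list is one block of key values.
data_blocks = [[2, 5, 8], [12, 15, 19], [23, 27, 30]]
anchors = [block[0] for block in data_blocks]    # K(i): first key in block i

def find_block(anchors, key):
    """Return the number of the only block that can hold key, or None."""
    i = bisect_right(anchors, key) - 1           # last anchor <= key
    return i if i >= 0 else None

def lookup(key):
    """Return (block_number, found) for a search on the ordering key."""
    i = find_block(anchors, key)
    if i is None:
        return None, False                       # key is below the first anchor
    return i, key in data_blocks[i]              # one data-block access
```

Because the index holds one entry per block rather than per record, a key that is absent from the index (such as 15) can still be present in the file; only reading the block settles it.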
63
Primary Indexes (2)
64
E.g.: Given the following data file:
EMPLOYEE (Name, Ssn, Address, Job, Sal,...)
Suppose that:
Record size: 𝑅 = 100 𝑏𝑦𝑡𝑒𝑠.
Block size: 𝐵 = 1024 𝑏𝑦𝑡𝑒𝑠.
Number of records: 𝑟 = 30,000 𝑟𝑒𝑐𝑜𝑟𝑑𝑠.
Then, we get:
Blocking factor: bfr = ⌊B/R⌋ = ⌊1024/100⌋ = 10 records/block.
Number of file blocks: b = ⌈r/bfr⌉ = ⌈30000/10⌉ = 3000 blocks.
For a primary index on the ordering key field SSN, assume
the field size V_SSN = 9 bytes and the block pointer size PR =
6 bytes.
Primary Indexes (3)
65
Then:
Index entry size: R_i = V_SSN + PR = 9 + 6 = 15 bytes.
Index blocking factor: bfr_i = ⌊B/R_i⌋ = ⌊1024/15⌋ = 68 entries/block.
Number of index blocks: b_i = ⌈r_i/bfr_i⌉ = ⌈b/bfr_i⌉ = ⌈3000/68⌉ =
45 blocks (one index entry per data block, so r_i = b).
Binary search on the index file:
⌈log2(b_i)⌉ = ⌈log2(45)⌉ = 6 block accesses.
To search for a record using the index, we need one additional
block access to the data file:
6 + 1 = 7 block accesses.
Compare with the cost of binary search on the
ordering key field without using the index:
⌈log2(b)⌉ = ⌈log2(3000)⌉ = 12 block accesses.
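The arithmetic of this example can be checked with a short sketch. The function name is illustrative; it takes the number of index entries as a parameter, so it applies to any single-level ordered index searched by binary search (for a primary index, the number of entries equals the number of data blocks).

```python
import math

def single_level_index_cost(B, entry_size, entries):
    """Return (bfr_i, b_i, block accesses to fetch one record via the index)."""
    bfr_i = B // entry_size                      # index entries per block
    b_i = math.ceil(entries / bfr_i)             # index blocks
    accesses = math.ceil(math.log2(b_i)) + 1     # binary search + one data block
    return bfr_i, b_i, accesses

# Primary index from the example: 9-byte key, 6-byte pointer, 3000 data blocks.
primary = single_level_index_cost(B=1024, entry_size=9 + 6, entries=3000)

# Binary search directly on the 3000-block data file, for comparison.
without_index = math.ceil(math.log2(3000))
```

Calling the same function with one entry per record instead of one per block shows how much larger a dense index over the same file would be.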
Clustering Indexes
67
A clustering index is defined on the ordering nonkey
field (clustering field) of an ordered data file.
The file records are physically ordered on the clustering
field.
A nonkey field does not have a distinct value for each
record.
The index includes one index entry for each distinct
value of the clustering field.
Each index entry contains the clustering value and a
pointer to the first data block having records of that
clustering value.
A clustering index is a sparse index.
68
69
Secondary Indexes (1)
71
A secondary index provides a secondary means
of accessing a file for which some primary access
already exists.
The data file records could be ordered, unordered,
or hashed.
The secondary index is specified on a nonordering
field, which may be a candidate key or a nonkey field.
Many secondary indexes can be created for the
same file.
The index is an ordered file with two fields.
Secondary Indexes (2)
72
A secondary index on a nonordering key field:
One index entry for each record in the data file.
Include the value of the field and a pointer either
to the block storing the record or to the record
itself.
A dense index.
The improvement in search time for an
arbitrary record is much greater for a
secondary index than for a primary index.
73
Secondary Indexes (3)
74
E.g.: (continuing the example from Primary Indexes)
We have:
Record size: 𝑅 = 100 𝑏𝑦𝑡𝑒𝑠.
Block size: 𝐵 = 1024 𝑏𝑦𝑡𝑒𝑠.
Number of records: 𝑟 = 30,000 𝑟𝑒𝑐𝑜𝑟𝑑𝑠.
Blocking factor: 𝑏𝑓𝑟 = 10 𝑟𝑒𝑐𝑜𝑟𝑑𝑠/𝑏𝑙𝑜𝑐𝑘.
Number of file blocks: 𝑏 = 3000 𝑏𝑙𝑜𝑐𝑘𝑠.
Primary index on the ordering key field 𝑉1:
Ordering key field size: 𝑉1 = 9 𝑏𝑦𝑡𝑒𝑠.
Block pointer size: 𝑃𝑅 = 6 𝑏𝑦𝑡𝑒𝑠.
Search on 𝑉1 using the index: 7 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.
Binary search on 𝑉1: 𝑙𝑜𝑔2(𝑏) = 12 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.
Secondary Indexes (4)
75
Secondary index on the nonordering key field V2:
Nonordering key field size: V2 = 9 bytes.
Block pointer size: PR = 6 bytes.
Index entry size: R_i = V2 + PR = 9 + 6 = 15 bytes.
Index blocking factor: bfr_i = ⌊B/R_i⌋ = ⌊1024/15⌋ = 68 entries/block.
Number of index blocks (dense index, so r_i = r):
b_i = ⌈r_i/bfr_i⌉ = ⌈r/bfr_i⌉ = ⌈30000/68⌉ = 442 blocks.
Binary search on the index:
⌈log2(b_i)⌉ = ⌈log2(442)⌉ = 9 block accesses.
To search for a record using the index, we need one additional
block access to the data file:
9 + 1 = 10 block accesses.
Linear search on V2 without the index: b/2 = 3000/2 = 1500 block accesses on average.
Secondary Indexes (5)
76
A secondary index on a nonordering nonkey field:
Option 1: include duplicate index entries with the
same K(i) value, one for each record. This gives a dense
index.
Option 2: keep a list of pointers in
the index entry for K(i), one pointer to each block that
contains a record with value K(i).
Option 3: keep a single entry for each index
field value. The pointer P(i) points to a disk block
of record pointers, and each record pointer points
to one of the data file records with value K(i). If some
value K(i) occurs in too many records, a cluster or
linked list of blocks is used. This gives a nondense index.
77
78
Multi-Level Indexes (1)
80
When a single-level index is an ordered file with
a distinct value for each K(i), we can create a
primary index on the index itself.
The original index file is called the first-level index
and the index to the index is called the second-
level index.
We can repeat the process, creating a third,
fourth, ..., top level until all entries of the top level
fit in one disk block.
Multi-Level Indexes (2)
81
A multi-level index can be created for any type of first-level
index (primary, secondary, clustering) as long as the first-level
index consists of more than one disk block.
The blocking factor for the index (𝑏𝑓𝑟𝑖) is called the fan-out of
the multilevel index (𝒇𝒐).
At each search step: an n-way search (n = fo).
The number of levels of a multi-level index: t = ⌈log_fo(r_1)⌉
r_1: the number of index entries in the first-level index.
Such a multi-level index is a form of search tree.
However, insertion and deletion of new index entries is a severe
problem because every level of the index is an ordered file.
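The level count can also be obtained by building the levels one at a time until the top level fits in a single block, as in the sketch below. The function name is illustrative and a fan-out of at least 2 is assumed; the inputs reuse the fan-out of 68 and the first-level entry counts from the earlier primary and secondary index examples.

```python
import math

def multilevel_index(r1, fo):
    """Return the number of blocks at each index level, first level first."""
    blocks_per_level = []
    entries = r1
    while True:
        blocks = math.ceil(entries / fo)   # blocks needed at this level
        blocks_per_level.append(blocks)
        if blocks == 1:                    # top level fits in one block
            return blocks_per_level
        entries = blocks                   # next level: one entry per block below

primary_levels = multilevel_index(r1=3000, fo=68)
secondary_levels = multilevel_index(r1=30_000, fo=68)
```

The length of each list is the number of levels t, and a search reads one block per level plus one data block, so t + 1 block accesses in total.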
82
Search Tree
84
Dynamic Multilevel Indexes (1)
85
Multilevel indexing has insertion and deletion
problems. Two remedies:
Use a dynamic multilevel index that leaves some space in
each of its blocks for inserting new entries.
Use appropriate insertion/deletion algorithms.
These data structures are variations of search trees
that allow efficient insertion and deletion of new
search values.
Most multi-level indexes use B-tree or B+-tree data
structures because of the insertion and deletion
problem.
In B-Tree and B+-Tree data structures, each node
corresponds to a disk block.
Dynamic Multilevel Indexes (2)
86
An insertion into a node that is not full is quite
efficient.
If a node is full, the insertion causes a split into
two nodes.
Splitting may propagate to other tree levels.
A deletion is quite efficient if a node does not
become less than half full.
If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes.
Dynamic Multilevel Indexes (3)
87
Main differences between B-tree and B+-tree:
In a B-tree, pointers to data records exist at all
levels of the tree.
In a B+-tree, all pointers to data records exist at
the leaf-level nodes.
Each leaf node of a B+-tree has a pointer to the
next leaf node of the tree.
A B+-tree can have fewer levels (or a higher capacity
of search values) than the corresponding B-tree.
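The two B+-tree properties listed above (data pointers only in the leaves, and leaves chained together) can be seen in a small search sketch over a hand-built tree. The node classes, the sample keys, and the convention that a key equal to a search value lies in the subtree to its left are assumptions for illustration; insertion and deletion are not shown.

```python
from bisect import bisect_left

class Leaf:
    """Leaf node: search values with their data pointers, plus a next-leaf link."""
    def __init__(self, keys, record_ptrs):
        self.keys = keys
        self.record_ptrs = record_ptrs
        self.next_leaf = None

class Internal:
    """Internal node: search values and tree pointers only, no data pointers."""
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children

def find_leaf(node, key):
    while isinstance(node, Internal):
        # Child i holds values X with keys[i-1] < X <= keys[i].
        node = node.children[bisect_left(node.keys, key)]
    return node

def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.record_ptrs[i]
    return None

def range_scan(root, low, high):
    """Collect keys in [low, high] by following the leaf-level chain."""
    leaf, result = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next_leaf
    return result

# Hand-built two-level tree with hypothetical keys and record pointers.
leaves = [Leaf([1, 5], ["r1", "r5"]), Leaf([7, 8], ["r7", "r8"]),
          Leaf([9, 12], ["r9", "r12"])]
for left, right in zip(leaves, leaves[1:]):
    left.next_leaf = right
root = Internal([5, 8], leaves)
```

Every search descends to a leaf, even when the key also appears in an internal node, and a range query needs only one descent followed by a walk along the leaf chain.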
B-tree Structures
B+-tree Structures
89
90
Example of Insertion in a B+-tree
91
Example of Deletion in a B+-tree
Q & A
92
Exercise (1)
93
Consider a disk with block size B = 512 bytes. A
block pointer is P = 6 bytes long, and a record
pointer is PR = 7 bytes long. A file has r = 30,000
EMPLOYEE records of fixed length. Each record
has the following fields: Name (30 bytes), Ssn (9
bytes), Department_code (9 bytes), Address (40
bytes), Phone (10 bytes), Birth_date (8 bytes), Sex
(1 byte), Job_code (4 bytes), and Salary (4 bytes,
real number). An additional byte is used as a
deletion marker.
Exercise (2)
94
1. Calculate the record size R in bytes.
2. Calculate the blocking factor bfr and the number of
file blocks b, assuming an unspanned organization.
3. Suppose that the file is ordered by the key field Ssn
and we want to construct a primary index on Ssn.
Calculate:
a. The index blocking factor bfr_i.
b. The number of first-level index entries and the number of
first-level index blocks.
4. Suppose we make it into a multilevel index (two levels). Calculate:
a. The total number of blocks required by the
multilevel index.
b. The number of block accesses needed to search for and
retrieve a record from the file, given its Ssn value.