Bài giảng Database systems - Data storage & indexing structures for files

Exercise (2) 1. Calculate the record size R in bytes. 2. Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization. 3. Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate: a. The index blocking factor bfri. b. the number of first-level index entries and the number of first-level index blocks. 4. If we make it into a multilevel index (two levels). a. Calculate the total number of blocks required by the multilevel index. b. the number of block accesses needed to search for and retrieve a record from the file—given its Ssn value.

94 trang | Chia sẻ: vutrong32 | Lượt xem: 1494 | Lượt tải: 1

Bạn đang xem trước 20 trang tài liệu Bài giảng Database systems - Data storage & indexing structures for files, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

DATABASE SYSTEMS Nguyen Ngoc Thien An DATA STORAGE & INDEXING STRUCTURES FOR FILES Spring 2014 2 Physical Database Design 3  The process of physical database design involves choosing the particular data organization techniques that best suit the given application requirements.  The techniques used to store large amounts of structured data on disk are important for database designers, the DBA, and implementers of a DBMS. Contents 4  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Reading Suggestion: [1] Chapter 17, 18 Storage Hierarchy 5 Primary Storage Registers - Can be operated on directly by the computer’s CPU. - Volatile. - Faster access. - Smaller capacity. - More expensive. Cache Memories (SRAM) Main Memories (DRAM) Secondary Storage Hard-disk Drives (HDD), Solid- state Drives (SSD) - Cannot be processed directly by the CPU. First it must be copied into primary storage and then processed by the CPU. - Non-volatile. Tertiary Storage Optical Disks, Tape Jukeboxes, Magnetic Tapes, Tape Libraries - Slower access. - Larger capacity. - Cheaper. Contents 6  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Contents 7  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Storage of Databases (1) 8  Most databases are large and stored permanently.  Databases usually reside on secondary and tertiary storage because:  Databases are too large to fit entirely in main memory.  Volatility of primary storage devices.  The high cost of primary storage devices per unit of data. Storage of Databases (2) 9  Databases are stored physically as files of records.  Each record is a collection of data values.  The values can be interpreted as facts about entities, their attributes, and their relationships.  Objective: locating records efficiently when they are needed.  Portions of the database are read into and written from buffers in main memory as needed.  File organizations:  Primary organization: the physical arrangement of file records on the disk.  Secondary organization (or auxiliary access structure): allows efficient access to file records based on alternate fields than those used for the primary file organization (E.g. indexes). Storage of Databases (3) 10  Magnetic disks  For online databases (accessed and processed frequently).  Can be accessed directly at any time.  Magnetic tapes  For offline databases (backing up).  Lower cost than magnetic disks.  Access is quite slow  an automatic loading device to load a tape is needed before the data becomes available. Contents 11  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Magnetic Disks (1) 12  Magnetic disk  Shaped as a thin circular disk.  Made of magnetic material.  Data stored as magnetized areas on surfaces.  Single-sided vs. Double-sided.  Protected by a plastic or acrylic cover.  Disk pack  Contain several magnetic disks connected to a rotating spindle.  Mounted in the disk drive.  A read-write head  For each surface.  Attached to a mechanical arm.  All arms are connected to an actuator.  A disk is a random access addressable device. Magnetic Disks (2) 13 14 Magnetic Disks (3) 15  Disks are divided into tracks on each disk surface.  Tracks: concentric circles of small width.  Cylinder: include tracks with the same diameter on the various surfaces.  Data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders.  A track is divided into smaller blocks or sectors.  The division into sectors:  Hard-coded (cannot be changed).  Have many types of sector organization.  The division into blocks:  A track is divided into equal-sized blocks.  Set by the OS during disk formatting.  The block size B is fixed for each system.  Typical block sizes: from 512 bytes to 8192 bytes.  Transfer of data between main memory and disk takes place in units of disk blocks. Magnetic Disks (4) 16 Magnetic Disks (5) 17  Locate and transfer an arbitrary block, given its address:  Position the read/write head on the correct track (Seek time).  Disk rotation moves the block under the read-write head.  The total time = seek time + rotational delay + block transfer time.  The seek time and rotational delay are usually much larger than the block transfer time.  Double buffering can be used to speed up the transfer of contiguous disk blocks.  A physical disk block (hardware) address consists of:  a cylinder number.  a track number or surface number (within the cylinder).  block number (within track). 18 Contents 19  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Magnetic Tapes 20  Magnetic tapes are sequential access devices.  Data is stored on reels of high-capacity magnetic tape.  Tape access can be slow.  Usage  Not used to store online data, except for some specialized applications.  Used for:  Back-up database.  In case the data is lost due to a disk crash.  Excessively large database files.  Archived database (database files that are seldom used or are outdated but required for historical record keeping). 21 Contents 22  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Records & Files 23  Data is stored in the form of records.  Records contain fields which have values of particular types.  E.g.: amount, date, time, age  Record type: collection of field names and their corresponding data types.  Fields may have fixed lengths or variable lengths.  Separator characters, field lengths or field type codes may be needed so that records can be “parsed”.  A file is a sequence of records.  A file descriptor (or file header) includes information describing the file.  A file can be made up of fixed-length records or variable-length records. 24 Record Blocking (1) 25  Techniques of the disk block allocation for a file may be combined:  Contiguous allocation.  Linked allocation.  Clusters / File segments / Extents (linked clusters of consecutive disk blocks).  Indexed allocation.  File records can be unspanned or spanned:  Unspanned: no record can span two blocks.  Usually used for files of fixed-length records.  Spanned: a record can be stored in more than one block.  Usually used for files of variable-length records.  Files of variable-length records require additional information stored in each record, such as separator characters and field types.  The blocking factor bfr: the (average) number of records per block.  The block size may be larger or smaller than the record size. Record Blocking (2) 26 Typical Operations on Files 27  OPEN: Prepare the file for access, and associates a pointer refering to a current file record at each point in time.  FIND: Search for the first file record satisfying a certain condition, and make it the current file record.  FINDNEXT: Search for the next file record (from the current record) satisfying a certain condition, and make it the current file record.  READ: Read the current file record into a program variable.  INSERT: Insert a new record into the file and make it the current file record.  DELETE: Remove the current file record from the file.  MODIFY: Change the values of some fields of the current file record.  CLOSE: Terminate access to the file.  REORGANIZE: Reorganize the file records.  For example, the records marked deleted are physically removed from the file or a new organization of the file records is created.  READ_ORDERED: Read the file blocks in order of a specific field. Contents 28  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Record Organization of Files 29  Types of record organization in a file:  Unordered file.  Ordered file.  Hash file. Unordered Files 30  Also called a heap file or a pile file.  Insertion: New records are inserted at the end of the file  very efficient.  Search for a record: use a linear search.  Reading and searching half the file blocks on the average  quite expensive.  Deletion: rewrite the block or use deletion marker  Require one of operations:  Periodic file reorganization to reclaim the unused space.  Using the space of deleted records for inserting.  Modifying a variable-length record may require deleting the old record and inserting a modified record.  Reading the records in order of a particular field requires sorting the file records. Ordered Files 31  Also called a sequential file.  The physical order of records in a file is based on the values of an ordering field.  Search for a record on its ordering field: use a binary search.  Access log2 of the file blocks on the average.  Reading the records in order of the ordering field: quite efficient.  Insertion and deletion: expensive.  Insertion: improve efficiency by one of techniques:  Keep some unused space in each block for new records.  Keep a temporary unordered file (overflow / transaction file) for new records  periodically merged with the main ordered file.  Deletion: use deletion markers and periodic reorganization.  Modification: depend on the search condition and the modified field. 32 Average Access Times 33 Contents 34  Storage of Databases  Introduction  Magnetic Disks  Magnetic Tapes  Records & Files  Record Blocking  Typical Operations on Files  Record Organization of Files  Unordered Files  Ordered Files  Hash Files  Static Hashing  Hashing for Dynamic Files Hash Files 35  Hashing for disk files is called External Hashing.  The target address space is made of buckets.  Each bucket holds multiple records.  A bucket is either one disk block or a cluster of contiguous disk blocks.  A record is stored in bucket i: 𝑖 = ℎ(𝐾).  h: a hash function.  K: the hash key value of a record.  Search is very efficient on the hash key.  A collision occurs when a new record hashes to a bucket that is already full.  Good hash function: distribute records uniformly over the address space  Minimize collisions (decrease search time) and not leave many unused locations.  A hash file should be kept 70 - 90% full. Static Hashing (1) 36  A fixed number of buckets M is allocated.  The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ..., bucketM-1.  Serious drawback for dynamic files:  Fixed number of buckets M is a problem if the number of records in the file grows or shrinks.  May have to change M and the hashing function  quite time-consuming for large files.  Collision resolution:  An overflow file is kept for storing such records.  Overflow records that hash to each bucket can be linked together. Static Hashing (2) 37 Static Hashing (3) 38 Hashing for Dynamic Files 39  Hashing techniques for expanding or shrinking a file dynamically:  Extendible hashing: store an access structure in addition to the file.  Dynamic hashing: use an access structure based on binary tree data structures.  Linear hashing: not require additional access structures.  Extendible and dynamic hashing do not require an overflow area. Extendible Hashing 40  𝒅: the global depth of the directory.  Directory: an array of 2𝑑 bucket addresses.  The first (high-order) 𝒅 bits of a hash value determine a directory entry.  The address in each directory entry determines the bucket.  𝒅′: the local depth stored with each bucket.  Specify the number of bits on which the bucket contents are based.  Some directory entries with the same first 𝒅’ bits for their hash values may contain the same bucket address if all their records fit in a single bucket.  Doubling d occurs if a bucket having 𝑑’ = 𝑑 overflows.  Halving d occurs if 𝑑 > 𝑑’ for all the buckets.  Most record retrievals require 2 block accesses - one to the directory and the other to the bucket. 41 Dynamic Hashing 42  The eventual storage of records in buckets is similar to extendible hashing.  Directory: use a binary tree.  Internal nodes having 2 pointers  Left pointer for the 0 bit (in the hashed address).  Right pointer for the 1 bit (in the hashed address).  Leaf nodes: hold a pointer to the actual bucket. 43 Linear Hashing (1) 44  No directory but maintain overflow area(s) for collisions.  At first:  𝑛 = 0; 𝑀 buckets numbered 0, 1, , 𝑀 − 1  Initial hash function: ℎ𝑖(𝐾) = 𝐾 𝑚𝑜𝑑 𝑀  When a collision leads to an overflow record in any bucket:  Split bucket 𝑛 into 2 buckets: 𝑛 (original) & 𝑀 + 𝑛 (new).  Redistribute records of the original bucket into the 2 buckets based on ℎ𝑖+1(𝐾) = 𝐾 𝑚𝑜𝑑 2𝑀.  𝑛 = 𝑛 + 1  Buckets are split in the linear order 0, 1, 2, 3, ....  When 𝑛 = 𝑀:  All the original buckets have been split  the records in overflow are eventually redistributed into regular buckets.  The file now has 2𝑀 instead of 𝑀 buckets.  All buckets use ℎ𝑖+1  𝑛 is reset to 0. Linear Hashing (2) 45  Generally, a sequence of hashing functions is used:  ℎ𝑖+𝑗 𝐾 = 𝐾 𝑚𝑜𝑑 (2 𝑗𝑀), where 𝑗 = 0, 1, 2,  Retrieve a record with hash key value K:  Apply ℎ𝑖(𝐾).  If ℎ𝑖 𝐾 < 𝑛  apply ℎ𝑖+1(𝐾).  The file load factor can be used to trigger splits and combinations.  The file load factor 𝑙 = 𝑟/(𝑏𝑓𝑟 ∗ 𝑁)  𝑟: the current number of file records.  𝑏𝑓𝑟: the maximum number of records that can fit in a bucket.  𝑁: the current number of file buckets.  Split when 𝑙 is larger than a certain threshold (instead of whenever an overflow occurs).  Recombine when 𝑙 falls below a certain threshold. Contents 46  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) RAID Technology (1) 47  Secondary storage technology must take steps to keep up in performance and reliability with processor technology.  Parallelizing disk access using RAID technology (Redundant Arrays of Independent Disks).  The main goal of RAID is to even out the widely different rates of performance improvement of disks against those in memory and microprocessors. RAID Technology (2) 48  A large array of small independent disks acts as a single higher-performance logical disk.  Data striping:  Distribute data transparently over multiple disks to make them appear as a single large, fast disk.  Utilize parallelism to improve disk performance. RAID Technology (3) 49  Different RAID organizations were defined based on different combinations of the two factors of granularity of data interleaving (striping) and pattern used to compute redundant information.  RAID level 0: no redundant data  the best write performance at the risk of data loss.  RAID level 1 uses mirrored disks.  RAID level 2 uses memory-style redundancy by using Hamming codes (contain parity bits for distinct overlapping subsets of components). Level 2 includes both error detection and correction.  RADI level 3 uses a single parity disk relying on the disk controller to figure out which disk has failed.  RAID levels 4 and 5 use block-level data striping, with level 5 distributing data and parity information across all disks.  RAID level 6 applies the so-called P + Q redundancy scheme using Reed-Soloman codes to protect against up to two disk failures by using just two redundant disks. 50 Contents 51  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Storage Area Networks (1) 52  Organizations have a need to move from a static fixed data center oriented operation to a more flexible and dynamic infrastructure for information processing.  Storage Area Networks (SANs).  In a SAN:  Online storage peripherals are configured as nodes on a high-speed network and can be attached and detached from servers in a very flexible manner.  Allows storage systems to be placed at longer distances from the servers and provide different performance and connectivity options. Storage Area Networks (2) 53  Advantages of SANs:  Flexible many-to-many connectivity among servers and storage devices using fiber channel hubs and switches.  Up to 10km separation between a server and a storage system using appropriate fiber optic cables.  Better isolation capabilities allowing non-disruptive addition of new peripherals and servers.  SANs face the problem of combining storage options from multiple vendors and dealing with evolving standards of storage management software and hardware. 54 Contents 55  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Book Index 56 Indexes (1) 57  Index structures  Additional auxiliary files.  Provide secondary access paths to records without affecting their physical placement in the data file.  Enable efficient search based on the indexing fields.  An index is usually specified on one field of the file and also can be specified on multiple fields.  Multiple indexes on different fields can be constructed on the same file. Indexes (2) 58  The index file usually occupies considerably less disk blocks than the data file because its entries are much smaller.  Indexes can be dense or sparse.  A dense index has an index entry for every search key value (and hence every record) in the data file.  A sparse (or nondense) index has index entries for only some of the search values. Contents 59  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Contents 60  Single-Level Ordered Indexes  Primary Indexes  Clustering Indexes  Secondary Indexes Single-Level Ordered Indexes 61  Form of an index: a file of entries <field value, pointer to block/record> ordered by field value.  A binary search on the index requires fewer block accesses than a binary search on the data file.  Types of single-level ordered indexes:  Primary index: on the ordering key field of an ordered file.  Clustering index: on the ordering field (but not key field) of an ordered file.  Secondary index: on any nonordering field of a file.  A file can have at most one physical ordering field.  It can have at most one primary index or one clustering index, but not both.  A data file can have several secondary indexes in addition to its primary access method. Primary Indexes (1) 62  A primary index is specified on the ordering key field (primary key) of an ordered data file.  Records in the data file is physically ordered on the ordering key field.  Index file:  An ordered file of fixed length index entries.  One index entry for each data block:  𝐾 𝑖 : the key field value of the first record in block 𝑖 (called the block anchor or anchor record).  𝑃(𝑖): a pointer to block 𝑖.  A primary index is a sparse index.  A similar scheme can use the last record in a block. 63 Primary Indexes (2) 64  E.g.: Given the following data file: EMPLOYEE (Name, Ssn, Address, Job, Sal,...)  Suppose that:  Record size: 𝑅 = 100 𝑏𝑦𝑡𝑒𝑠.  Block size: 𝐵 = 1024 𝑏𝑦𝑡𝑒𝑠.  Number of records: 𝑟 = 30,000 𝑟𝑒𝑐𝑜𝑟𝑑𝑠.  Then, we get:  Blocking factor: 𝑏𝑓𝑟 = 𝐵/𝑅 = 1024/100 = 10 𝑟𝑒𝑐𝑜𝑟𝑑𝑠/𝑏𝑙𝑜𝑐𝑘.  Number of file blocks: 𝑏 = 𝑟/𝑏𝑓𝑟 = 30000/10 = 3000 𝑏𝑙𝑜𝑐𝑘𝑠.  For a primary index on the ordering key field SSN, assume the field size 𝑉𝑆𝑆𝑁 = 9 𝑏𝑦𝑡𝑒𝑠 , the block pointer size 𝑃𝑅 = 6 𝑏𝑦𝑡𝑒𝑠. Primary Indexes (3) 65  Then:  Index entry: 𝑅𝑖 = (𝑉𝑆𝑆𝑁 + 𝑃𝑅) = (9 + 6) = 15 𝑏𝑦𝑡𝑒𝑠.  Index blocking factor: 𝑏𝑓𝑟𝑖 = 𝐵/𝑅𝑖 = 1024/15 = 68 𝑒𝑛𝑡𝑟𝑖𝑒𝑠/ 𝑏𝑙𝑜𝑐𝑘.  Number of index blocks: 𝑏𝑖 = 𝑟𝑖/𝑏𝑓𝑟𝑖 = 𝑏/𝑏𝑓𝑟𝑖 = 3000/68 = 45 𝑏𝑙𝑜𝑐𝑘𝑠.  Binary search on the index file: 𝑙𝑜𝑔2(𝑏𝑖) = 𝑙𝑜𝑔2(45) = 6 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.  To search for a record using the index, we need one additional block access to the data file: 6 + 1 = 7 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.  Compare with the average cost of binary search on the ordering key field without using the index: 𝑙𝑜𝑔2(𝑏) = 𝑙𝑜𝑔2(3000) = 12 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠. Contents 66  Single-Level Ordered Indexes  Primary Indexes  Clustering Indexes  Secondary Indexes Clustering Indexes 67  A clustering index is defined on the ordering nonkey field (clustering field) of an ordered data file.  The file records are physically ordered on the clustering field.  A nonkey field does not have a distinct value for each record.  The index includes one index entry for each distinct value of the clustering field.  Each index entry contains the clustering value and a pointer to the first data block having records of that clustering value.  A clustering index is a sparse index. 68 69 Contents 70  Single-Level Ordered Indexes  Primary Indexes  Clustering Indexes  Secondary Indexes Secondary Indexes (1) 71  A secondary index provides a secondary means of accessing a file for which some primary access already exists.  The data file records could be ordered, unordered, or hashed.  The secondary index is specified on nonordering field which may be a candidate key or a non-key.  Many secondary indexes can be created for the same file.  The index is an ordered file with two fields. Secondary Indexes (2) 72  A secondary index on a nonordering key field:  One index entry for each record in the data file.  Include the value of the field and a pointer either to the block storing the record or to the record itself. A dense index.  The improvement in search time for an arbitrary record is much greater for a secondary index than for a primary index. 73 Secondary Indexes (3) 74  E.g.: (contitnue the example of Primary Indexes)  We have:  Record size: 𝑅 = 100 𝑏𝑦𝑡𝑒𝑠.  Block size: 𝐵 = 1024 𝑏𝑦𝑡𝑒𝑠.  Number of records: 𝑟 = 30,000 𝑟𝑒𝑐𝑜𝑟𝑑𝑠.  Blocking factor: 𝑏𝑓𝑟 = 10 𝑟𝑒𝑐𝑜𝑟𝑑𝑠/𝑏𝑙𝑜𝑐𝑘.  Number of file blocks: 𝑏 = 3000 𝑏𝑙𝑜𝑐𝑘𝑠.  Primary index on the ordering key field 𝑉1:  Ordering key field size: 𝑉1 = 9 𝑏𝑦𝑡𝑒𝑠.  Block pointer size: 𝑃𝑅 = 6 𝑏𝑦𝑡𝑒𝑠.  Search on 𝑉1 using the index: 7 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.  Binary search on 𝑉1: 𝑙𝑜𝑔2(𝑏) = 12 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠. Secondary Indexes (4) 75  Secondary index on the nonordering key field 𝑉2:  Ordering key field size: 𝑉2 = 9 𝑏𝑦𝑡𝑒𝑠.  Block pointer size: 𝑃𝑅 = 6 𝑏𝑦𝑡𝑒𝑠.  Index entry: 𝑅𝑖 = (𝑉𝑆𝑆𝑁 + 𝑃𝑅) = (9 + 6) = 15 𝑏𝑦𝑡𝑒𝑠.  Index blocking factor: 𝑏𝑓𝑟𝑖 = 𝐵/𝑅𝑖 = 1024/15 = 68 𝑒𝑛𝑡𝑟𝑖𝑒𝑠/ 𝑏𝑙𝑜𝑐𝑘.  Number of index blocks: 𝑏𝑖 = 𝑟𝑖/𝑏𝑓𝑟𝑖 = 𝑟/𝑏𝑓𝑟𝑖 = 30000/68 = 442 𝑏𝑙𝑜𝑐𝑘𝑠.  Binary search on the index: 𝑙𝑜𝑔2(𝑏𝑖) = 𝑙𝑜𝑔2(442) = 9 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.  To search for a record using the index, we need one additional block access to the data file: 9 + 1 = 10 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠.  Linear search on 𝑉2 : 𝑏/2 = 3000/2 = 1500 𝑏𝑙𝑜𝑐𝑘 𝑎𝑐𝑐𝑒𝑠𝑠𝑒𝑠. Secondary Indexes (5) 76  A secondary index on a nonordering nonkey field:  Option 1: include duplicate index entries with the same K(i) value – one for each record  a dense index.  Option 2: keep a list of pointers in the index entry for K(i) - one pointer to each block that contains a record of K(i).  Option 3: a single entry for each index field value, the pointer P(i) points to a disk block including record pointers, each record pointer points to one of the data file records of value K(i). If some value K(i) occurs in too many records, a cluster or linked list of blocks is used  nondense. 77 78 Contents 79  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Multi-Level Indexes (1) 80  When single-level index is an ordered file with a distinct value for each K(i)  Can create a primary index to the index itself.  The original index file is called the first-level index and the index to the index is called the second- level index.  We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block. Multi-Level Indexes (2) 81  A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block.  The blocking factor for the index (𝑏𝑓𝑟𝑖) is called the fan-out of the multilevel index (𝒇𝒐).  At each search step: n-ways search (𝑛 = 𝑓𝑜).  The number levels of a multi-level index: 𝑡 = 𝑙𝑜𝑔𝑓𝑜(𝑟1)  𝑟1: the number of index entries in the first level index.  Such a multi-level index is a form of search tree.  However, insertion and deletion of new index entries is a severe problem because every level of the index is an ordered file. 82 Contents 83  Data Storage  Storage Hierarchy  Storage of Databases  RAID Technology  Storage Area Networks  Indexing Structures for Files  Indexes  Single-Level Ordered Indexes  Multi-Level Indexes  Dynamic Multilevel Indexes (B-tree & B+-tree) Search Tree 84 Dynamic Multilevel Indexes (1) 85  Multilevel indexing has insertion and deletion problems  use a dynamic multilevel index that leaves some space in each of its blocks for inserting new entries  use appropriate insertion/deletion algorithms.  These data structures are variations of search trees that allow efficient insertion and deletion of new search values.  Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem.  In B-Tree and B+-Tree data structures, each node corresponds to a disk block. Dynamic Multilevel Indexes (2) 86  An insertion into a node that is not full is quite efficient.  If a node is full the insertion causes a split into two nodes.  Splitting may propagate to other tree levels.  A deletion is quite efficient if a node does not become less than half full.  If a deletion causes a node to become less than half full, it must be merged with neighboring nodes. Dynamic Multilevel Indexes (3) 87  Main differences between B-tree and B+-tree:  In a B-tree, pointers to data records exist at all levels of the tree.  In a B+-tree, all pointers to data records exists at the leaf-level nodes.  Each leaf node of B+-tree has a pointer to the next leaf node of the tree.  A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree. B-tree Structures B+-tree Structures 89 90 Example of Insertion in a B+-tree 91 Example of Deletion in a B+-tree Q & A 92 Exercise (1) 93 Consider a disk with block size B = 512 bytes. A block pointer is P = 6 bytes long, and a record pointer is PR = 7 bytes long. A file has r = 30,000 EMPLOYEE records of fixed length. Each record has the following fields: Name (30 bytes), Ssn (9 bytes), Department_code (9 bytes), Address (40 bytes), Phone (10 bytes), Birth_date (8 bytes), Sex (1 byte), Job_code (4 bytes), and Salary (4 bytes, real number).An additional byte is used as a deletion marker. Exercise (2) 94 1. Calculate the record size R in bytes. 2. Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization. 3. Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate: a. The index blocking factor bfri. b. the number of first-level index entries and the number of first-level index blocks. 4. If we make it into a multilevel index (two levels). a. Calculate the total number of blocks required by the multilevel index. b. the number of block accesses needed to search for and retrieve a record from the file—given its Ssn value.

Các file đính kèm theo tài liệu này:

7_data_storage_indexing_structures_for_files_3655.pdf