Chương 10: Indexing

Exercise Emp (eid: int, salary:int, age: real, did: int) eid is the key, and there’s a clustered index on eid and an unclustered index on age Why not making age another clustered index? How will you use indexes to enforce the constraint that eid is a key? Give an example of a query that can be speeded up because of the available indexes. Give an example that is neither speeded up nor slowed down by the indexes. Can there be an update that can be slowed down because of the indexes?

43 trang | Chia sẻ: vutrong32 | Lượt xem: 1408 | Lượt tải: 0Free

Bạn đang xem trước 20 trang tài liệu Chương 10: Indexing, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Chương 10: Indexing1Why Concerning Storage and Indexing? DB design using logical models (ER/Relational).Appropriate level for designers to begin withProvide independence from implementation detailsPerformance: another major factor in user satisfactionDepends onEfficient data structures for data representationEfficiency of system operation on those structuresDisks contains data files and system files including dictionary and index filesDisk access: one of the most critical factor in performance.2Subjects to be DiscussedDisks: Can retrieve random page, but reading several consecutive pages is much cheaper than reading them in random orderBuffer manager: Stages pages from external storage to main memory buffer pool. File and index layers make calls to the buffer manager.File organization: Method of arranging a file of records on external storage.Record id (rid) is sufficient to physically locate recordIndexes are data structures that allow us to find the record ids of records with given values in index search key fields3Storage Hierarchy DBMS stores information on some storage mediumPrimary storage: can be operated directly by CPU.Secondary storage: larger capacity, lower cost, slower accesscannot be operated directly by CPU – must be copied to primary storageSecondary storage has major implications for DBMS designREAD: transfer data to main memoryWRITE: transfer data from main memory.Both transfers are high-cost operations, relative to in-memory operations, so must be planned carefully4Why Not Store Everything in Main Memory?Cost and sizeMain memory is volatile: What’s the problem?Typical storage hierarchy:Factors: access speed, cost per unit, reliabilityCache and main memory (RAM) for currently used data: fast but costlyFlash memory: limited number of writes (and slow), non-volatile, disk-substitute in embedded systemsDisk for the main database (secondary storage).Tapes for archiving older versions of the data (tertiary storage).5DisksSecondary storage device of choice. Data is stored and retrieved in units called disk blocks or pages.Unlike RAM, time to retrieve a disk page varies depending upon location on disk. Therefore, relative placement of pages on disk has major impact on DBMS performance!6Components of a Disk The platters spin The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!). Only one head reads/writes at any one time. Block size is a multiple of sector size (which is fixed).7Accessing a Disk PageTime to access (read/write) a disk block:Seek time (moving arms to position disk head on track)Rotational delay (waiting for block to rotate under head)Transfer time (actually moving data to/from disk surface)Seek time and rotational delay dominate.Seek time varies from about 1 to 20msecRotational delay varies from 0 to 10msecTransfer rate is less than 1msec per 4KB pageKey to lower I/O cost: reduce seek/rotation delays 8Arranging Pages on Disk`Next’ block concept: blocks on same track, followed byblocks on same cylinder, followed byblocks on adjacent cylinderBlocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay.For a sequential scan, pre-fetching several pages at a time is a big win!9Arranging Pages on Disk10RAIDRedundant Arrays of Independent Disks (RAID)Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.Goals: Increase performance and reliability. Two main techniques: parallelism and redundancyData striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over multiple disks.Redundancy: More disks -> more failures. Redundant information allows reconstruction of data if a disk fails.RAID levels: level 0 – level 6 11Disk Space ManagementLowest layer of DBMS software manages space on disk.Higher levels call upon this layer to:allocate/de-allocate a pageread/write a pageHighly desirable that a request for a sequence of pages to be satisfied by allocating the pages sequentially on disk. 12Structure of a DBMS13Structure of a DBMSA typical DBMS has a layered architecture.The figure does not show the concurrency control and recovery components.This is one of several possible architectures; each system has its own variations.14Buffer Management in a DBMSData must be in RAM for DBMS to operate on it.Table of pairs is maintained.15When a Page is Requested ...If requested page is not in pool:Choose a frame for replacementIf frame is dirty, write it to diskRead requested page into chosen framePin the page and return its address. If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time!16More on Buffer ManagementRequestor of page must unpin it, and indicate whether page has been modified: Dirty bit is used for this.Page in pool may be requested many times, A pin count is used. A page is a candidate for replacement iff pin count = 0.CC & Recovery may require additional I/O when a frame is chosen for replacement. Why? 17Buffer Replacement PolicyFrame is chosen for replacement by a replacement policy:Least-recently-used (LRU), Clock, MRU, etc.Policy can have big impact on # of I/O’s; depends on the access pattern.Sequential flooding: Nasty situation caused by LRU + repeated sequential scans.# buffer frames 15* and 3.0”If data is in sorted file, do binary search to find first such student, then scan to find others.Cost of binary search can be quite high.Idea: Create an `index’ file31Data Entry k* in IndexA data entry k* refers to the records stored in an index file with search key value kWhat does it contain?It contains enough information to locate data record with search key value kIdea: Efficiently search an index with search key value k to find the data entriesUse them to obtain the desired data records For a given search key value k, will there be exactly 0 or 1 data record?32Alternatives for Data Entry k* in IndexIn a data entry k* we can store:Data record with key value k, or, orChoice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k.Examples of indexing techniques: B+ trees, hash-based structuresTypically, index contains auxiliary information that directs searches to the desired data entries33Alternatives for Data Entry k* in IndexAlternative 1:If this is used, index structure is a file organization for data records (instead of a heap file or sorted file).At most one index on a given collection of data records can use Alternative 1. Why?Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.34Alternatives for Data Entry k* in IndexAlternatives 2 and 3:When are they better than Alternative 1?Data entries typically much smaller than data records. So, they are better than Alternative 1 with large data records, especially if search keys are small.Compare Alternatives 2 and 3 - which one is better?Alternative 3 is more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.35Index ClassificationPrimary vs. secondary: If search key contains primary key, then called primary index.Different defs in other booksUnique index: Search key contains a candidate key.36Dense vs Sparse IndexDense index: one index entry per search key value.Sparse index: index records for only some of the recordsEvery sparse index is clustered!Sparse indexes are smallerWhich one is faster?Which one has less overhead?37Clustered IndexClustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries, then called clustered index.Alternative 1 implies clusteredCan a file be clustered on with more than one search key?Can non-key field become clustering index?Cost of retrieving data records through index varies greatly based on whether index is clustered or not. Why?38Clustered IndexIf the index is clustered, rids of qualifying data entries point to a contiguous collection of records, and we need to retrieve only a few data pages. If unclustered, each qualifying data entry could contain a rid that points to a distinct data page, leading to I/O39Clustered vs. Unclustered IndexSuppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file.To build clustered index, first sort the Heap file (with some free space on each page for future inserts). Overflow pages may be needed for inserts. (Thus, order of data recs is `close to’, but not identical to, the sort order.)40Clustered vs. Unclustered Index41Indexes with Composite Search Keys Composite Search Keys: Search on a combination of fields.Equality query: Every field value is equal to a constant value. E.g. wrt index:age=12 and sal =75Range query: Some field value is not a constant. E.g.:age =12; or age=12 and sal > 10Data entries in index sorted by search key to support range queries.sue1375bobcaljoe121020801112nameagesal12,2012,1011,8013,7520,1210,1275,1380,111112121310207580Data recordssorted by nameData entries in indexsorted by Data entriessorted by Examples of composite keyindexes using lexicographic order.ExerciseEmp (eid: int, salary:int, age: real, did: int)eid is the key, and there’s a clustered index on eid and an unclustered index on ageWhy not making age another clustered index?How will you use indexes to enforce the constraint that eid is a key?Give an example of a query that can be speeded up because of the available indexes.Give an example that is neither speeded up nor slowed down by the indexes.Can there be an update that can be slowed down because of the indexes?43

Các file đính kèm theo tài liệu này:

chuong10_9862.pptx