Database second test _ summary
Chapter 11 : Data analysis
Data Warehousing Data sources often store only current data, not historical data. Data warehouse is a repository of information gathered from multiple sources, stored under a unified schema, at single site.
- simplifies querying, permits stud of historical trends
- not a lot of update happens, thus loads less from transaction processing.
Design Issues
-
Source driven architecture
- Whenever there is data, transmits.
-
Destination driven architecture
- Warehouse pulls data.
-
Synchronous vs asynchronous replication
Multidimensional Data and Warehouse Schemas
Physical storage
- primary storage : fastest media but volatile
- Secondary storage : next level in hierarchy, non - volatile, moderately fast. (usb, ssd, …)
- tertiary storage : slowest
1. Primary Storage
- Cache
- Main memory : fast access (but too small, and volatile)
2. Secondary Storage*
- Magnetic disk:
- direct access. very large. sdd, hdd.
Tertiary Storage
Optical storage, tape. very high capacity.
Disk controller
Performace rate:
- Data - transfer rate
- rate that controller can handle is also important
- Mean time to failure (MTTF)
Optimization of Disk - Block access
Block - a contiguous sequence of sectors from a single track:
- data is transferred between disk and main memory in Blocks
- RAID Level 5:
- partitions data and parity among N + 1 disks, rather than storing data in N disks and parity in 1 disk
-
compared to level 1, lower storage overhead but higher time overhead for writes.
Data storage structure
File organization
the database is stored as collection files:
- File is sequence of records,
- Record is sequence of fields,
- It is units of blocks.
How to represent records in a file structure :
- Fixed length records
- Variable length records. (multiple record types are stored in a file)
Free lists:
Store the address of the first deleted record in the file header. Rest is normal “list” data structure.
variable length Records.
- Issues:
- How to represent a single record in a way that attributes can be extracted easily?
Organization of Records in files
Free space map: array with 1 entry per block, records fraction of blocks that is free. If the block is too many, can make second - level free space map.
Sequential File Organization.
The records in the file are ordred by search - key.
Data dictionary.
- also called system catalog stores metadata.
Column oriented storage
- also known as columnar representation. Benefits: reduced IO if only some attributes are accessed. improved compression. Some databases support both representation.
Relational Database design
BCNF
A relational schema R is in BCNF with respect to a set F of FDs if for all functional dependencies in F+ of the form $! a \rarr b \in F^+$! at least one of the following holds:
- $$ a \rarr b $$ is trivial
- $$a$$ is superkey for R
For example, if in_dept(ID, name, dept_name, building, ..) is scema, dept_name -> building, budget. But, dept_name is NOT superkey of R (in_dept) Therefore is not BCNF.
BCNF is lossless, but, it isn’t dependency preserving.
a -> b, a,b -> c R1 = a,b, R2 = a,c
But, no a,b -> c is implied here.
Example of BCNF
Third Normal Form
lossless, dependency preserving, but redundancy problem
A relational schema R is in third normal form (3NF) if for all:
$! a \rarr b \in F^+ $! at least one follows:
- $$ a \rarr b $$ is trivial
- $$a$$ is superkey
- Each attribute A in $$b - a$$ is contained in a candidate key for R
Example:
R = {J, K, L}
F = {JK -> L, L -> K}
One can see that R is in 3NF. But, there is redundancy in this schema. Since L, K can be repetitive, but has to be listed fully in R.