Database second test _ summary

Tue, May 19, 2020 3-minute read

Chapter 11 : Data analysis

Data Warehousing Data sources often store only current data, not historical data. Data warehouse is a repository of information gathered from multiple sources, stored under a unified schema, at single site.

  • simplifies querying, permits stud of historical trends
  • not a lot of update happens, thus loads less from transaction processing.

Design Issues

  1. Source driven architecture

    • Whenever there is data, transmits.
  2. Destination driven architecture

    • Warehouse pulls data.
  3. Synchronous vs asynchronous replication

Multidimensional Data and Warehouse Schemas

Physical storage

  1. primary storage : fastest media but volatile
  2. Secondary storage : next level in hierarchy, non - volatile, moderately fast. (usb, ssd, …)
  3. tertiary storage : slowest

1. Primary Storage

  • Cache
  • Main memory : fast access (but too small, and volatile)

2. Secondary Storage*

  • Magnetic disk:
  • direct access. very large. sdd, hdd.

Tertiary Storage

Optical storage, tape. very high capacity.

Disk controller

Performace rate:

  • Data - transfer rate
    • rate that controller can handle is also important
  • Mean time to failure (MTTF)

Optimization of Disk - Block access

Block - a contiguous sequence of sectors from a single track:

  • data is transferred between disk and main memory in Blocks
RAID Level 5:
partitions data and parity among N + 1 disks, rather than storing data in N disks and parity in 1 disk

compared to level 1, lower storage overhead but higher time overhead for writes.

Data storage structure

File organization

the database is stored as collection files:

  • File is sequence of records,
  • Record is sequence of fields,
  • It is units of blocks.

How to represent records in a file structure :

  • Fixed length records
  • Variable length records. (multiple record types are stored in a file)

Free lists:

Store the address of the first deleted record in the file header. Rest is normal “list” data structure.

variable length Records.

  • Issues:
    • How to represent a single record in a way that attributes can be extracted easily?

Organization of Records in files

Free space map: array with 1 entry per block, records fraction of blocks that is free. If the block is too many, can make second - level free space map.

Sequential File Organization.

The records in the file are ordred by search - key.

Data dictionary.

  • also called system catalog stores metadata.

Column oriented storage

  • also known as columnar representation. Benefits: reduced IO if only some attributes are accessed. improved compression. Some databases support both representation.

Relational Database design

BCNF

A relational schema R is in BCNF with respect to a set F of FDs if for all functional dependencies in F+ of the form $! a \rarr b \in F^+$! at least one of the following holds:

  • $$ a \rarr b $$ is trivial
  • $$a$$ is superkey for R

For example, if in_dept(ID, name, dept_name, building, ..) is scema, dept_name -> building, budget. But, dept_name is NOT superkey of R (in_dept) Therefore is not BCNF.

BCNF is lossless, but, it isn’t dependency preserving.

a -> b, a,b -> c R1 = a,b, R2 = a,c

But, no a,b -> c is implied here.

Example of BCNF

Third Normal Form

lossless, dependency preserving, but redundancy problem

A relational schema R is in third normal form (3NF) if for all:

$! a \rarr b \in F^+ $! at least one follows:

  • $$ a \rarr b $$ is trivial
  • $$a$$ is superkey
  • Each attribute A in $$b - a$$ is contained in a candidate key for R

Example:

R = {J, K, L}

F = {JK -> L, L -> K}

One can see that R is in 3NF. But, there is redundancy in this schema. Since L, K can be repetitive, but has to be listed fully in R.

Denormalization for Performance