Database second test _ summary

Tue, May 19, 2020 3-minute read

Chapter 11 : Data analysis

Data Warehousing Data sources often store only current data, not historical data. Data warehouse is a repository of information gathered from multiple sources, stored under a unified schema, at single site.

simplifies querying, permits stud of historical trends
not a lot of update happens, thus loads less from transaction processing.

Design Issues

Source driven architecture
- Whenever there is data, transmits.
Destination driven architecture
- Warehouse pulls data.
Synchronous vs asynchronous replication

Multidimensional Data and Warehouse Schemas

Physical storage

primary storage : fastest media but volatile
Secondary storage : next level in hierarchy, non - volatile, moderately fast. (usb, ssd, …)
tertiary storage : slowest

1. Primary Storage

Cache
Main memory : fast access (but too small, and volatile)

2. Secondary Storage*

Magnetic disk:

direct access. very large. sdd, hdd.

Tertiary Storage

Optical storage, tape. very high capacity.

Disk controller

Performace rate:

Data - transfer rate
- rate that controller can handle is also important
Mean time to failure (MTTF)

Optimization of Disk - Block access

Block - a contiguous sequence of sectors from a single track:

data is transferred between disk and main memory in Blocks

RAID Level 5:
partitions data and parity among N + 1 disks, rather than storing data in N disks and parity in 1 disk: compared to level 1, lower storage overhead but higher time overhead for writes.

Data storage structure

File organization

the database is stored as collection files:

File is sequence of records,
Record is sequence of fields,
It is units of blocks.

How to represent records in a file structure :

Fixed length records
Variable length records. (multiple record types are stored in a file)

Free lists:

Store the address of the first deleted record in the file header. Rest is normal “list” data structure.

variable length Records.

Issues:
- How to represent a single record in a way that attributes can be extracted easily?

Organization of Records in files

Free space map: array with 1 entry per block, records fraction of blocks that is free. If the block is too many, can make second - level free space map.

Sequential File Organization.

The records in the file are ordred by search - key.

Data dictionary.

also called system catalog stores metadata.

Column oriented storage

also known as columnar representation. Benefits: reduced IO if only some attributes are accessed. improved compression. Some databases support both representation.

Relational Database design

BCNF

A relational schema R is in BCNF with respect to a set F of FDs if for all functional dependencies in F+ of the form $! a \rarr b \in F^+$! at least one of the following holds:

$$ a \rarr b $$ is trivial
$$a$$ is superkey for R

For example, if in_dept(ID, name, dept_name, building, ..) is scema, dept_name -> building, budget. But, dept_name is NOT superkey of R (in_dept) Therefore is not BCNF.

BCNF is lossless, but, it isn’t dependency preserving.

a -> b, a,b -> c R1 = a,b, R2 = a,c

But, no a,b -> c is implied here.

Example of BCNF

Third Normal Form

lossless, dependency preserving, but redundancy problem

A relational schema R is in third normal form (3NF) if for all:

$! a \rarr b \in F^+ $! at least one follows:

$$ a \rarr b $$ is trivial
$$a$$ is superkey
Each attribute A in $$b - a$$ is contained in a candidate key for R

Example:

R = {J, K, L}

F = {JK -> L, L -> K}

One can see that R is in 3NF. But, there is redundancy in this schema. Since L, K can be repetitive, but has to be listed fully in R.

Useless blog on stuff I do