Hi, I'm Carrie Anne, and welcome to CrashCourse Computer Science! Last episode we talked about data storage, how technologies like magnetic tape and hard disks can store millions, billions and trillions of bits of data, for long durations, even without power. Which is perfect for recording "big blobs of related data", what are more commonly called computer files. You've no doubt encountered many types, like text files, music files, photos and videos. Today, we're going to talk about how files work, and how computers keep them all organized with File Systems.
INTRO
It's perfectly legal for a file to contain arbitrary, unformatted data, but it's most useful and practical if the data inside the file is organized somehow. This is called a file format. You can invent your own, and programmers do that from time to time, but it's usually best and easiest to use an existing standard, like JPEG and MP3. Let's look at some simple file formats.
The most straightforward are T-X-T files, which contain, surprise, text. Like all computer files, this is just a huge list of numbers, stored as binary. If we look at the raw values of a T-X-T file in storage, it would look something like this: We can view this as decimal numbers instead of binary, but that still doesn't help us read the text. The key to interpreting this data is knowing that T-X-T files use ASCII, a character encoding standard we discussed way back in Episode 4.
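To make that concrete, here's a minimal Python sketch of the decoding step. Only the leading value 72 comes from the episode; the other values, and the names `raw_values` and `decode_ascii`, are assumptions made up for illustration.

```python
# Hypothetical raw values for a tiny text file. Only the first value (72)
# is taken from the episode; the rest are assumed for this example.
raw_values = [72, 101, 108, 108, 111]

def decode_ascii(values):
    """Interpret each number as an ASCII code and return the resulting text."""
    # bytes(...).decode("ascii") raises an error for values above 127,
    # which are outside the ASCII range.
    return bytes(values).decode("ascii")

# The same values viewed as 8-bit binary, the way they sit in storage.
binary_view = [format(value, "08b") for value in raw_values]

text = decode_ascii(raw_values)
print(binary_view[0], "->", raw_values[0], "->", text[0])
print(text)
```

Without knowing the encoding, the list is just numbers; with it, the same list becomes readable text.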
So, in ASCII, our first value, 72, maps to the capital letter H. And in this way, we decode the whole file.
Let's look at a more complicated example: a WAVE file, also called a WAV, which stores audio. Before we can correctly read the data, we need to know some information, like the bit rate and whether it's a single track or stereo.
Data about data is called metadata. This metadata is stored at the front of the file, ahead of any actual data, in what's known as a Header. Here's what the first 44 bytes of a WAV file look like. Some parts are always the same, like where it spells out W-A-V-E. Other parts contain numbers that change depending on the data contained within. The audio data comes right behind the metadata, and it's stored as a long list of numbers. These values represent the amplitude of sound captured many times per second, and if you want a primer on sound, check out our video all about it in Crash Course Physics. Link in the dooblydoo.
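As a rough illustration of that header, here's a Python sketch that builds and then reads back a 44-byte header. It assumes the common canonical layout for uncompressed (PCM) WAV audio; real files can carry extra chunks, and the function names `build_header` and `parse_header` are made up for this example.

```python
import struct

# Assumed canonical 44-byte PCM WAV header. "<" means little-endian with
# no padding; "4s" is a 4-byte tag, "I" a 32-bit and "H" a 16-bit number.
HEADER_FORMAT = "<4sI4s4sIHHIIHH4sI"

def build_header(channels, sample_rate, bits_per_sample, data_size):
    """Pack the metadata that sits in front of the audio samples."""
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second
    return struct.pack(
        HEADER_FORMAT,
        b"RIFF", 36 + data_size, b"WAVE",           # fixed tags and total size
        b"fmt ", 16, 1,                             # format chunk, 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,                         # audio data follows this
    )

def parse_header(header):
    """Unpack the first 44 bytes into the values a player needs."""
    fields = struct.unpack(HEADER_FORMAT, header[:44])
    return {
        "format": fields[2].decode("ascii"),
        "channels": fields[6],
        "sample_rate": fields[7],
        "bits_per_sample": fields[10],
        "data_size": fields[12],
    }

header = build_header(channels=1, sample_rate=44100, bits_per_sample=16, data_size=0)
info = parse_header(header)
print(len(header), info)
```

The tags like `WAVE` never change, while the numeric fields vary from file to file.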
As an example, let's look at a waveform of me saying: "hello!" Hello! Now that we've captured some sound, let's zoom into a little snippet. A digital microphone, like the one in your computer or smartphone, samples the sound pressure thousands of times per second. Each sample can be represented as a number. Larger numbers mean higher sound pressure, what's called amplitude.
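Here's a small Python sketch of sampling, using a computed sine tone in place of a real microphone. The sample rate, the 16-bit amplitude range, and the name `sample_tone` are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000      # samples per second (assumed for this sketch)
MAX_AMPLITUDE = 32767   # largest value of a signed 16-bit sample

def sample_tone(frequency_hz, duration_s):
    """Measure a pure tone at regular intervals, one number per sample."""
    count = round(SAMPLE_RATE * duration_s)
    return [
        round(MAX_AMPLITUDE * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE))
        for n in range(count)
    ]

# One hundredth of a second of a 440 Hz tone.
samples = sample_tone(440, 0.01)
print(len(samples), samples[:5])
```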
And these numbers are exactly what gets stored in a WAVE file! Thousands of amplitudes for every single second of audio! When it's time to play this file, an audio program needs to actuate the computer's speakers such that the original waveform is emitted. Hello!
So, now that you're getting the hang of file formats, let's talk about bitmaps or B-M-Ps, which store pictures. On a computer, PICtures are made up of little tiny square ELements called pixels. Each pixel is a combination of three colors: red, green and blue.
These are called additive primary colors, and they can be mixed together to create any other color on our electronic displays. Now, just like WAV files, BMPs start with metadata, including key values like image width, image height, and color depth. As an example, let's say the metadata specified an image 4 pixels wide, by 4 pixels tall, with a 24-bit color depth: that's 8 bits for red, 8 bits for green, and 8 bits for blue. As a reminder, 8 bits is the same as one byte.
The smallest number a byte can store is 0, and the largest is 255. Our image data is going to look something like this: Let's look at the color of our first pixel. It has 255 for its red value, 255 for green and 255 for blue. This equates to full intensity red, full intensity green and full intensity blue. These colors blend together on your computer monitor to become white. So our first pixel is white! The next pixel has a Red-Green-Blue, or RGB, value of 255, 255, 0. That's the color yellow! The pixel after that has an RGB value of 0, 0, 0. That's zero intensity everything, which is black. And the next one is yellow.
Because the metadata specified this was a 4 by 4 image, we know that we've reached the end of our first row of pixels. So, we need to drop down a row. The next RGB value is 255, 255, 0: yellow again. Okay, let's go ahead and read all the pixels in our 4x4 image. Tada! A very low resolution pac-man! Obviously this is a simple example of a small image, but we could just as easily store this image in a BMP.
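The pixel walk above can be sketched in Python like this. It follows the simplified layout used in this episode (red, green, blue, top row first); real BMP files order things differently, storing blue before red and usually the bottom row first. The first row and the first pixel of the second row match the walkthrough; every other pixel value, and the name `decode_pixels`, is an assumption for illustration.

```python
WIDTH, HEIGHT = 4, 4          # from the (hypothetical) metadata
WHITE, YELLOW, BLACK = (255, 255, 255), (255, 255, 0), (0, 0, 0)

# First row as described: white, yellow, black, yellow.
first_row = bytes(WHITE + YELLOW + BLACK + YELLOW)
# Remaining three rows are placeholders (all yellow), assumed for the sketch.
pixel_bytes = first_row + bytes(YELLOW) * (WIDTH * (HEIGHT - 1))

def decode_pixels(data, width):
    """Group a flat run of bytes into RGB triples, then into rows."""
    pixels = [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]
    return [pixels[r:r + width] for r in range(0, len(pixels), width)]

rows = decode_pixels(pixel_bytes, WIDTH)
for row in rows:
    print(row)
```

The width from the metadata is the only thing telling us when to drop down a row.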
I want to emphasize again that it doesn't matter if it's a text file, WAV, BMP, or fancier formats we don't have time to discuss, like ZIPs and PPTs. Under the hood, they're all the same: long lists of numbers, stored as binary, on a storage device. File formats are the key to reading and understanding the data inside.
Now that you understand files a little better, let's move on to how computers go about storing them.
Even though the underlying storage medium might be a strip of tape, a drum, a disk, or integrated circuits, hardware and software abstractions let us think of storage as a long line of little buckets that store values. In the early days, when computers only performed one computation, like calculating artillery range tables, the entire storage operated like one big file. Data started at the beginning of storage, and then filled it up in order as output was produced, up to the storage capacity.
However, as computational power and storage capacity improved, it became possible, and useful, to store more than one file at a time. The simplest option is to store files back-to-back. This can work... But how does the computer know where files begin and end? Storage devices have no notion of files; they're just a mechanism for storing lots of bits.
So, for this to work, we need to have a special file that records where other ones are located. This goes by many names, but a good general term is Directory File. Most often, it's kept right at the front of storage, so we always know where to access it. Location zero! Inside the Directory File are the names of all the other files in storage. In our example, they each have a name, followed by a period, and end with what's called a File Extension, like BMP or WAV. Those further assist programs in identifying file types. The Directory File also stores metadata about these files, like when they were created and last modified, who the owner is, and whether they can be read, written or both. But most importantly, the Directory File contains where these files begin in storage, and how long they are.
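Here's a minimal Python sketch of a Directory File for back-to-back storage. For simplicity it keeps the directory in a separate list rather than at location zero of storage, and it leaves out the dates, owner and permissions. The file contents and the names `DirectoryEntry`, `add_file` and `read_file` are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DirectoryEntry:
    name: str     # file name including its extension
    start: int    # where the file begins in storage
    length: int   # how many bytes it occupies

storage = bytearray()   # one long line of buckets
directory = []          # stands in for the Directory File

def add_file(name, data):
    """Append the file right after the previous one and record its location."""
    directory.append(DirectoryEntry(name, len(storage), len(data)))
    storage.extend(data)

def read_file(name):
    """Look up start and length in the directory, then slice storage."""
    for entry in directory:
        if entry.name == name:
            return bytes(storage[entry.start:entry.start + entry.length])
    raise FileNotFoundError(name)

add_file("todo.txt", b"feed cat")
add_file("carrie.bmp", b"PIXELS")
print(directory)
print(read_file("carrie.bmp"))
```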
If we want to add a file, remove a file, change a filename, or similar, we have to update the information in the Directory File. It's like the Table of Contents in a book: if you make a chapter shorter, or move it somewhere else, you have to update the table of contents, otherwise the page numbers won't match! The Directory File, and the maintenance of it, is an example of a very basic File System, the part of an Operating System that manages and keeps track of stored files. This particular example is called a Flat File System, because the files are all stored at one level. It's flat!
Of course, packing files together, back-to-back, is a bit of a problem, because if we want to add some data to, let's say, todo.txt, there's no room to do it without overwriting part of carrie.bmp.
So modern File Systems do two things. First, they store files in blocks. This leaves a little extra space for changes, called slack space. It also means that all file data is aligned to a common size, which simplifies management. In a scheme like this, our Directory File needs to keep track of what block each file is stored in.
The second thing File Systems do is allow files to be broken up into chunks and stored across many blocks. So let's say we open todo.txt and add a few more items, and the file becomes too big to be saved in its one block. We don't want to overwrite the neighboring one, so instead, the File System allocates an unused block, which can accommodate extra data.
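Growing a file by allocating another block can be sketched in Python as follows. The tiny block size, the file contents, and the names `write_file` and `read_file` are assumptions for illustration; real file systems use much larger blocks, and this sketch only grows files and never gives blocks back.

```python
BLOCK_SIZE = 8   # bytes per block (assumed; kept tiny so it is easy to follow)
NUM_BLOCKS = 6

blocks = [bytearray(BLOCK_SIZE) for _ in range(NUM_BLOCKS)]
free_blocks = list(range(NUM_BLOCKS))
directory = {}   # file name -> {"blocks": [block numbers], "length": bytes used}

def write_file(name, data):
    """Store data, allocating extra blocks when the file outgrows its own."""
    entry = directory.setdefault(name, {"blocks": [], "length": 0})
    needed = -(-len(data) // BLOCK_SIZE)          # round up to whole blocks
    while len(entry["blocks"]) < needed:
        if not free_blocks:
            raise OSError("storage is full")
        entry["blocks"].append(free_blocks.pop(0))
    for index, block_number in enumerate(entry["blocks"]):
        chunk = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
        blocks[block_number][:len(chunk)] = chunk
    entry["length"] = len(data)

def read_file(name):
    """Join the file's blocks in listed order and trim off the slack space."""
    entry = directory[name]
    joined = b"".join(bytes(blocks[b]) for b in entry["blocks"])
    return joined[:entry["length"]]

write_file("todo.txt", b"milk")              # fits in one block
write_file("carrie.bmp", b"PIXELS")          # takes the neighboring block
write_file("todo.txt", b"milk, eggs, tea")   # too big now, needs a second block
print(directory["todo.txt"]["blocks"], read_file("todo.txt"))
```

The neighboring file is never touched; todo.txt simply gains a second, non-adjacent block.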
With a File System scheme like this, the Directory File needs to store not just one block per file, but rather a list of blocks per file. In this way, we can have files of variable sizes that can be easily expanded and shrunk, simply by allocating and deallocating blocks. If you watched our episode on Operating Systems, this should sound a lot like Virtual Memory. Conceptually it's very similar!
Now let's say we want to delete carrie.bmp. To do that, we can simply remove the entry from the Directory File. This, in turn, causes one block to become free. Note that we didn't actually erase the file's data in storage, we just deleted the record of it. At some point, that block will be overwritten with new data, but until then, it just sits there.
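A deletion like this can be sketched in a few lines of Python. The block contents, block numbers, and the name `delete_file` are assumptions for illustration.

```python
# Hypothetical state: block contents and which blocks each file uses.
blocks = [b"todo: milk", b"CARRIE-PIXELS", b""]
directory = {"todo.txt": [0], "carrie.bmp": [1]}
free_blocks = [2]

def delete_file(name):
    """Forget the file: drop its directory entry and mark its blocks free."""
    freed = directory.pop(name)
    free_blocks.extend(freed)   # note: the blocks themselves are not wiped

delete_file("carrie.bmp")
print(directory)      # the record is gone
print(free_blocks)    # block 1 is available again
print(blocks[1])      # but the old bytes are still sitting in storage
```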
This is one way that computer forensic teams can recover data from computers even though people think it has been deleted. Crafty!
Ok, let's say we add even more items to our todo list, which causes the File System to allocate yet another block to the file, in this case, recycling the block freed from carrie.bmp. Now our todo.txt is stored across 3 blocks, spaced apart, and also out of order. Files getting broken up across storage like this is called fragmentation.
It's the inevitable byproduct of files being created, deleted and modified. For many storage technologies, this is bad news. On magnetic tape, reading todo.txt into memory would require seeking to block 1, then fast forwarding to block 5, and then rewinding to block 3. That's a lot of back and forth! In real world File Systems, large files might be stored across hundreds of blocks, and you don't want to have to wait five minutes for your files to open.
The answer is defragmentation! That might sound like technobabble, but the process is really simple, and once upon a time it was really fun to watch! The computer copies around data so that files have blocks located together in storage and in the right order.
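Here's a simplified Python sketch of that idea. It copies everything into a fresh list of blocks instead of shuffling data in place the way a real defragmenter would, and it assumes block 0 is reserved for the Directory File. The block contents, the second file `song.wav`, and the name `defragment` are made up for illustration.

```python
def defragment(blocks, directory):
    """Copy each file's blocks so they end up adjacent and in order."""
    new_blocks = [blocks[0]]          # block 0 is kept for the directory (assumed)
    new_directory = {}
    for name, block_list in directory.items():
        start = len(new_blocks)
        new_blocks.extend(blocks[b] for b in block_list)
        new_directory[name] = list(range(start, start + len(block_list)))
    # Pad with empty blocks so the storage capacity stays the same.
    new_blocks.extend([b""] * (len(blocks) - len(new_blocks)))
    return new_blocks, new_directory

# todo.txt is scattered across blocks 1, 5 and 3, as in the example above.
blocks = [b"<directory>", b"todo-A", b"song", b"todo-C", b"", b"todo-B"]
directory = {"todo.txt": [1, 5, 3], "song.wav": [2]}

new_blocks, new_directory = defragment(blocks, directory)
print(new_directory)
print(new_blocks)
```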
After we've defragged, we can read our todo file, now located in blocks 1 through 3, in a single, quick read pass.
So far, we've only been talking about Flat File Systems, where files are all stored in one directory. This worked ok when computers only had a little bit of storage, and you might only have a dozen or so files. But as storage capacity exploded, like we discussed last episode, so did the number of files on computers.
Very quickly, it became impractical to store all files together at one level. Just like documents in the real world, it's handy to store related files together in folders. Then we can put connected folders into folders, and so on. This is a Hierarchical File System, and it's what your computer uses. There are a variety of ways to implement this, but let's stick with the File System example we've been using to convey the main idea.
The biggest change is that our Directory File needs to be able to point not just to files, but also other directories. To keep track of what's a file and what's a directory, we need some extra metadata. This Directory File is the top-most one, known as the Root Directory. All other files and folders lie beneath this directory along various file paths.
We can see inside of our Root Directory File that we have 3 files and 2 subdirectories: music and photos. If we want to see what's stored in our music directory, we have to go to that block and read the Directory File located there; the format is the same as our root directory. There's a lot of great songs in there! In addition to being able to create hierarchies of unlimited depth, this method also allows us to easily move around files. So, if we wanted to move theme.wav from our root directory to the music directory, we don't have to re-arrange any blocks of data.
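A move like that can be sketched in Python with nested directory tables. The `is_dir` flag stands in for the extra metadata that tells files from directories. Apart from theme.wav sitting in block 5, the file names, block numbers, and the helper names `make_dir`, `make_file`, `find_dir` and `move` are assumptions for illustration.

```python
def make_dir():
    return {"is_dir": True, "entries": {}}

def make_file(block):
    return {"is_dir": False, "block": block}

# Root directory with 3 files and 2 subdirectories.
root = make_dir()
root["entries"].update({
    "todo.txt": make_file(1),      # block number assumed
    "carrie.bmp": make_file(3),    # block number assumed
    "theme.wav": make_file(5),
    "music": make_dir(),
    "photos": make_dir(),
})

def find_dir(path):
    """Follow a path such as "/music" down from the root directory."""
    current = root
    for part in path.strip("/").split("/"):
        if part:
            current = current["entries"][part]
            if not current["is_dir"]:
                raise NotADirectoryError(path)
    return current

def move(name, from_path, to_path):
    """Move a file by editing two directory tables; no data block is copied."""
    source = find_dir(from_path)
    target = find_dir(to_path)
    target["entries"][name] = source["entries"].pop(name)

move("theme.wav", "/", "/music")
print(sorted(root["entries"]))
print(find_dir("/music")["entries"])
```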
We can simply modify the two Directory Files, removing an entry from one and adding it to another. Importantly, the theme.wav file stays in block 5.
So that's a quick overview of the key principles of File Systems. They provide yet another way to move up a new level of abstraction. File systems allow us to hide the raw bits stored on magnetic tape, spinning disks and the like, and they let us think of data as neatly organized and easily accessible files. We even started talking about users, not programmers, manipulating data, like opening files and organizing them, foreshadowing where the series will be going in a few episodes. I'll see you next week.