Archiving, Compression and Backup
What is compression?
Compression is the reduction of the number of bits needed to store data.
Why is compression useful?
Compression saves storage space, speeds up file transfers, and reduces bandwidth use. It's also useful when creating backups in case of disk failure.
Many of the storage services and devices we use today rely on some form of compression to limit file size. For example, music players use the .mp3 file format, which compresses audio files.
How is compression achieved?
Compression is mainly achieved by removing redundant data. For example, an image of a flag contains large areas of a single color, which can be described once instead of pixel by pixel.
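To get a feel for how much redundancy matters, here is a small sketch using gzip (introduced later in this lesson). The file names repeated.txt and random.bin are made up for the illustration; one file is full of repeated lines, the other holds random bytes with almost nothing to remove.

```shell
# Work in a throwaway directory (all file names here are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# Highly redundant data: the same 15-byte line repeated 10,000 times (150,000 bytes).
yes "red white blue" | head -n 10000 > repeated.txt

# Data with almost no redundancy: 150,000 random bytes.
head -c 150000 /dev/urandom > random.bin

# Compress to standard output so the originals stay in place.
gzip -c repeated.txt > repeated.txt.gz
gzip -c random.bin > random.bin.gz

# Compare sizes: the repeated file shrinks dramatically, the random one does not.
wc -c repeated.txt repeated.txt.gz random.bin random.bin.gz
```

The exact byte counts will vary from run to run, but the repeated file should end up a tiny fraction of its original size while the random file stays about the same.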
Two types of algorithms
There are two types of compression algorithms - lossy and lossless.
Lossy algorithms
In a lossy algorithm, some of the data is lost in exchange for a smaller file size. An example of this would be the .mp3 file format, which eliminates less audible sounds.
Lossless algorithms
Lossless algorithms, on the other hand, preserve all data contained in the original file. Since any loss of data is unacceptable for general-purpose files, we'll be looking at commands that use lossless algorithms.
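One way to convince yourself that a lossless tool gives back exactly what you put in is a round trip followed by a byte-for-byte comparison. This is a minimal sketch using gzip and cmp; the file names are made up for the example.

```shell
# Work in a throwaway directory (file names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# Sample data: the numbers 1 to 5000, one per line.
seq 1 5000 > original.txt

# Compress, then decompress into a separate file.
gzip -c original.txt > original.txt.gz
gzip -dc original.txt.gz > restored.txt

# cmp exits with status 0 only if the two files are byte-for-byte identical.
if cmp -s original.txt restored.txt; then
    result="identical"
else
    result="different"
fi
echo "original and restored files are $result"
```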
Compressing Files: gzip, gunzip, zcat, zless
gzip
The most popular Unix file compression command is gzip. gzip replaces the original file with a compressed version, which has a .gz file extension.
$ ls -l
-rw-r--r-- 1 JohnDoe staff 285 May 9 13:27 README.md
-rw-r--r-- 1 JohnDoe staff 6459 May 9 13:27 index.html
-rw-r--r-- 1 JohnDoe staff 5341 May 9 13:27 todo.txt
$ gzip README.md todo.txt index.html
$ ls -l
-rw-r--r-- 1 JohnDoe staff 56 May 9 13:27 README.md.gz
-rw-r--r-- 1 JohnDoe staff 2626 May 9 13:27 index.html.gz
-rw-r--r-- 1 JohnDoe staff 2113 May 9 13:27 todo.txt.gz
As you can see, the files have been individually compressed. Note, however, that gzip can't turn a folder into a single compressed file.
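The sketch below shows what happens when you point gzip at a folder, first without and then with the -r option. The folder name notes and its contents are invented for the example, and the exact warning text depends on your gzip version.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# A small folder with two files.
mkdir notes
seq 1 1000 > notes/a.txt
seq 1 2000 > notes/b.txt

# Plain gzip skips directories and prints a warning; nothing is compressed.
gzip notes || echo "gzip did not compress the folder"
ls notes

# With -r, gzip descends into the folder, but each file still becomes
# its own .gz file; there is no single notes.gz.
gzip -r notes
ls notes
```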
gunzip
To restore a file to its uncompressed form, use gunzip. The uncompressed file will have the same permissions and timestamp as when it was gzipped.
$ gunzip README.md index.html todo.txt
$ ls -l
-rw-r--r-- 1 JohnDoe staff 285 May 9 13:27 README.md
-rw-r--r-- 1 JohnDoe staff 6459 May 9 13:27 index.html
-rw-r--r-- 1 JohnDoe staff 5341 May 9 13:27 todo.txt
Note how we don't need to specify the .gz in the filename, as that is already assumed.
Options
There are several options you can use with gzip. Here is a list of the most common ones - be sure to check the man page for more.
- -c: Write output to standard output and keep original files.
- -d: Decompress - same as using gunzip.
- -f: Force overwriting and compress links.
- -h: Display usage information. May also be specified with --help.
- -k: Retain original files.
- -l: List the compression ratio for each compressed file.
- -r: Recursively compress files in the directory.
- -t: Test the integrity of a compressed file.
- -v: Verbose mode.
- -number: Set the amount of compression. number is an integer in the range of 1 to 9. 1 is fastest, but has the least compression. 9 is slowest with the most compression. The default is 6.
Keeping original files
Suppose you want to keep the original files alongside the gzipped copies. Simply pass in the -k option.
$ gzip -k README.md
$ ls
...
README.md README.md.gz
...
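If your gzip is too old to know -k, the -c option gives a similar result by writing to standard output. A minimal sketch of both approaches, with made-up file names:

```shell
# Work in a throwaway directory (file names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > notes.txt

# Option 1: -k keeps notes.txt and writes notes.txt.gz next to it.
gzip -k notes.txt

# Option 2: -c writes the compressed data to standard output,
# so you choose the output name yourself and the input is untouched.
gzip -c notes.txt > notes-copy.gz

ls
```

With -c you also get to pick where the compressed copy goes, which is handy when the original lives in a read-only directory.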
Viewing compression ratio
We can view the compression ratio with the -l option.
$ gzip index.html
$ gzip -l index.html.gz
compressed uncompressed ratio uncompressed_name
2626 6459 59.3% index.html
$ gzip -l README.md.gz
compressed uncompressed ratio uncompressed_name
56 285 80.3% README.md
Seeing our ratio, we can tell that our README.md file must have a lot of repeated elements!
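The ratio is simply the share of the original size that was saved. As a rough check, this sketch redoes the arithmetic for index.html with awk, using the sizes from the listing above.

```shell
# Sizes taken from the gzip -l output for index.html.
compressed=2626
uncompressed=6459

# Space saved, as a percentage of the original size.
ratio=$(awk -v c="$compressed" -v u="$uncompressed" \
    'BEGIN { printf "%.1f", (1 - c / u) * 100 }')
echo "index.html: ${ratio}% saved"
# prints: index.html: 59.3% saved
```

The same formula for README.md (56 and 285 bytes) gives about 80.4%, so gzip's own figure may differ in the last digit depending on how it rounds.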
zcat and zless
zcat is the same as using gunzip -c. It decompresses your file and prints the result to standard out, leaving the .gz file in place.
$ zcat README.md.gz | less
# unzipped and now you can read with less
$ zless README.md.gz
# same function as above
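Because the decompressed data goes to standard out, you can feed it to any other command, not just less. The sketch below searches a compressed log without ever writing the uncompressed file to disk. It uses gzip -dc, which does the same job as gunzip -c; on some systems zcat only handles .Z files, so this form may be more portable. The log file and its contents are invented for the example.

```shell
# Work in a throwaway directory (file names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# A tiny log file, compressed in place (app.log becomes app.log.gz).
printf 'ok: started\nerror: disk full\nok: stopped\n' > app.log
gzip app.log

# Decompress to standard output and search it; app.log.gz is left untouched.
matches=$(gzip -dc app.log.gz | grep -c "error")
echo "lines containing 'error': $matches"
# prints: lines containing 'error': 1
```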
Compressing already compressed files
Make sure you don't compress an already compressed format. File types such as .mp3 and .jpeg have already been compressed, so a further compression may cause the file to become larger.
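You can see this effect with gzip itself by compressing a file twice. This is only a sketch with made-up file names; the input is random data encoded as text, so the first pass has something to remove and the second pass has almost nothing.

```shell
# Work in a throwaway directory (file names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# Random bytes encoded as text: compressible once, but not much beyond that.
head -c 30000 /dev/urandom | base64 > data.txt

# First pass: compress the original.
gzip -c data.txt > once.gz

# Second pass: compress the already compressed file.
gzip -c once.gz > twice.gz

# The second pass gains little or nothing and may even add a few bytes.
wc -c data.txt once.gz twice.gz
```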
As you can tell, gzip was not meant to bundle multiple files into one - we use the tar command for that, which we'll see in the next lesson.
Archiving and Compressing Multiple Files: tar, zip, unzip
To package folders and multiple files into one file for transferring between computers, we use the tar command. The .tar format is used for collecting multiple files into one archive file for distribution or backup; compression is applied separately, for example with gzip.
Name origin
"tar" is short for tape archive. The name is a reminder of the time when files were backed up on, and occasionally retrieved from, magnetic tape, which was then a common storage device.
Furthermore, the term tarball is jargon describing "a bunch of files stuck together in a ball of tar."
Options
Options for tar don't need a preceding hyphen (-). Here is a list of common action options.
- c: Create an archive. List the files and/or directories to include as arguments.
- r: Append specified pathnames to the end of an archive.
- t: List the contents of an archive.
- x: Extract.
With an action option you may include a qualifier.
- v: Verbose mode.
- f: Specify the name of the .tar file you want to create or read.
- P: Retain the leading / for filenames.
- z: Process through gzip.
- j: Process through bzip2.
Some common commands you will use are:
- tar -tf fileName: List
- tar -xf fileName: Extract
- tar -cf fileName: Create
- tar --help: Help
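The r action is the least familiar of the four, so here is a small sketch that creates an archive, appends to it, and lists the result. The file and archive names are made up for the example.

```shell
# Work in a throwaway directory (file names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
echo "one" > a.txt
echo "two" > b.txt

# c: create an archive holding a.txt; f names the archive file.
tar -cf demo.tar a.txt

# r: append b.txt to the end of the existing archive.
tar -rf demo.tar b.txt

# t: list what the archive now contains.
contents=$(tar -tf demo.tar)
echo "$contents"
```

Note that appending works on plain .tar files; an archive that has already been gzipped generally has to be decompressed first.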
Examples with tar
Here are some examples with tar that you'll most likely come across.
1) Creating a tar file
$ tar -cvf myArchive.tar README.md todo.txt index.html
a README.md
a index.html
a todo.txt
# Created myArchive.tar file
2) Unpacking tar files
$ tar xvf archive.tar
3) List contents with table-of-contents mode
Before you go and extract a .tar file, you may want to have a peek inside. The tar command allows for this with its table-of-contents mode. Simply pass in the -t option.
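A sketch of how that might look in practice: list the archive, then pull out only the file you were after. The site folder and its files are invented for the example, and the -C option (change to a directory before extracting) is an extra that is not covered in the option list above.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir site
echo "<h1>hi</h1>" > site/index.html
echo "todo" > site/todo.txt
tar -cf site.tar site

# t lists member names; adding v shows permissions, owner, size and date too.
tar -tf site.tar
tar -tvf site.tar

# After peeking, extract just the member you need into a separate folder.
mkdir out
tar -xf site.tar -C out site/todo.txt
ls out/site
```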
4) Preserving original permissions
When you extract files, the original permission settings may be altered by your umask settings. To preserve the original permissions, use the -p option.
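The following sketch extracts the same archive twice under a restrictive umask, once without and once with -p, so you can compare the resulting modes. File and folder names are made up. When tar runs as root it usually preserves permissions by default, so the difference mainly shows up for ordinary users.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"

# A file readable by everyone (mode 644), stored in an archive.
echo "hello" > shared.txt
chmod 644 shared.txt
tar -cf perms.tar shared.txt

# Extract under a restrictive umask, once without and once with -p.
# The subshell keeps the umask change from leaking into the rest of the session.
mkdir plain preserved
(
    umask 077
    tar -xf perms.tar -C plain        # ordinary users: umask applies, typically rw-------
    tar -xpf perms.tar -C preserved   # -p: the archived mode is restored, rw-r--r--
)

ls -l plain/shared.txt preserved/shared.txt
```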
5) Decompressing .tar.gz files
Oftentimes you'll come across files with a .tar.gz extension. To unpack one, first use the gunzip command, then tar.
$ gunzip myArchive.tar.gz
$ tar xf myArchive.tar
# Or, even faster...
$ zcat myArchive.tar.gz | tar xvf -
zcat is the same as using gunzip with the -c option, which outputs to standard out.
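tar can also call gzip for you through the z qualifier from the option list, which saves the separate gunzip or zcat step. A minimal sketch, with made-up folder and archive names:

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir project
seq 1 1000 > project/data.txt

# c + z + f: create the archive and gzip it in one step.
tar -czf project.tar.gz project

# t + z + f: list the contents without unpacking.
tar -tzf project.tar.gz

# x + z + f: decompress and extract in one step, here into ./restore.
mkdir restore
tar -xzf project.tar.gz -C restore
```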
A Windows comparison
If you're working with a Windows user, you may come across the .zip file extension. The command line can also handle zip files in a manner similar to tar files.
Zipping
The zip command is as easy to use as the gzip command. Its command modes come in two types - external and internal. Internal modes (delete and copy) operate exclusively on entries in an existing archive, while external modes (add, update and freshen) read both files from the file system and existing archives.
- add: Update existing entries and add new files. If the archive does not exist, create it. This is the default mode.
- -d: Delete selected entries.
- -f: Freshen. Update existing entries if newer on the file system. Does not add any new files to the archive.
- -r: Recursively zip (include files in subdirectories).
- -u: Update existing entries if newer on the file system, and add new files.
$ zip myArchive README.md todo.txt index.html
adding: README.md (deflated 90%)
adding: index.html (deflated 60%)
adding: todo.txt (deflated 61%)
Unzipping
To unzip, use the unzip command.
$ unzip myArchive.zip
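Putting the two commands together, here is a sketch that zips a whole folder, lists the archive, and unpacks it somewhere else. The docs folder and its files are invented for the example; unzip's -l (list) and -d (destination directory) options are extras not covered in the list above, and both zip and unzip need to be installed.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir docs
echo "readme" > docs/README.md
echo "tasks" > docs/todo.txt

# -r recurses into docs/ so the files inside it are included.
zip -r docs.zip docs

# List the archive contents without extracting.
unzip -l docs.zip

# Extract into a separate folder with -d.
unzip docs.zip -d unpacked
```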
Using cpio
Another command similar to tar is cpio. This program allows you to create archives from lists of filenames. It can also extract tar archives and copy files from a source to a device or file.
The cpio command is especially useful since you don't need to create an intermediate file before moving data to another disk.
Three modes
There are three modes that come with the cpio command.
1) Copy-out mode
The copy-out mode, selected with the -o or --create option, allows you to make an archive and copy files into it. Simply pass in a list of filenames (one per line) on standard input.
An easy way to generate a list of filenames is with the find command.
$ find ./sample
./sample
./sample/file1
./sample/file2
$ find ./sample | cpio -o > /media/myusb/sample.cpio
1 block
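This is great, but the file is uncompressed. To compress it, pipe the archive through gzip.
$ find ./sample | cpio -ov | gzip > /media/myusb/sample.cpio.gz
The -v option used here turns on verbose output. When building the file list with find, you may also want its -depth option, which lists a directory's contents before the directory itself.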
2) Copy-in mode
Whereas copy-out mode creates an archive, we can use copy-in to extract the archive contents. Simply pass in the -i or --extract option, and pass in the archive through standard input.
$ cpio -i < /media/myusb/sample.cpio
# If compressed:
$ gunzip -c /media/myusb/sample.cpio.gz | cpio -i
The -c option of gunzip outputs the contents to standard out.
3) Copy-pass mode
The third mode is copy-pass, which is useful for moving files from one directory tree to another without creating an intermediate archive. This mode can be thought of as a combination of the two mentioned above. To activate copy-pass mode, use the -p or --pass-through option.
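Here is a sketch of copy-pass mode in action, copying a small tree into another directory. The src and dest names are invented for the example; -d (create directories as needed) and -m (keep modification times) are additional cpio options, and cpio needs to be installed.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src/sub dest
echo "one" > src/file1
echo "two" > src/sub/file2

# Run find from inside src so the listed paths are relative,
# then let cpio copy each listed file into the destination directory.
# -p: copy-pass mode, -d: create directories as needed, -m: keep modification times.
(cd src && find . -depth | cpio -pdm "$workdir/dest")

ls -R dest
```

Running find from inside the source directory keeps the paths relative, so the tree is recreated directly under dest rather than under dest/src.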
More options
Here is a list of more options you can use with the cpio command.
- -a|--reset-access-time: Reset the access time of each file so it doesn't appear to have been read.
- -A|--append: For use with copy-out mode - appends data to an existing archive.
- -F|--file=filename: Use filename as the archive instead of standard in or standard out.
- -I filename: Read the archive from the specified filename instead of standard in.
- --no-absolute-filenames: Extract relative to the current directory, even if a full pathname is given.
- -O filename: Write the archive to the specified filename instead of the default standard out.
- -u|--unconditional: Replace all files, without asking whether to overwrite newer existing files with older ones.
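To see -O and -I at work, the sketch below does a full round trip without any shell redirection. The folder and archive names are made up; the -t option (list the archive contents) and -d option (create directories as needed) are extras not listed above, and cpio needs to be installed.

```shell
# Work in a throwaway directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir sample
echo "one" > sample/file1
echo "two" > sample/file2

# Copy-out: -O names the archive file instead of redirecting standard output.
find sample -depth | cpio -o -O sample.cpio

# List the archive: -t prints a table of contents, -I names the archive to read.
cpio -it -I sample.cpio

# Copy-in: extract into another folder; -d creates directories as needed.
mkdir restore
(cd restore && cpio -id -I ../sample.cpio)
```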