Using Compressed Data in Linux

Linux has a variety of tools for working with compressed data. This article will describe how to use them, and why.

Compressing a file reduces the disk space it occupies. The catch is that it takes CPU time to compress or uncompress a file, so compression is really a way to trade CPU power for disk space. For files you use constantly, this may not be a good trade. But we strongly encourage you to compress any data sets you are not using on a regular basis. The SSCC's current disk space was quite costly and we hope to avoid adding to it any sooner than necessary.

This article will not attempt to cover all the available compression tools or all the things they can do, just the most common usage. Full details are available by typing man and then the name of the command in Linux (e.g. man compress).

Compression Types

gzip/gunzip

gzip file will replace file with the compressed file.gz. gunzip file.gz will replace the compressed file with the original.
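As an illustration, here is a minimal round trip. The file name survey.csv and the scratch directory are made up for the example; substitute your own file.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# survey.csv is a hypothetical sample file.
printf 'id,score\n1,90\n2,85\n' > survey.csv

gzip survey.csv         # survey.csv is replaced by survey.csv.gz
ls                      # lists only survey.csv.gz

gunzip survey.csv.gz    # survey.csv.gz is replaced by survey.csv
ls                      # lists only survey.csv
```

The file that comes back should be identical to the one you started with.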

bzip2/bunzip2

bzip2 and bunzip2 are another variation on the same theme. bzip2 file will replace file with the compressed file.bz2. bunzip2 file.bz2 will replace the compressed file with the original. Note that in this case you must type the .bz2 at the end of the name of the file to be uncompressed.
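A short sketch of the same round trip with bzip2, again using a made-up survey.csv in a scratch directory. The last command uses the -k (keep) option, which is not discussed above; it leaves the original in place next to the compressed copy.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# survey.csv is a hypothetical sample file.
printf 'id,score\n1,90\n2,85\n' > survey.csv

bzip2 survey.csv          # survey.csv is replaced by survey.csv.bz2
bunzip2 survey.csv.bz2    # give the full name, including .bz2

bzip2 -k survey.csv       # -k keeps survey.csv and also writes survey.csv.bz2
ls                        # lists survey.csv and survey.csv.bz2
```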

zip/unzip

zip works slightly differently in that it asks you to name the compressed file: zip compressedFile file will create compressedFile.zip (the .zip is added automatically), containing a compressed version of file. The original file is not removed. unzip compressedFile will recreate the original file. The compressed file is not removed.

compress/uncompress

The compress and uncompress commands are older, but you might still run across files in that format. They are very easy to use: 

  • compress file will replace file with the compressed file, file.Z (think zipped).

  • uncompress file will replace the compressed file with the original. uncompress doesn't care if you include the .Z at the end or not; it will find the file either way.

7-Zip

The 7-Zip program handles many different compression and archive types (gzip, bzip2, tar, zip) as well as its own highly compressed 7z format. Its main program is 7za and the syntax is slightly different:

  • To create an archive, use 7za a (for add): 7za a data.zip data20231115.csv
  • To extract, use 7za x (for extract): 7za x data.zip 

Which Command Should I Use?

Unfortunately, which command will work best depends on the exact properties of the file you're working with. bzip2 will usually give the best compression, while zip files are more easily used on Windows.
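One way to find out for a particular file is simply to try. The sketch below builds a made-up sample.csv and reports how many bytes gzip and bzip2 each produce. It relies on the -c option, which writes the compressed data to standard output and leaves the file itself alone. Results on your own data may differ considerably.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical, fairly repetitive data file to experiment on.
i=1
while [ "$i" -le 2000 ]; do
    printf '%d,respondent,%d\n' "$i" $((i % 7))
    i=$((i + 1))
done > sample.csv

# -c sends the compressed data to standard output; wc -c counts the bytes.
original_size=$(wc -c < sample.csv)
gzip_size=$(gzip -c sample.csv | wc -c)
bzip2_size=$(bzip2 -c sample.csv | wc -c)

echo "original: $original_size bytes"
echo "gzip:     $gzip_size bytes"
echo "bzip2:    $bzip2_size bytes"
```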

How Do I Uncompress this File?

Suppose you've obtained a file, perhaps via email or from the web, and you know it's compressed but you don't know what program was used to compress it. Look at the last letters of the file name, following the period:

Last Letters of the File Name    Program It Was Probably Compressed With
.Z                               compress
.gz                              gzip
.bz2                             bzip2
.zip                             zip (possibly a Windows program like WinZip)
.7z                              7-Zip

Note that both uncompress and unzip will handle Windows .zip files just fine. Feel free to just experiment: if you try to uncompress a file using a program that can't read the needed format, it will just give you an error message and quit.
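The lookup by file-name ending can be sketched as a small shell function. suggest_tool is a hypothetical helper written for this example, not an existing command; it only looks at the name and prints the matching uncompress command from this article, without touching the file.

```shell
# suggest_tool: hypothetical helper that maps a file name's ending
# to the command this article uses to uncompress it.
suggest_tool() {
    case "$1" in
        *.Z)   echo "uncompress" ;;
        *.gz)  echo "gunzip" ;;
        *.bz2) echo "bunzip2" ;;
        *.zip) echo "unzip" ;;
        *.7z)  echo "7za x" ;;
        *)     echo "unknown" ;;
    esac
}

suggest_tool data.csv.gz    # prints gunzip
suggest_tool archive.zip    # prints unzip
suggest_tool notes.txt      # prints unknown
```

The match is case sensitive, so a file ending in .Z is treated as compress output while one ending in .z is reported as unknown.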

zcat/bzcat

The zcat command reads a compressed file and sends the results to the standard output (use bzcat with bzip2). Just typing zcat file where file is a compressed file will display the contents of the file on the screen. But the real point is to use the results in other programs. For example, to see the results one page at a time, pipe the output to the more command: zcat file | more.

Both SAS and Stata can read directly from the output of the zcat command. For instructions see Using Compressed Data in SAS or Using Stata on Linux. Note that SAS has compression built in as a dataset option. Stata users should consider using the user-written gzsave and gzuse commands. These act just like the regular save and use commands, but the file on disk is compressed just as if you had used gzip on it.
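Here is a small sketch of piping zcat and bzcat output into other programs, using a made-up survey.csv in a scratch directory. In both cases the compressed file on disk is left untouched and no uncompressed copy is written.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# survey.csv is a hypothetical sample file; compress it with gzip.
printf 'id,score\n1,90\n2,85\n3,70\n' > survey.csv
gzip survey.csv

# Look at the first two lines without uncompressing the file on disk.
zcat survey.csv.gz | head -n 2

# Count the lines by piping the uncompressed stream to wc.
row_count=$(zcat survey.csv.gz | wc -l)
echo "lines: $row_count"

# bzcat does the same for bzip2 files. Here a .bz2 copy is made by
# piping the uncompressed stream straight into bzip2.
zcat survey.csv.gz | bzip2 > survey.csv.bz2
first_line=$(bzcat survey.csv.bz2 | head -n 1)
echo "header: $first_line"
```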



Keywords: compressed, data, linux
Doc ID: 96033
Owned by: Russell D. in Social Science Computing Cooperative
Created: 2019-11-20
Updated: 2023-11-15
Sites: Social Science Computing Cooperative