File management/Archive files

Archive Format

When picking an archive file format, you should factor in the type of content and the expected use.

For example, if you want your archive file to be readable without hassle on a computer at school or university, use a widely supported format such as ZIP, GZip, or 7-Zip. ZIP is preferable, since both Windows and Linux support it out of the box. However, GZip or 7-Zip may be the better choice for data larger than 4 GB when ZIP64 support is uncertain. 7-Zip achieves a higher compression ratio, which especially benefits text-based data, but GZip is much faster. The 7-Zip tool, which can also open GZip files, may be pre-installed at schools and universities.

If your aim is personal archival, you have more freedom in choosing a format. For example, you may wish to use LZip, whose developers claim to have designed it specifically for long-term archival and good recoverability. However, since it was developed in the late 2000s, more than a decade after GZip and ZIP, support for it is bundled with fewer operating systems and archive managers.[1]

Note that the developer(s) of the LZO format have opted to exclude their site "Oberhumer.com", which hosts the source code files and documentation of the format, from the Wayback Machine. This suggests that the developers lack interest in having the LZO format's source code files and documentation preserved, which makes the format untrustworthy for long-term storage.[2] Proprietary formats such as RAR are also best avoided unless some of their functionality, such as archive comments or the fine-grained controls of the archive manager, is specifically needed. An open-source tool for reading RAR files, "unrar", does exist, though.[3]

Performance comparisons between the formats have been conducted by CatchChallenger and LinuxAria.

Handling many small files

In this experiment, we examine how efficiently various archive formats handle compressing many small files.

First, we create a high number of blank files.

In this experiment, this was done inside a RAM drive. How a RAM drive is created is out of scope for this page, and can be read elsewhere.

In the Linux terminal, which uses Bash, we enter:

mkdir sub1 # actual name used in the test
touch sub1/{00000..70000}

The number of files was picked because it is higher than 2^16 (65,536), so the archive manager has to use the ZIP64 format.
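Before archiving, it can be worth confirming that the file count really exceeds the 2^16 threshold. The following is a minimal sketch and not part of the original experiment: it recreates the files in a scratch directory made with mktemp (an assumption, to avoid touching existing data) and uses seq with xargs instead of brace expansion, so that it also runs in shells other than Bash.

```shell
# Scratch directory (assumption: any writable temporary location will do).
workdir=$(mktemp -d)
mkdir "$workdir/sub1"

# seq -w pads with zeros (00000 ... 70000); xargs passes the names to
# touch in batches, which avoids overly long command lines.
(cd "$workdir/sub1" && seq -w 0 70000 | xargs touch)

# Count the regular files; tr strips the padding some wc versions add.
count=$(find "$workdir/sub1" -type f | wc -l | tr -d ' ')
echo "created $count files"

# 2^16 = 65536; the range 0..70000 includes both ends.
if [ "$count" -gt 65536 ]; then
    echo "more than 65536 files"
fi
```

Because both ends of the range are included, the count is 70,001 rather than 70,000.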

In Windows, a batch script loop that creates that many empty files could be used.

::(to be added)

Now, we create archives in various formats. In this experiment, the "file roller" and "7z" utilities were used. For the latter, the command 7z a sub1-70K-file-test-2.7z sub1 was used without additional parameters, so the default compression settings apply.
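The tar-based archives can presumably also be produced from the command line instead of "file roller". The sketch below is an assumption about an equivalent workflow, not the procedure used in the experiment. It uses a small stand-in directory of 100 empty files, and the output names are hypothetical. The bzip2 and xz variants are only attempted if the respective compressor is installed.

```shell
# Small stand-in for sub1 in a scratch directory (100 files, not 70,000).
workdir=$(mktemp -d)
mkdir "$workdir/sub1"
(cd "$workdir/sub1" && seq -w 0 99 | xargs touch)

# Plain tar: bundles the files without compressing them.
tar -cf "$workdir/sub1.tar" -C "$workdir" sub1

# gzip-compressed tarball (-z).
tar -czf "$workdir/sub1.tar.gz" -C "$workdir" sub1

# bzip2 (-j) and xz (-J), only if the compressors are available.
command -v bzip2 >/dev/null && tar -cjf "$workdir/sub1.tar.bz2" -C "$workdir" sub1
command -v xz >/dev/null && tar -cJf "$workdir/sub1.tar.xz" -C "$workdir" sub1

# List what was produced.
ls -l "$workdir"
```

The -C option makes tar change into the scratch directory first, so the archive members are stored as sub1/... rather than with the full temporary path.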

Results

These results were generated with the disk usage command-line utility, using the command du -s -h. The bare "sub1" entry is the directory that contains the files.
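As an illustration of how such a listing is obtained, the sketch below runs du on a small stand-in directory and one archive. The scratch directory and the file count of 100 are assumptions for the example. The second call uses -k instead of -h, because plain kibibyte numbers can be sorted numerically.

```shell
# Small stand-in: 100 empty files and an uncompressed tar of them.
workdir=$(mktemp -d)
mkdir "$workdir/sub1"
(cd "$workdir/sub1" && seq -w 0 99 | xargs touch)
tar -cf "$workdir/sub1.tar" -C "$workdir" sub1

# -s prints one summary line per argument, -h uses human-readable units.
du -s -h "$workdir/sub1" "$workdir/sub1.tar"

# -k prints sizes as integers in 1024-byte units, smallest entry first.
report=$(du -s -k "$workdir"/* | sort -n)
echo "$report"
```

Note that du reports the space occupied on the file system, which can differ from the byte length of a file.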

1,4M sub1
531K sub1-70K-file-test.tar.gz
219K sub1-70K-files.tar.lz
220K sub1-70K-files.tar.xz
 24K sub1-70K-file-test-2.7z
125K sub1-70K-file-test.7z
177K sub1-70K-file-test.tar.bz2
4,1M sub1-70K-test.ar
5,9M sub1-70K-test.cpio
9,9M sub1-70K-test.ear
6,5M sub1-over65K.zip
 35M sub1.tar
219K sub1.tar.lzma
930K sub1.tar.lzo
9,9M sub1.war


The large size disparity between some of the formats is caused by the table of contents not being compressed: in those archives, the file names can be read with a byte editor (also known as a hex editor). Additionally, in the .tar file, there is a gap of many null bytes between the items, which adds to the size of the resulting file.

References

  1. Lzip manual
  2. https://web.archive.org/web/*/https://oberhumer.com/ "Sorry. This URL has been excluded from the Wayback Machine." (as of October 2022)
  3. https://unrar.org/developers/