CMIX is a free, open-source lossless data compression programme. It is designed to achieve a high compression ratio at the cost of significant CPU time and memory. The programme is distributed under the GNU General Public License, so it may be freely used, modified, and redistributed. According to the documentation, CMIX runs on Linux, Windows, and Mac OS X, and at least 32 GB of RAM is recommended.
CMIX is a context-mixing compressor: it runs many prediction models side by side and combines their outputs with neural networks to produce a single, sharper prediction of the next symbol. Data passes through four phases: preprocessing, prediction, mixing, and encoding. The heart of the programme is the Predictor class, which orchestrates all compression models and the neural network mixing process. This class manages three primary subsystems, each combining multiple specialised compression models optimised for different data patterns.
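The mixing phase can be illustrated with a minimal sketch of logistic mixing, the general technique used by PAQ-style context-mixing compressors. This is not CMIX's code: the class name LogisticMixer, the learning rate, and the two toy models are assumptions made for illustration, and the real mixer is considerably more elaborate.

```python
import math


def stretch(p):
    """Logit: map a probability in (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))


class LogisticMixer:
    """Hypothetical single-layer mixer over per-model bit predictions."""

    def __init__(self, n_inputs, learning_rate=0.02):
        self.weights = [0.0] * n_inputs
        self.learning_rate = learning_rate  # assumed value, not taken from cmix
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def mix(self, probabilities):
        """Combine each model's P(next bit = 1) into one probability."""
        # Clamp so stretch() never sees exactly 0 or 1.
        self.inputs = [stretch(min(max(p, 1e-6), 1.0 - 1e-6))
                       for p in probabilities]
        dot = sum(w * x for w, x in zip(self.weights, self.inputs))
        self.p = squash(dot)
        return self.p

    def update(self, bit):
        """One gradient step on coding cost once the real bit is known."""
        err = bit - self.p
        self.weights = [w + self.learning_rate * err * x
                        for w, x in zip(self.weights, self.inputs)]


mixer = LogisticMixer(n_inputs=2)
for _ in range(200):
    mixer.mix([0.9, 0.5])  # model 0 is confident, model 1 is uninformative
    mixer.update(1)        # the bit that actually occurred
```

After training on a stream where the confident model is always right, the mixer's weight for that model grows and the mixed probability rises well above 0.5, while the uninformative model's weight stays at zero. An encoder would then spend fewer bits on each correctly predicted symbol.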
The build system for CMIX produces three executables. The primary executable, cmix, is the main compression and decompression engine and integrates all components, including models, mixers, encoders, and preprocessors. The second, enwik9-preproc, is a specialised preprocessor for the enwik9 benchmark that handles tasks such as article reordering and text processing. The third, remap, is a utility that implements the article reordering algorithms.
CMIX is built with a simple Makefile, and compilation requires the clang++-17 compiler. The main cmix executable is built from over 35 source files, including core components such as encoder.cpp, decoder.cpp, predictor.cpp, and context-manager.cpp, and model files such as paq8.cpp, ppmd.cpp, and fxcmv1.cpp. The runner.cpp file implements the main entry point, providing a mode-based command-line interface.
The command-line interface selects its mode with a single flag. Standard compression uses cmix -c [input] [output], and dictionary-enhanced compression uses cmix -c [dictionary] [input] [output]. Text mode, which forces text preprocessing, is cmix -t [dictionary] [input] [output]. Raw compression without any preprocessing is cmix -n [input] [output]. There is also a store-only mode, cmix -s [input] [output], which preprocesses without compressing, and a decompression mode, cmix -d [input] [output].
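For scripting, the modes above can be wrapped in a small helper that assembles the argument list. The sketch below is hypothetical: build_cmix_command, the mode names, and the example file names are not part of CMIX, and only the flags and argument order come from the syntax listed above.

```python
# Flags as listed in the text; the mode names on the left are invented here.
MODES = {
    "compress": "-c",
    "text": "-t",
    "raw": "-n",
    "store": "-s",
    "decompress": "-d",
}


def build_cmix_command(mode, input_path, output_path,
                       dictionary=None, binary="cmix"):
    """Return an argv list for one cmix invocation (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "text" and dictionary is None:
        raise ValueError("text mode is documented with a dictionary argument")
    if dictionary is not None and mode not in ("compress", "text"):
        raise ValueError(f"no dictionary form is listed for mode {mode!r}")

    argv = [binary, MODES[mode]]
    if dictionary is not None:
        argv.append(dictionary)  # the dictionary precedes input and output
    argv += [input_path, output_path]
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True). Building a list rather than a shell string avoids quoting problems with file names that contain spaces.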
CMIX compresses and decompresses single files only. To handle multiple files or directories, users must first bundle them into one archive file with a tool such as tar. For certain file types, the documentation suggests that preprocessing with an external tool called "precomp" may improve the final compression ratio.
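The archiving step can be done with the tar command itself or, as in this minimal sketch, with Python's standard tarfile module. The function names and paths are illustrative assumptions. The archive is deliberately written without gzip or bzip2, on the assumption that an uncompressed tar leaves the redundancy in the data visible to the compressor that runs afterwards.

```python
import os
import tarfile
import tempfile


def bundle_directory(source_dir, tar_path):
    """Pack source_dir into one uncompressed tar file (hypothetical helper)."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w") as tar:  # mode "w": no compression layer
        tar.add(source_dir, arcname=top_level)
    return tar_path


def list_members(tar_path):
    """Return the sorted member names stored in the archive."""
    with tarfile.open(tar_path, "r") as tar:
        return sorted(tar.getnames())
```

The resulting .tar file is then an ordinary single input for the compressor. After decompression, the archive can be unpacked with tar or tarfile to restore the original directory tree.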
For the fastest performance, the documentation recommends compiling with the flags -Ofast -march=native. It adds a caveat: these compiler options may lead to incompatibility between different computers due to differences in floating-point precision. This matters because a decompressor has to reproduce the compressor's predictions exactly, so a file compressed on one machine may not decompress correctly on another. For the Hutter Prize submission (cmix-hp v3), the reported decompression time was approximately 42.8 hours, with a maximum RAM usage of 6905 MiB and disk usage of around 35 GB, on a system with 32 GB of RAM.
The CMIX documentation acknowledges contributions from many individuals in the data compression community, including Matt Mahoney, Alex Rhatushnyak, Eugene Shelwien, Márcio Pais, Kaido Orav, Mathieu Chartier, Fabrice Bellard, and Artemiy Margaritov. The project also thanks AI Grant for funding.
