The recently published paper “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors [1] has received considerable public attention in recent days. Its key finding is that, in some cases, it can outperform large language models like BERT using the simple idea that two text documents are more similar if they can be compressed to a smaller file size (although there is some controversy about the validity of the results; see this blog post and this discussion including the authors’ reply).
The main idea behind their approach is that the “information distance” between two documents, as defined by Bennett et al. [2], is a good distance metric to use for text classification. Since the information distance itself isn’t computable, they approximate it using the Normalized Compression Distance (NCD) [3], which estimates the information distance using real-life data compressors like gzip. NCD has the property that better compressors (i.e. compressors with a better compression ratio) allow for a better estimation of the true information distance.
It seems natural then to expect that better compressors will achieve superior classification performance. But the authors cannot validate this experimentally; the best compressor considered in the paper, bz2, performs worse in terms of accuracy than gzip. They explain this as follows: “[…] the Burrows-Wheeler algorithm used by bz2 dismisses the information of character order by permuting characters during compression” [1, p.6817]. This implies that compression alone doesn’t explain their findings, but that character order plays a role as well.
This made me wonder: How much of their results are due to compression, and how much is due to the comparison of strings between two documents?
To investigate this question, I compare their results with two alternatives: (1) a simple compressor that relies solely on replacing recurring strings, and (2) an algorithm that explicitly does substring matching between a query document and all documents belonging to some topic.
A first ablation: Will LZ4 do the job?
The compression algorithm gzip is based on DEFLATE, which in turn uses LZ77 and Huffman coding to compress the data. Let’s look at these two algorithms in more detail and think about what they imply when applied to our use case.
During compression, LZ77 uses a sliding window over previously seen data. If a string of characters is repeated, a reference to the first occurrence of the character string is stored, instead of the string itself. Hence, if we take the length of two concatenated documents as a distance metric, documents will be closer together if they have more overlapping substrings within the size of the sliding window (typically 32KB).
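This effect is easy to observe with zlib, which implements DEFLATE (LZ77 plus Huffman coding); the example documents below are made up for illustration:

```python
import random
import zlib

doc = b"breaking news: the central bank raised interest rates today " * 15
# A document sharing long substrings with `doc`:
overlap = doc[:450] + b" and markets reacted sharply"
# Incompressible bytes of the same length, sharing no structure with `doc`:
disjoint = random.Random(42).randbytes(478)

base = len(zlib.compress(doc))
with_overlap = len(zlib.compress(doc + overlap))
with_disjoint = len(zlib.compress(doc + disjoint))
# Appending the overlapping document adds almost nothing: LZ77 encodes
# it as back-references into the sliding window. The disjoint bytes
# must be stored almost in full.
```

Concatenating the overlapping document increases the compressed size by only a few dozen bytes, while the disjoint bytes add nearly their full length, which is exactly the signal the distance metric exploits.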
Huffman coding compresses the resulting text even further: instead of using 8 bits for every character, it represents letters that occur frequently with fewer bits and letters that occur rarely with more bits. If we apply Huffman coding to the concatenated documents, two documents will be smaller after compression if they use characters with similar frequencies.
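To build intuition, here is a small helper that computes the optimal Huffman-coded size of a text from its own character frequencies (a sketch for illustration; real DEFLATE interleaves this with LZ77). It uses the standard fact that the total encoded length equals the sum of the merged node weights:

```python
import heapq
from collections import Counter

def huffman_bits(text: str) -> int:
    # Total number of bits needed to encode `text` with an optimal
    # Huffman code built from its own character frequencies.
    freq = Counter(text)
    if len(freq) < 2:
        return len(text)  # degenerate case: a single distinct symbol
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every merge adds one
        # bit to the code of each character below it.
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total
```

Comparing `huffman_bits(x + y)` against `huffman_bits(x) + huffman_bits(y)` shows the effect: when the two texts have similar character distributions, the shared code costs almost nothing extra; when they differ, the compromise code is suboptimal for both.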
One would expect matching substrings to be much more important for topic classification than similar character frequencies. I therefore run an ablation study, measuring the performance of the LZ4 algorithm for compression (essentially LZ77, with a fast implementation available in Python). Since LZ4 has a much lower compression ratio than gzip, the authors’ explanation would suggest that LZ4 performs worse than gzip. If, however, substring matching is the main thing going on, LZ4 will perform just as well as gzip.
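The NCD-based classifier is agnostic to the compressor used, which makes this ablation a one-line change. Below is a minimal sketch using 1-nearest-neighbour classification (the paper uses a kNN classifier; k = 1 here for brevity, and the function names are my own):

```python
import bz2
import zlib

def ncd(x: bytes, y: bytes, compress) -> float:
    # NCD parameterized by the compressor.
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)

def classify(query: bytes, train, compress=zlib.compress) -> str:
    # 1-nearest-neighbour under NCD. `train` is a list of (text, label).
    # Any compress(bytes) -> bytes function works: zlib.compress,
    # bz2.compress, or (assuming the third-party `lz4` package is
    # installed) lz4.frame.compress.
    _, label = min(train, key=lambda item: ncd(query, item[0], compress))
    return label
```

Swapping the compressor only changes the `compress` argument, so gzip, bz2 and LZ4 can be compared under otherwise identical conditions.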
A more explicit algorithm
To go even further, I implement a simple algorithm that explicitly does the substring matching: It assigns documents to the topic with the most similar substrings (substrings being character-level n-grams here). It works as follows:
For a single text, compute its set of hashed n-grams:
1. Extract all character n-grams within the text with 5 ≤ n ≤ 50.
2. For each extracted n-gram, calculate a hash code using `hash(n_gram) % int(10e8)` in Python (since I want to control how many different things to keep track of).
3. Keep these codes in a set (thus losing the information about how many times a given code occurs).

Training then consists of:
1. Calculating the set of hash codes for every text of a given topic.
2. Taking the union of these sets to obtain the set of hash codes that occur in the topic.

At inference time:
1. For a query text, calculate the set of its hashed n-grams.
2. For every topic encountered during training, calculate the size of the intersection between the topic’s set and the query set.
3. Assign the query text to the topic with the largest intersection.
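The steps above can be sketched in a few lines of Python (a minimal version of my proposed algorithm; note that Python’s built-in `hash` for strings is salted per process via PYTHONHASHSEED, so the codes are only comparable within a single run):

```python
def ngram_hashes(text: str, n_min: int = 5, n_max: int = 50) -> set:
    # Extract all character n-grams with n_min <= n <= n_max and bucket
    # them into hash codes, keeping only the set of codes (occurrence
    # counts are deliberately discarded).
    codes = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            codes.add(hash(text[i:i + n]) % int(10e8))
    return codes

def train(labelled_texts):
    # Training: union the hash sets of all texts belonging to each topic.
    topics = {}
    for text, label in labelled_texts:
        topics.setdefault(label, set()).update(ngram_hashes(text))
    return topics

def classify(query: str, topics: dict):
    # Inference: assign the query to the topic whose hash set has the
    # largest intersection with the query's hash set.
    query_codes = ngram_hashes(query)
    return max(topics, key=lambda topic: len(topics[topic] & query_codes))
```

Inference touches one set per topic rather than one set per training document, which is where the complexity win discussed below comes from.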
Experiment and Results
I compare the results of gzip, LZ4 and the hashed n-gram algorithm in the 100-shot setting with 5 runs, as described in their paper. For all three methods I stick to their experimental setup in order to reproduce their reported results (again leaving us with potentially inflated accuracy measures). The code can be found on GitHub.
We see that LZ4 and the hashed n-grams outperform gzip on all three considered data sets, with the hashed n-gram algorithm being best on two of the three. It still doesn’t compete with BERT, though, which reaches an accuracy of 0.82 on AG_NEWS and close to 1 on DBpedia in the 100-shot setting, according to their paper.
These results have important practical implications: in our experiment, the LZ4-based algorithm runs approximately 10x faster than the gzip-based one. Even more importantly, the hashed n-gram algorithm improves the computational complexity at inference time: instead of having to compare a query document with every other document in the text corpus, you only have to compare it with every topic set.
What do we learn from this?
My results suggest that the impressive reported performance of gzip can largely be attributed to the fact that the method implicitly compares character-level n-grams between documents. This finding allows for the use of faster compressors like LZ4 without any loss in performance. Furthermore, one can rewrite the algorithm to have constant complexity with respect to dataset size at inference time, bringing the method a big step closer to practical use on big data sets. If you do want to use it in practice, I have started an implementation of my proposed algorithm following the scikit-learn API here.
The one question that remains is why this outperforms the TF-IDF approach, which also compares words between documents.
Maybe considering character-level n-grams does a better job on some tasks than splitting the text into individual words. But probably more importantly, the method here gives equal weight to all n-grams, regardless of their occurrence count. This means that it gives a lot of importance to so-called long-tail (read: rare) information, which apparently is relevant for some text tasks such as topic detection. Note that transformer networks don’t do too good a job of modelling such long-tail information (for evidence, see e.g. [4]), which is why these very simple approaches are an interesting benchmark to measure your million-parameter model against.
[1] Z. Jiang, M. Yang, M. Tsirlin, R. Tang, Y. Dai, J. Lin, “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors (2023), ACL 2023
[2] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek, Information distance (1998), IEEE Transactions on Information Theory
[3] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi, The similarity metric (2004), IEEE Transactions on Information Theory
[4] N. Kandpal, H. Deng, A. Roberts, E. Wallace, C. Raffel, Large Language Models Struggle to Learn Long-Tail Knowledge (2022), arxiv.org/abs/2211.08411
Thanks to Joel Niklaus and Marcel Gygli for the discussions and feedback.