
New Hutter Prize Awarded for Even Smaller Data Compression Milestone (google.com)

Since 2006, Baldrson (Slashdot reader #78,598) has been part of the team verifying submissions to the Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of Wikipedia (approximately the amount of text a human can read in a lifetime).

"The intention of this prize is to encourage development of intelligent compressors/programs as a path to Artificial General Intelligence," explains the project's web site. 15 years ago, Baldrson wrote a Slashdot post explaining the logic (titled "Compress Wikipedia and Win AI Prize"): The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
The amount of the prize also increases with how much compression is achieved. (So if you compress the 1GB file x% better than the current record, you'll receive x% of the prize...) The first prize was awarded in 2006. And now Baldrson writes: Kaido Orav has just improved on the Hutter Prize for Lossless Compression of Human Knowledge by 1.38% with his "fx-cmix" entry.
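
As a rough illustration of that payout rule, here is a minimal sketch: the rule itself (prize share proportional to the relative size improvement) is taken from the summary above, while the byte counts and the function name are hypothetical placeholders.

    # Sketch of the proportional payout rule described above.
    # The sizes below are made-up placeholders, not official record figures.
    def prize_fraction(previous_record_bytes: int, new_total_bytes: int) -> float:
        """Fraction of the prize fund earned by a new, smaller entry."""
        return (previous_record_bytes - new_total_bytes) / previous_record_bytes

    previous_record = 115_000_000   # hypothetical previous winning size, in bytes
    new_entry       = 113_400_000   # hypothetical new entry's size, in bytes
    print(f"Relative improvement: {prize_fraction(previous_record, new_entry):.2%}")
    # ~1.39% here, i.e. roughly the scale of the 1.38% improvement reported above.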

The competition seems to be heating up, with this winner coming a mere six months after the prior winner. This is all the more impressive since each improvement brings the benchmark closer to the (unknown) minimum size, the Kolmogorov complexity of the data.

  • ...middle-out compression.
  • by Calydor ( 739835 ) on Saturday February 10, 2024 @03:10PM (#64230578)

    The summary can't seem to make up its mind if we're compressing 100 MB or 1 GB.

    • It's hard to imagine writing a worse piece.

      Effort was clearly put into making sure it's terrible.
    • The Task
      Losslessly compress the 1GB file enwik9 to less than 112MB. More precisely:

              Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S := S1 + S2 [comes in under that limit]

      So the current objective is to compress 1 GB of Wikipedia text to less than 112 MB.
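
      As a concrete sketch of that size metric, a verifier essentially just adds the two file sizes and compares against the record. The file names below are the ones from the quoted rules and the 112 MB threshold is the rounded figure above; everything else is assumed.

          import os

          # Headline constraint from the quoted rules: total size
          # S = S1 (self-contained compressor) + S2 (the archive it produces)
          # must come in under the current record (roughly 112 MB for enwik9).
          LIMIT_BYTES = 112 * 10**6            # rounded threshold quoted above

          s1 = os.path.getsize("comp.exe")     # S1: the compressor itself
          s2 = os.path.getsize("archive.exe")  # S2: the self-extracting archive
          s = s1 + s2
          verdict = "qualifies" if s < LIMIT_BYTES else "misses the cut"
          print(f"S = S1 + S2 = {s:,} bytes -> {verdict}")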

    • There are apparently two contests: one for 100MB and one for 1G. The new record applies to the 1G contest.

      IMHO, 1 GB is still too small. Traditional compression methods do better than neural networks for small datasets, but they don't scale. It's understandable why they don't do a "1% of CommonCrawl" contest as it would take forever to verify, but that really would've been better.
      • by Baldrson ( 78598 ) *

        There is only one Hutter Prize contest and it's for 1GB. 100MB was the original size for the Hutter Prize starting in 2006, but it was increased to 1GB in 2020, along with a factor of 10 increase in the payout per incremental improvement. See the "Hutter Prize History" [hutter1.net].

        Insofar as the size is concerned: The purpose of the Hutter Prize is research into radically better means of automated data-driven model creation, not biased by what Sara Hooker has called "The Hardware Lottery" [arxiv.org]. One of the primary limita

        • I'm not saying that neural nets make particularly good use of the data. What they have is scalability.

          My point is that English text might be a weak indication of the state of the world. People learn how the world works from a lot of extra data, and a very large amount of it if we count everything our senses pick up. So a method that optimizes for heavily compressing a small amount of text may face inherent limits before it becomes useful in an AI context.

          To put it differently: suppose that the text wa
          • by Baldrson ( 78598 ) *

            kvezach writes: To put it differently: suppose that the text was 10 bytes long.

            A better way of thinking about data scaling is to ask "How many digits of Pi would it take before, say, John Tromp's 401-bit binary lambda algorithm that generates Pi [github.io] would become a better model than the literal string of those apparently-random digits?" (And by "better" I mean not only that it would be shorter than those digits, but that it would extrapolate to (ie: "predict") the next digit of Pi.)

            In terms of how much data huma
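
            Making that break-even concrete (back-of-the-envelope only; the 401-bit figure comes from the comment above, the rest is simple arithmetic):

                import math

                # A literal decimal digit costs about log2(10) ~ 3.32 bits to store.
                # A fixed-size generator (e.g. the 401-bit lambda term mentioned
                # above) wins once the literal digits would cost more bits than that.
                program_bits = 401
                bits_per_digit = math.log2(10)
                break_even = math.ceil(program_bits / bits_per_digit)
                print(f"Generator beats literal storage after ~{break_even} digits")
                # ~121 digits -- and unlike the literal string, the generator also
                # "predicts" every digit that follows.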

          • I think you're conflating the scalability of the AI infrastructure (the ability to train billions of parameters on terabytes of data by throwing enough current hardware at the problem) with the inherent scalability of the AI models themselves.

            We have certainly figured out the first type of scalability in 2024, but it is not at all true that we have cracked the second type. To the contrary, there are strong reasons to believe that in fact our current AI models are extremely unscalable, much too parameter h

        • by AmiMoJo ( 196126 )

          Another interesting requirement for the prize is:

          "Must run in â50 hours using a single CPU core and 10GB RAM and 100GB HDD on our test machine."

          Unfortunately the link to the test machine is dead, but it's probably fairly low end by modern standards. This is good because there are lots of very good compression algorithms that are practically useless because they are so slow or require so much memory.

          It's a shame they don't include any non-text data in their data file. I suppose they are focusing on text
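
          A minimal sketch of how limits like those might be enforced when verifying an entry; the 50-hour and 10GB figures are from the quoted rule, while the binary name and the use of Python's resource module are assumptions.

              import resource, subprocess, time

              RAM_LIMIT  = 10 * 2**30   # 10 GB address-space cap, per the quoted rule
              TIME_LIMIT = 50 * 3600    # 50 hours of wall-clock time, in seconds

              def cap_memory():
                  # Runs in the child process just before the compressor starts.
                  resource.setrlimit(resource.RLIMIT_AS, (RAM_LIMIT, RAM_LIMIT))

              start = time.monotonic()
              # "./entry_compressor" is a placeholder for a contestant's binary.
              subprocess.run(["./entry_compressor", "enwik9"],
                             preexec_fn=cap_memory, check=True)
              elapsed = time.monotonic() - start
              status = "within" if elapsed <= TIME_LIMIT else "over"
              print(f"Ran in {elapsed / 3600:.1f} h ({status} the 50 h budget)")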

      • by Rei ( 128717 )

        Also, given the prize's goals, having it be about perfect compression makes no sense. If you're demanding perfection, then it's not simply about the underlying structure of reality, as that will always have exceptions. So you're forced to simultaneously model reality but also find a way to encode the exceptions... the latter part of which isn't really the goal.

        I'd think that given the intent of the prize, the rating should be a combination of (A) the degree of compression, and (B) the accuracy of the compre

        • by Kaenneth ( 82978 )

          Wouldn't accuracy be folded into the overall size?

          If you have a model that generates the gigabyte of text, but with a number of known errors, you just append the errata data to the end for a correction pass.

          A 100% accurate model would need 0 errata, so the error rate just becomes a size penalty.
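
          A toy sketch of that idea (nothing here is prize-specific; the strings are stand-ins, and a real scheme would also have to encode the error positions):

              import zlib

              # A predictive model regenerates the text with a few mistakes; an
              # errata pass patches them, so imperfect prediction shows up as
              # extra bytes rather than as lossiness.
              original  = b"the quick brown fox jumps over the lazy dog" * 1000
              generated = b"the quick brown fox jumps over the lazy cat" * 1000  # imperfect model output

              errata = [(i, original[i]) for i in range(len(original))
                        if original[i] != generated[i]]
              errata_bytes = zlib.compress(bytes(b for _, b in errata))
              print(f"{len(errata)} wrong bytes; corrections compress to {len(errata_bytes)} bytes")

              # Lossless reconstruction: apply the errata on top of the model output.
              patched = bytearray(generated)
              for i, b in errata:
                  patched[i] = b
              assert bytes(patched) == original  # a perfect model would need 0 errata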

          • by Rei ( 128717 )

            No, because it's also an errata compression task.

            The simple fact is that people are going to choose different wording (or even *wrong* wording) to describe the same thing. Learning to compress *that* isn't useful.

  • AI can't spell "Occam". Just like the AI voices that have trouble with English sentences.
  • by CrappySnackPlane ( 7852536 ) on Saturday February 10, 2024 @05:44PM (#64230886)

    Where do I claim my share of the prize?

  • William of Occam. No "K" in there. No "H" in there. Occam's razor illiterate weekend slashdot editor turd.

    • by Kaenneth ( 82978 )

      Ironic.

      Try reading next time.

    • William of Ockham (often spelled Occam) was born in a town called Ockham in Surrey, England. His name was William. "ham" is very common in town names in English, since ham in Old English means village (more or less) and the Anglo Saxons liked naming settlements in literal terms.

      His name was often Latinized as "Occam", but Ockham is a valid and correct spelling.

      So, the slashdot editor is not the illiterate turd here.

  • We are likely nowhere near the true Kolmogorov complexity. Note the restriction on running time/space. The Kolmogorov complexity is defined without regard for running time and, in all likelihood, that's going to use some algorithm that's hugely super exponential (with large constant) in time and space.
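
    For reference, the standard time-unbounded definition being invoked here, relative to a fixed universal machine U (a textbook formula, not something from the thread):

        K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}

    i.e. the length in bits of the shortest program p that makes U output x, with no bound at all on how long that program takes to run.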

  • I find that in a lot of cases, CPU and RAM are relatively cheap, while space is expensive, especially cloud storage, or even storage in general. Even the relatively tiny improvement done by the "-e" option in "xz -9e" can save a lot of money over time, especially with something like Borg Backup, where once the initial data is compressed, subsequent stuff is deduplicated. Even a 1-2% improvement, though it might take massive amounts of CPU time, can be a good thing to have. For example, long term archive
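
    A small sketch of that kind of comparison, assuming xz is on the PATH; the input file name is a placeholder, while -9 and -9e are standard xz presets.

        import os, subprocess

        src = "archive.tar"  # placeholder input file
        for preset in ("-9", "-9e"):
            out = f"{src}.{preset.lstrip('-')}.xz"
            with open(out, "wb") as fh:
                # --stdout leaves the input untouched and lets us name the output.
                subprocess.run(["xz", preset, "--stdout", src], stdout=fh, check=True)
            print(f"xz {preset}: {os.path.getsize(out):,} bytes")
        # The -e ("extreme") variant usually shaves only another percent or so, at a
        # large CPU cost -- which can still pay off for write-once long-term archives.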
