Slashdot is powered by your submissions, so send in your scoop

 



Forgot your password?
typodupeerror
×
AI Data Storage

Sixth 'Hutter Prize' Awarded for Achieving New Data Compression Milestone (hutter1.net) 64

Since 2006, Slashdot has been covering a contest CmdrTaco once summarized as "Compress Wikipedia and Win." It rewards progress on compressing a 1-billion-character excerpt of Wikipedia — approximately the amount that a human can read in a lifetime.

And today a new record was announced. The 1 billion characters have now been compressed to just 114,156,155 bytes — about 114 megabytes, or just 11.41% of the original size — by Saurabh Kumar, a New York-based quantitative developer for a high-frequency/algorithmic trading and financial services fund. The amount of each "Hutter Prize for Lossless Compression of Human Knowledge" increases based on how much compression is achieved (so if you compress the file x% better you receive x% of the prize). Kumar's compression was 1.04% smaller than the previous record, so they'll receive €5187.

But "The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI," said Marcus Hutter (now a senior researcher at Google DeepMind) in a 2020 interview with Lex Fridman.

17 years after their original post announcing the competition, Baldrson (Slashdot reader #78,598) returns to explain the contest's significance to AI research, starting with a quote from mathematician Gregory Chaitin — that "Compression is comprehension."

But they emphasize that the contest also has one specific hardware constraint rooted in theories of AI optimization: The Hutter Prize is geared toward research in that it restricts computation resources to the most general purpose hardware that is widely available. Why? As described by the seminal paper "The Hardware Lottery" by Sara Hooker, AI research is biased toward algorithms optimized for existing hardware infrastructure. While this hardware bias is justified for engineering (applying existing scientific understanding to the "utility function" of making money) to quote Sara Hooker, it "can delay research progress by casting successful ideas as failures."

The complaint that this is "mere" optimization ignores the fact that this was done on general purpose computation hardware, and is therefore in line with the spirit of Sara Hookers admonition to researchers in "The Hardware Lottery". By showing how to optimize within the constraint of general purpose computation, Saurabh's contribution may help point the way toward future directions in hardware architecture.

This discussion has been archived. No new comments can be posted.

Sixth 'Hutter Prize' Awarded for Achieving New Data Compression Milestone

Comments Filter:
    • I'm just spit ballin', but it probably involved:

      1. The rate at which an average human reads characters/time.
      2. The average waking life span of a human. (Knock off 6 years so you have time to learn to read)

      Then multiply those two? Maybe I'm crazy though...

    • by godrik ( 1287354 )

      I dont know.
      I read at about a page a minute (novel type). That's about 300 words per page, so maybe 4 characters per word in average.
      The average man lives to 73. take 3 out to learn to read. I probably can read about 14 hours a day. (You need to sleep, shower, eat.) that's:
      70*365*14*60 = 21 * 10^6 minutes
      So 300 * 4 * 21 Million = 25 Billion characters.

      But that is a "I don't do anything else than read" type of estimate.

      • I dont know. I read at about a page a minute (novel type). That's about 300 words per page, so maybe 4 characters per word in average. The average man lives to 73. take 3 out to learn to read. I probably can read about 14 hours a day. (You need to sleep, shower, eat.) that's: 70*365*14*60 = 21 * 10^6 minutes So 300 * 4 * 21 Million = 25 Billion characters.

        But that is a "I don't do anything else than read" type of estimate.

        But no one ever in the history of all humanity has had a life like that, spending 14 hours a day reading for 70 years. So the 1 billion being "approximately the amount that a human can read in a lifetime." is probably a lot closer to reality.

  • I thought AI was going to take all jobs. You would think AI would be able to find a compression algorithm much better than a human since it can run endless guess and test iterations.

    • by Rockoon ( 1252108 ) on Monday July 24, 2023 @01:07AM (#63710370)
      The last thing data compression needs, is endless random guesses.

      The statistics are all there, in the file being compressed. No need to guess. The data is finite.

      The whole tie in to AI is that the data isnt arbitrary, its statements in a language, and the algorithms they are developing are not for arbitrary data, they are for specific statements in that language.

      Its correlation all the way down. Find useful correlations, encode them into a model, rinse, repeat, until you can find no more within the constraints. All the research is in the encoding of these models, not the exact encoding of the file (entropy compression is solved and now unpatented, entropy modelling isnt)
  • by Excelcia ( 906188 ) <slashdot@excelcia.ca> on Sunday July 23, 2023 @11:41PM (#63710278) Homepage Journal

    The 1 billion characters have now been compressed to just 114,156,155 bytes — about 114 megabytes

    No, that's not "about 114 megabytes". That's a little under 109 megabytes. Get it right.

    • At least they got the ratio right, since the original text was "1 billion characters" and not "1 GB".

      • by tlhIngan ( 30335 )

        At least they got the ratio right, since the original text was "1 billion characters" and not "1 GB".

        1 character is not equal to one byte. 1 billion characters can be around 875MB or so if we are using ASCII, for example and packing them in by putting 8 characters in 7 bytes.

        Or, if we want to be wasteful, they can be 4 bytes using UTF-32 encoding, making it around 4GB or so. Wikipedia supports Unicode, after all.

        Or depending on the character sets, 1 billion characters can be around 1GB using variable length

        • English Wikipedia is UTF-8, so you're right - 1 gigabute is less than 1 billion characters assuming English Wikipedia has any characters at all that require more than 1 byte to represent, and that the contest organisers haven't made the effort to eliminate multi-byte characters in the test data.

    • by Roger W Moore ( 538166 ) on Monday July 24, 2023 @01:10AM (#63710378) Journal
      Technically it is about 114 megabytes but 109 mebibytes...which I know makes it sounds like you are uncertain whether they are bytes or not and I hate these new prefixes because they sound stupid but technically the OP was correct, the best type of correct.
      • by pjt33 ( 739471 )

        It's only "technically correct" in the context of standards written for the bodies that have standardised it to mean that. In general conversation words mean what the participants in the conversation agree that they mean.

        • Yeah, like how we all agreed that "pjt33" means "fun at parties".

          I kid, I kid... but this isn't general conversation since it's intended for a wide (well, Slashdot-wide) audience, and we don't have to pretend to be linguists pontificating about the abuse of language on the mean streets.

          (Meaning of a word != its standard definition) only works when nobody except the involved parties is going to see/hear it.

          Also, the text to be compressed is 1,000,000,000 characters, or one gigabyte. Mixing the units between

          • by pjt33 ( 739471 )

            Forget the mean streets: probably a substantial portion of the /. audience internalised gigabyte to mean 2^30 bytes before the IEC invented the gibi- prefix.

      • by Hodr ( 219920 )

        Unless and until base-10 computation becomes the standard and a byte becomes 10-whatevers a megabyte will always be 1,048,576 bytes.

        The IEC is full of shit and bought off by the storage companies.

        • You are mixing up "bytes" with "words".

          A byte can not become a 10bit unit. It is 8 bits, and traditionally could be either an 8bit byte or a 7bit character, which is now a byte, too.

          Words can be arbitrarily sized, but usually the length is dividable by 2.

      • by AmiMoJo ( 196126 )

        The JEDEC standard defines 1 megabyte as 1024 kilobytes or 1,048,576 bytes. The IEC can fuck off with their mebibyte bullshit. In fact whenever possible I like to sow confusion by having 1MiB=1,000,000 bytes.

        • JEDEC doesn't rule the world or indeed have much influence on anything except RAM.

          Memory buses make powers of two natural, but that doesn't apply to things like data storage or communications. It's a daft artefact of history that mega means different things in very close but context dependent places.

          As for confusion, why not just use the megbi bytes used in floppy disks. 1024*1000 is the best possible compromise.

          • by AmiMoJo ( 196126 )

            Not just RAM, all kinds of memories. Flash chips and EEPROMs too.

            Windows uses proper megabytes for all file and disk sizes too. Microsoft did it right for once.

            • Memories, such as flash etc sure, but not disks or tapes. Except flash now has so much dead space for wear levelling and discarding of dead blocks that while internally the size is a nice near power of two, externally you would see somewhat less than that because it can't present the whole lot to you. Talking obviously not about raw flash chips here, but flash drives.

              But for data transfer, the normal 10 based system makes more sense, because you talk about buses in MHz (or GHz) and mega/giga transfers per s

              • by AmiMoJo ( 196126 )

                Hmm, I don't know man. Bus speed is often given in gigabytes per second, and at least when I'm writing it I use powers of 2 for that. Baud has to be converted to bytes for many practical purposes, although granted for RS232 it's a division of 10.

                I'm all for metric/SI but 34.359738368 """gigabytes""" of RAM?

                • Just enjoying the low stakes nature of this by the way.

                  Bus speed is often given in gigabytes per second, and at least when I'm writing it I use powers of 2 for that.

                  Depends on the bus. PCIe is GT/s which is 4*10^9 kibibytes per second. Got some real 3.5" floppy energy there.

                  Baud has to be converted to bytes for many practical purposes, although granted for RS232 it's a division of 10.

                  yep, but then it's 1/10 of the baud rate, for bytes per second, so if the baud is 1,000,000, that's 100kib/s :)

                  I'm all for m

                  • by AmiMoJo ( 196126 )

                    You know, I could have sworn that audio cassette storage capacity was given in powers of 2 back in the day.

                    Floppy disks too. While some used powers of 10, most of the early ones were powers of 2. For example, the Amiga 880k format stores 90,1120 bytes, or exactly 880 real kilobytes. It made sense to express it that way because the disk sectors were 512 bytes.

                    Same with hard drives. The ST505 had 512 bytes per sector, 17 sectors per track, 306 tracks, and 4 heads, or 10,653,696 bytes of usable storage, which

                    • It was the 3.5" drive: 1.44 * 1024 * 1000 bytes. The pokemon approach to suffixes.

                    • by AmiMoJo ( 196126 )

                      Indeed. They should really have invented two new words, rather than trying to redefine "kilobyte" to mean something new. Whenever you see megabyte or MB, you have no real way of knowing what it actually means. Is it the new power of 10 type, or is it JEDEC, or is it just old?

                    • At least with "mebibyte" you know :D

    • They're also annoying nitpicky about what they'll accept a a solution. For example when I submitted my 1 gigabyte executable that compressed Wikipedia down to a single bit they rejected it based on some silly nonsense reason I can't remember.
      • Thanks for your comment. I truly laughed out loud and enjoyed it!

      • by Baldrson ( 78598 ) *

        You didn't submit an executable archive of enwik9 purported to expand into a file that matched bit for bit. While you also failed in some other ways to qualify, that is now the first bar you must clear, before any further investment by the judging committee.

        Or are you Yann LeCun out to pull my leg again?

        https://twitter.com/ylecun/sta... [twitter.com]

    • You are thinking of *mebibytes*
      https://tecadmin.net/differenc... [tecadmin.net]

  • If you're not alarmed when you see text with so many words emphasized, you may be on the wrong side of that divide. People who sprinkle bold (and worse, color) all over their texts to make a point usually do so because they feel they're not being understood, and usually for good reason.

  • Lossless compression, while obviously of great practical use, seems like a profoundly bad fit for a 'compression is comprehension' objective.

    There are obvious reasons to demand lossless for a variety of practical compression applications; but in the context of 'comprehension' requiring losslessness is essentially identical to forbidding comprehension; since losslessness implies that all bits are of equal importance, since you have to be able to reproduce them all; while everything we've seen that looks l
    • So what you are saying is we could still read wikipedia if th chratrs wer nt al der jus fin.

      No one wants to do that even if we can read it.

      • At least 2/3rds of your sentence is fully reconstructable by running it through a context free spell checker. A contextual based spell checker would be able to reconstruct the entire sentence.

        Basically if you're able to understand what you just wrote then we are capable of reconstructing it so you don't need to worry about missing characters, preserving both the comprehension while also objectively being lossy in its compressive form.

        The issue comes more with edge cases, e.g. where a missing comma could cha

        • But would that necessarily improve the compression ratio? Presumably, the more repetition you remove the higher the entropy. The gains might be nil or insignificant.
    • The fact that you are missing is that these compressors have a model that generates probabilities, and then merely (trivially) encode the sequence of symbols of a given probability optimally.

      The very thing you think these compressors are missing, they are doing internally better than you can imagine.

      Requiring replication of the file is a measurement of the efficiency of the model it is using. Its the step that turns the model probabilities into knowledge. If the probabilities the model is giving are wro
      • by nuntius ( 92696 )

        I believe that OP was getting at the difference between memorization/knowledge and understanding/comprehension.
        If you memorize a sentence, you can repeat it many times.
        If you understand a sentence, you can express it many different ways.
        The most efficient encoding may capture concepts and require extra bits to maintain specific word choices and sentence structures.

        • Unfortunately these discussions are counterproductive until the OP educates himself (or herself) on Kolmogorov Complexity theory and the MDL (minimum description length) principle at the very least. The soundbyte "compression is comprehension" is intended to summarize this area of science.
    • by AmiMoJo ( 196126 )

      The idea is that by finding more ways in which data correlates, and therefore can be reduced to just the correlations rather than the actual data, the amount of general intelligence needed to understand it can be reduced too.

      One of the big problems in AGI is what we might call common sense. Humans has a lot of life experience to draw on, they know how the world works on a deep level, to the point where they mostly don't have to consciously think about it. For AGI to have that understanding it was thought th

  • 1 billion characters

    Characters? Am I the only one who uses Wikipedia exclusively for the pictures and audio samples?!

  • by rklrkl ( 554527 ) on Monday July 24, 2023 @03:05AM (#63710496) Homepage

    This reminds me of Pied Piper's compression algorithm in the excellent HBO comedy "Silicon Valley" - I wonder if that was inspired by the Hutter Prize? One big mistake they made early on with the Hutter Prize was not insisting that all contestants make their entries Open Source.

    Considering that later winners often built on the work of earlier winners, this caused a slowdown in winning entries. Fortunately, they corrected this mistake and the latest winner did indeed build on the work of a previous winner thanks to the latter's source code being available. This is a classic example where being closed source severely hinders progress.

    • by Baldrson ( 78598 ) *

      One big mistake they made early on with the Hutter Prize was not insisting that all contestants make their entries Open Source.

      IIRC, only one entry was closed source. You may be thinking of Matt Mahoney's Large Text Compression Benchmark [mattmahoney.net] where the top contender is frequently closed source.

      That the machine learning world has yet to recognize lossless compression as the most principled loss function is a tragedy, but it is due to a lot more than that entry. This failure stretches back to when Solomonoff's

  • Shouldn't it really be "Huter Prize" -- which is 8.3% shorter? :-)

  • by EdZep ( 114198 ) on Monday July 24, 2023 @09:05AM (#63711038)

    ... as we learned in Silicon Valley

  • ...and just report the Weissman score already!

"It's the best thing since professional golfers on 'ludes." -- Rick Obidiah

Working...