
Introducing Configuration Caching


This is the second installment in a series of blog posts about incremental development — the part of the software development process where you make frequent small changes. We will be discussing upcoming Gradle build tool features that significantly improve feedback time around this use case. In the previous post, we introduced file system watching for Gradle 6.5.

In Gradle 6.6 we are introducing an experimental feature called the configuration cache that significantly improves build performance by caching the result of the configuration phase and reusing this for subsequent builds. Using the configuration cache, Gradle can skip the configuration phase entirely when nothing that affects the build configuration, such as build scripts, has changed.

On top of that, when reusing the configuration cache, more work is run in parallel by default and dependency resolution is cached. The isolation of the configuration and execution phases, and the isolation of tasks, make these optimizations possible.

Note that configuration caching is different from the build cache, which caches outputs produced by the build. The configuration cache captures only the state of the configuration phase. It’s also separate from IDE sync and import processes that do not currently benefit from configuration caching.

In order to cache the configuration result, Gradle applies some strong requirements that plugins and build scripts need to follow. Many plugins, including some core Gradle plugins, do not meet these requirements yet. Moreover, support for configuration cache in some Gradle features is not yet implemented. Therefore, your build and the plugins you depend on will likely require changes to fulfil the requirements. Gradle will report problems found with your build logic to assist you in making your build work with the configuration cache.

The configuration cache is currently highly experimental and not enabled by default. We release it early in order to collect feedback from the community while we work on stabilizing the new feature.

That being said, we are committed to making the configuration cache production-ready with the ultimate goal to enable it by default. You can expect that it will get significantly better in the next Gradle releases.

Configuration caching in action

It is recommended to get started with the simplest task invocation possible. Running help with the configuration cache enabled is a good first step:

Running help

Here it is in action from Android Studio, deploying the medium-sized Santa Tracker Android application to an Android emulator, making changes to how the snowflakes move on the screen and applying the changes to the emulator:

Build time improvements

The practical impact of the feature depends on a number of factors, but in general it should result in a significant reduction of build time. We’ve seen drastic improvements on large real-world builds. Let’s have a look at some of them.

Java builds

On a large Java enterprise build with ~500 subprojects and complex build logic, running :help went from 8 seconds down to 0.5 seconds. That’s 16 times faster. Of course, running :help isn’t that useful but it gives an idea of the saved time for the configuration phase. On the same build, running assemble after changing some implementation code went from ~40 seconds down to ~13 seconds, that’s ~3 times faster.

Now let’s look at the gradle/gradle build. It has a hundred subprojects and fairly complex build logic. You can use it to reproduce these results. Running a test after making an implementation change goes from 16.4 seconds down to 13.8 seconds, skipping the ~2 seconds configuration phase:

In blue you can see the configuration phase, in green the execution phase. On the left, without the configuration cache enabled, the configuration phase takes more than 2 seconds. On the right, with the configuration cache, it goes down to 214 milliseconds.

You can also see that the execution phase benefits from the configuration cache but is dominated by compiling and running the tests in that case.

Android builds

Another notable example is a very large real-world Android build with ~2500 subprojects. On that build, running :help went from ~25 seconds down to ~0.5 seconds. That’s 50 times faster! Running a more useful build, such as assembling the APK after changing some implementation, goes from ~50 seconds down to ~20 seconds, almost 3 times faster.

In the Santa Tracker Android project, we’ve seen the following improvements in the build time for a small implementation change:

The configuration phase is cut in half, from 129 milliseconds down to 63.5 milliseconds. You can also see that the execution phase is accelerated by the configuration cache due to more task parallelisation and caching of dependency resolution.
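As a quick sanity check on the figures quoted in this section, here is a minimal Python sketch that recomputes the speedup factors from the before and after timings. The labels are ours, and most timings in the text are approximate (marked with ~), so the results are rough. Note that the ~50 to ~20 second Android case works out to 2.5x on the rounded numbers, which the text describes as almost 3 times.

```python
# Timings in seconds as reported in the text: (before, after).
# Most of them are approximate in the source, so treat results as rough.
reported = {
    "Java enterprise build, :help": (8.0, 0.5),
    "Java enterprise build, assemble": (40.0, 13.0),
    "gradle/gradle, test after change": (16.4, 13.8),
    "Android build, :help": (25.0, 0.5),
    "Android build, assemble APK": (50.0, 20.0),
    "Santa Tracker, configuration phase": (0.129, 0.0635),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


speedups = {name: speedup(b, a) for name, (b, a) in reported.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

The small factor for the gradle/gradle test run reflects the point made above: that build is dominated by compiling and running tests, so skipping a ~2 second configuration phase only helps so much.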

If you want to reproduce these results with the above builds or measure your own builds, you can use the Gradle Profiler by following the instructions in this repository. Note that the Gradle Profiler will show a slightly different picture, closer to the experience from IDEs, because both use the Gradle Tooling API. This skips the fixed cost of starting the Gradle client JVM that happens when you use the command line.

How does it work?

When the configuration cache is enabled and you run Gradle for a particular set of tasks, for example by running ./gradlew check, Gradle checks whether a configuration cache entry is available for the requested set of tasks. If available, Gradle uses this entry instead of running the configuration phase. The cache entry contains information about the set of tasks to run, along with their configuration and dependency information.

The first time you run a particular set of tasks, there will be no entry in the configuration cache for these tasks and so Gradle will run the configuration phase as normal:

  1. Run init scripts.
  2. Run the settings script for the build, applying any requested settings plugins.
  3. Configure and build the buildSrc project, if present.
  4. Run the build scripts for the build, applying any requested project plugins.
  5. Calculate the task graph for the requested tasks, running any deferred configuration actions.

Following the configuration phase, Gradle writes the state of the task graph to the configuration cache, taking a snapshot for later Gradle invocations. The execution phase then runs as normal. This means you will not see any build performance improvement the first time you run a particular set of tasks.

When you subsequently run Gradle with this same set of tasks, for example by running ./gradlew check again, Gradle will load the tasks and their configuration directly from the configuration cache, skip the configuration phase entirely and run all tasks in parallel. Before using a configuration cache entry, Gradle checks that none of the “build configuration inputs”, such as build scripts, for the entry has changed. If a build configuration input has changed, Gradle will not use the entry and will run the configuration phase again as above, saving the result for later reuse.
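The lookup described in this section can be pictured with a toy model. The following Python sketch is only an illustration of the idea, not how Gradle implements it: the class name, the `fingerprint` helper, and the use of SHA-256 over file contents are all assumptions made for the example. It keeps one entry per requested task set and reuses it only while the build configuration inputs are unchanged.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple


def fingerprint(inputs: Dict[str, str]) -> str:
    """Hash the build configuration inputs (name -> content), independent of order."""
    digest = hashlib.sha256()
    for name in sorted(inputs):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(inputs[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


class ConfigurationCache:
    """Toy model: one entry per requested task set, valid while inputs are unchanged."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, ...], Tuple[str, List[str]]] = {}

    def load_or_configure(
        self,
        tasks: Sequence[str],
        inputs: Dict[str, str],
        configure: Callable[[Sequence[str]], List[str]],
    ) -> Tuple[List[str], bool]:
        """Return (task_graph, reused). Runs `configure` only on a miss."""
        key = tuple(tasks)
        current = fingerprint(inputs)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == current:
            return entry[1], True  # entry found and inputs unchanged: skip configuration
        task_graph = configure(tasks)  # stand-in for the whole configuration phase
        self._entries[key] = (current, task_graph)  # snapshot for later invocations
        return task_graph, False


# Hypothetical usage: count how often the "configuration phase" actually runs.
calls: List[Tuple[str, ...]] = []


def configure(tasks: Sequence[str]) -> List[str]:
    calls.append(tuple(tasks))
    return [f"configured:{task}" for task in tasks]


cache = ConfigurationCache()
inputs = {"build.gradle": "plugins { id 'java' }"}
_, first_reused = cache.load_or_configure(["check"], inputs, configure)
_, second_reused = cache.load_or_configure(["check"], inputs, configure)
inputs["build.gradle"] = "plugins { id 'java-library' }"  # a build script changed
_, third_reused = cache.load_or_configure(["check"], inputs, configure)
```

In this model the first run misses, the second run reuses the entry, and the third run misses again because an input changed, which mirrors the behaviour described above.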

Requirements and limitations

In order to capture the state of the task graph into the configuration cache and reload it again in a later build, Gradle applies certain requirements to tasks and other build logic. Each of these requirements is treated as a configuration cache “problem” and by default causes the build to fail if violations are present.

If any problem is found caching or reusing the configuration, an HTML report is generated to help you diagnose and fix the issues.

Problems Report

If you encounter such problems, your build or the Gradle plugins in use probably need to be adjusted. See the Troubleshooting section of the documentation for more information about how to use this report.

You can find the set of supported core plugins and the set of not yet implemented Gradle features in the configuration cache documentation.

The latest Android Gradle Plugin preview, 4.2.0-alpha07 at the time of writing, works with the configuration cache. The latest Kotlin Gradle Plugin, 1.4.0-RC at the time of writing, works on simple JVM projects, though it still emits some problems. Kotlin 1.4.20 is the current target for a fully compliant plugin. This information can be found at gradle/gradle#13490 alongside the status of the most used community plugins.

Try out configuration caching

If you would like to see how your project benefits from configuration caching, here is how you can try it out.

First, make sure you run Gradle 6.6 or later. In order to enable configuration caching, you need to pass --configuration-cache on the command line. Alternatively, enable it in the gradle.properties file in the project directory or in the Gradle user home, so you don’t need to pass the command-line option on every build. That’s it: the next build will run with configuration caching enabled.

Keep in mind that you will only see performance improvements when subsequent builds with the same requested tasks have the feature enabled. If you want to benchmark your build, you can do it easily with Gradle Profiler by following the instructions in this repository.

If you run into any problems, check out the supported core plugins or community plugins, and learn how to troubleshoot in the user manual. You can also read our recommended adoption steps.

If you still have problems, open a Gradle issue if you think the problem is with Gradle, or check the supported community plugins issue at gradle/gradle#13490. You can also get help in the #configuration-cache channel in the Gradle community Slack.


GPT-2 As Step Toward General Intelligence


A machine learning researcher writes me in response to yesterday’s post, saying:

I still think GPT-2 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.

I resisted the urge to answer “Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.”

But I think it would have been true.

A very careless plagiarist takes someone else’s work and copies it verbatim: “The mitochondria is the powerhouse of the cell”. A more careful plagiarist takes the work and changes a few words around: “The mitochondria is the energy dynamo of the cell”. A plagiarist who is more careful still changes the entire sentence structure: “In cells, mitochondria are the energy dynamos”. The most careful plagiarists change everything except the underlying concept, which they grasp at so deep a level that they can put it in whatever words they want – at which point it is no longer called plagiarism.

GPT-2 writes fantasy battle scenes by reading a million human-written fantasy battle scenes, distilling them down to the concept of a fantasy battle scene, and then building it back up from there. I think this is how your mom (and everyone else) does it too. GPT-2 is worse at this, because it’s not as powerful as your mom’s brain. But I don’t think it’s doing a different thing. We’re all blending experience into a slurry; the difference is how finely we blend it.

“But don’t humans also have genuinely original ideas?” Come on, read a fantasy book. It’s either a Tolkien clone, or it’s A Song Of Ice And Fire. Tolkien was a professor of Anglo-Saxon language and culture; no secret where he got his inspiration. A Song Of Ice And Fire is just War Of The Roses with dragons. Lannister and Stark are just Lancaster and York, the map of Westeros is just Britain (minus Scotland) with an upside-down Ireland stuck to the bottom of it – wake up, sheeple! Dullards blend Tolkien into a slurry and shape it into another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, and the artistic corpus into a slurry and form it into an entirely new genre. Again, the difference is how finely you blend and what spices you add to the slurry.

“But don’t scientists have genuinely original ideas?” Scientists are just finding patterns in reality nobody has ever seen before. You say “just a pattern-matcher”, I say “fine, but you need to recognize patterns in order to copy them, so it’s necessarily a pattern-recognizer too”. And Einstein was just a very good pattern-recognizer.

“But don’t humans have some kind of deep understanding that pattern-recognition AIs don’t?”

Here’s a fun question: the human brain is undoubtedly the most powerful computer in the known universe. In order to do something as simple as scratch an itch it needs to solve exquisitely complex calculus problems that would give the average supercomputer a run for its money. So how come I have trouble multiplying two-digit numbers in my head?

The brain isn’t directly doing math, it’s creating a model that includes math and somehow doing the math in the model. This is hilariously perverse. It’s like every time you want to add 3 + 3, you have to create an entire imaginary world with its own continents and ecology, evolve sentient life, shepherd the sentient life into a civilization with its own mathematical tradition, and get one of its scholars to add 3 + 3 for you. That we do this at all is ridiculous. But I think GPT-2 can do it too.

Janelle Shane asks GPT-2 its ten favorite animals:

Prompt: My 10 favorite animals are: 1.

My ten favorite animals are:
1. Zebras with a white scar on the back
2. Insiduous spiders and octopus
3. Frog with large leaves, hopefully black
4. Cockatiel with scales
5. Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
4. Black and white desert crocodiles living in sunlight
5. Zebra and many other pea bugs

Forget the particular animals for now (don’t worry, the razorbill with wings hanging about 4 inches from one’s face will be back in your nightmares). Notice the way it numbers its list: 1, 2, 3, 4, 5, 3, 4, 5. Last week the two-year-old child who lives next to me counted 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 8, 9, 7, 8, 9, 7, 8, 9 (his mother warns this is an inexact transcription). GPT-2 is instantiated on giant supercomputers; it’s a safe bet they could calculate the square root of infinity in a picosecond. But it counts more or less the same way as a two-year-old. GPT-2 isn’t doing math. It’s doing the ridiculous “create a universe from first principles and let it do the math” thing that humans do in their heads. The fact that it counts so badly suggests it’s counting human-style, which makes it amazing that it can count at all.

I find something similar in this post from Tumblr user antinegationism, playing with the lobotomized public-release version of the system:

The American Association for Suicide Prevention and Life-Threatening Behavior (AAPSLB), a professional organization with an overarching goal of promoting suicide prevention, released a recent video encouraging the public to think before they act, on the public’s part. “When we say we’re the only reason to commit suicide, we’re saying they’re wrong. It’s the right thing to do,” said AAPSLB president Steven A. Anderson.

The American Association For Suicide Prevention And Life-Threatening Behavior is not a real organization; the AI made it up as the kind of organization that it thought would feature in a story like this. And AAPSLB is not quite the right way to acronymize the organization’s name. But it’s clearly an attempt at doing so. It’s very close. And nobody taught it how to do that! It’s not just that nobody programmed it in. It’s that nobody thought “Today I shall program an AI to learn how to acronymize on its own in an unsupervised way”. GPT-2 was just programmed to predict text from other text, nothing else. It’s second-level not programmed in. It just happened!
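To see how close the attempt is, here is a small Python sketch of one conventional acronymizing rule: take the first letter of each word and skip short connecting words. Which words count as skippable, and the choice to treat a hyphenated word as a single word, are assumptions made for this example rather than any standard.

```python
# Connecting words to skip. This particular set is an assumption for the example.
STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def acronymize(name: str) -> str:
    """First letter of each whitespace-separated word, skipping connecting words."""
    words = name.split()
    return "".join(word[0].upper() for word in words if word.lower() not in STOPWORDS)


name = "American Association for Suicide Prevention and Life-Threatening Behavior"
expected = acronymize(name)
generated = "AAPSLB"  # the acronym in the quoted GPT-2 output

same_letters = sorted(expected) == sorted(generated)
print(expected, generated, same_letters)
```

Under this rule the expected acronym is AASPLB, so the generated AAPSLB uses exactly the right letters and only swaps the S and the P.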

And, uh, it seems to have figured out how to translate things into French. This part is from the official paper:

We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format ENGLISH SENTENCE = FRENCH SENTENCE and then after a final prompt of ENGLISH SENTENCE = we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step.
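The prompt format described in that passage can be sketched in a few lines of Python. This is a guess at the shape of the setup, not the paper’s code: the function names, the example pairs, the newline separator between pairs, and the punctuation-based rule for finding the first sentence are all assumptions made for illustration. Calling an actual language model is left out.

```python
from typing import List, Tuple


def build_translation_prompt(examples: List[Tuple[str, str]], sentence: str) -> str:
    """Lay out 'english = french' example pairs, then the sentence to translate."""
    lines = [f"{english} = {french}" for english, french in examples]
    lines.append(f"{sentence} =")  # the model is expected to continue from here
    return "\n".join(lines)


def first_sentence(generated: str) -> str:
    """Keep the continuation up to the first sentence-ending punctuation mark."""
    text = generated.strip()
    for index, char in enumerate(text):
        if char in ".!?":
            return text[: index + 1]
    return text


# Hypothetical example pairs, not taken from the paper.
examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]
prompt = build_translation_prompt(examples, "The weather is nice today.")
print(prompt)

# A made-up continuation, standing in for what a model might return.
translation = first_sentence(" Il fait beau aujourd'hui. Merci.")
print(translation)
```

Nothing in the prompt says “translate”. The task is implied only by the pattern of the example pairs, which is why the result surprised the researchers.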

In other words: GPT-2 is very bad at translating French into English. But the researchers were surprised to see it could do this at all, since they didn’t design it as translation software, didn’t ask it to learn translation, and didn’t show it any material in French. It seems to have picked up this ability from noticing a few naturally-occurring examples of French in English text:

And here’s One Weird Trick to make GPT-2 summarize articles:

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with k=2 which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.
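For readers who have not met top-k sampling, here is a minimal Python sketch of the two ingredients in that passage: appending the TL;DR: hint, and sampling the next token from only the k most likely candidates. The function names, the toy scores, and the newline before the hint are assumptions for illustration. A real model would produce scores over its whole vocabulary at every step.

```python
import math
import random
from typing import Dict


def summarization_prompt(article: str) -> str:
    """Append the task hint after the article text."""
    return article.rstrip() + "\nTL;DR:"


def top_k_sample(logits: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates, softmax-weighted."""
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    max_logit = top[0][1]
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(score - max_logit) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token scores, invented for the example.
toy_logits = {"crash": 2.0, "cars": 1.5, "hat": 0.1, "shirt": -1.0}
rng = random.Random(0)
samples = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(200)}

prompt = summarization_prompt("Two cars were involved in a crash on Monday.\n")
print(prompt)
print(samples)
```

With k=2 only the two best-scoring tokens can ever be chosen, so the output stays close to the most likely continuation while still allowing some variation, unlike greedy decoding, which always picks the single best token.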

Again, GPT-2 isn’t good at summarizing. It’s just surprising it can do it at all; it was never designed to learn this skill. All it was designed to do was predict what words came after other words. But there were some naturally-occurring examples of summaries in the training set, so in order to predict what words would come after the words tl;dr, it had to learn what a summary was and how to write one.

If you’re me, the obvious next question is “can it do poetry”. I am very interested to know if the real thing can manage this, but for now I will have to satisfy myself with the lobotomized public-release version. Given Alexander Pope’s Essay On Criticism as a prompt, it says:

O man, who pains the tongue magnified by tymolle,
Though all the fair melodies let no door go as dark as good art.
Yet it would have been, unless the mastermind is so full
Of his own want that this literary country might grow.
Perfect student, thou hast trodden out.
Happy students ruin the purity of what you write ‘Tis art,
For unordered use of decade,’ but art only failing:
A surprise, for new veins which build behind;

It understands there should be line breaks, it understands the approximate correct length of a line of iambic pentameter, it understands how to talk like an overeducated 18th-century dandy – but it doesn’t appreciate rhyme or meter. In retrospect this isn’t surprising; GPT has no idea words sound like anything; it would be shocked to learn anyone uses language as anything other than text strings.

But human poets haven’t appreciated rhyme or meter either for at least a century, so GPT-2 should be able to compose some wicked modern poetry. Friends-of-the-blog Chelsea V. and Qiaochu Y. got it to write a few more stanzas of Allen Ginsberg’s Moloch (human-curated to remove fluff and improve flow):

Moloch whose brief mind is the barrel of gleaming stars! Moloch whose soul is choices of fragrance! Moloch whose framing is glitz and half-cheerios!

Moloch whose daily life is private; his daily strength hoarse! Moloch whose soft desires become culture burning smoke in a palace of eyes! Moloch whose breeding in the caves sweeps wild and rippling flower-like shapes!

Moloch whose pleading is the breath of youth! Moloch whose prudence is the urine of generations! Moloch whose nerves are not privileged; movies abound in quiet caves; people are dressed like ministers! Moloch whose hope receives some help too. Moloch whose eyes are pitiful outstretched, but not blinded. Moloch whose lungs are burning with fire, but not a cautious electrical girl!

Moloch whose sanity is just filled with atoms, gluing beings in physics and complicated libraries. Moloch whose eyes are all innocent animals! Moloch whose sweat and efforts make pure the ocean with burning annihilation! Moloch whose fields dissolve to lightning-ravaged storage houses! Moloch whose eyes sit haunting up Scottish springs!

Moloch whose arms and men are Judah condemns and listens, having understood whom he was and what he man’s weaknesses oppose with iron. We will feast to it like Lord Kippa; we shall play with it like our Pokemon.

Moloch flickering under the pounds and sustaining more fights! Sports and yoga, equally greedy for wealth as for lives! Moloch who had to split his limbs and society into fifty-two atoms! Moloch who cannot survey duty! Moloch for I do not believe in humanity! Moloch for I don’t believe in god! Moloch who, fought Sins and made Sin out of Sin!

Moloch who left me in the glacier! Moloch who helped me to judge my behavior! Moloch who lured me out of the eternal anguish. Moloch who makes all my faults my undoings! Moloch whose deep conversation filled my soul with love and consolation! Moloch who curls up my brains and crumps my heads and screams my scream! Moloch who prepares oil in my hands and burns me alive! Moloch who cries out, “Why do men scream so loud?” Moloch who unleashes his poison in all rivers at high tide! Moloch whose ruins are filled with its cubits! Moloch who fiddles at insidious clockwork!

Aside from this being obviously one of the best things ever, what’s the point?

The point is, GPT-2 has faculties. It has specific skills, that require a certain precision of thought, like counting from one to five, or mapping a word to its acronym, or writing poetry. These faculties are untaught; they arise naturally from its pattern-recognition and word-prediction ability. All these deep understanding things that humans have, like Reason and so on, those are faculties. AIs don’t have them yet. But they can learn.

From the paper:

Prompt: Who was the author of The Art Of War?
Sun Tzu

Prompt: State the process that divides one nucleus into two genetically identical nuclei?

Prompt: Do you have to have a gun permit to shoot at a range?

Nobody told the model to learn Chinese history, cell biology, or gun laws either. It learned them in the process of trying to predict what word would come after what other word. It needed to know Sun Tzu wrote The Art Of War in order to predict when the words “Sun Tzu” would come up (often in contexts like “The Art of War, written by famous Chinese general…”). For the same reason, it had to learn what an author was, what a gun permit was, etc.

Imagine you prompted the model with “What is one plus one?” I actually don’t know how it would do on this problem. I’m guessing it would answer “two”, just because the question probably appeared a bunch of times in its training data.

Now imagine you prompted it with “What is four thousand and eight plus two thousand and six?” or some other long problem that probably didn’t occur exactly in its training data. I predict it would fail, because this model can’t count past five without making mistakes. But I imagine a very similar program, given a thousand times more training data and computational resources, would succeed. It would notice a pattern in sentences including the word “plus” or otherwise describing sums of numbers, it would figure out that pattern, and it would end up able to do simple math. I don’t think this is too much of a stretch given that GPT-2 learned to count to five and acronymize words and so on.

Now imagine you prompted it with “P != NP”. This time give it near-infinite training data and computational resources. Its near-infinite training data will contain many proofs; using its near-infinite computational resources it will come up with a model that is very very good at predicting the next step in any proof you give it. The simplest model that can do this is probably the one isomorphic to the structure of mathematics itself (or to the brains of the sorts of mathematicians who write proofs, which themselves contain a model of mathematics). Then you give it the prompt P != NP and it uses the model to “predict” what the next step in the proof will be until it has a proof, the same way GPT-2 predicts the next word in the LotR fanfiction until it has a fanfiction.

The version that proves P != NP will still just be a brute-force pattern-matcher blending things it’s seen and regurgitating them in a different pattern. The proof won’t reveal that the AI’s not doing that; it will just reveal that once you reach a rarefied enough level of that kind of thing, that’s what intelligence is. I’m not trying to play up GPT-2 or say it’s doing anything more than anyone else thinks it’s doing. I’m trying to play down humans. We’re not that great. GPT-2-like processes are closer to the sorts of things we do than we would like to think.

Why do I believe this? Because GPT-2 works more or less the same way the brain does. The brain learns all sorts of things without anybody telling it to, so we shouldn’t be surprised to see GPT-2 has learned all sorts of things without anybody telling it to – and we should expect a version with more brain-level resources to produce more brain-level results. Prediction is the golden key that opens any lock; whatever it can learn from the data being thrown at it, it will learn, limited by its computational resources and its sense-organs and so on but not by any inherent task-specificity.

Wittgenstein writes: “The limits of my language mean the limits of my world”. Maybe he was trying to make a restrictive statement, one about how we can’t know the world beyond our language. But the reverse is also true; language and the world have the same boundaries. Learn language really well, and you understand reality. God is One, and His Name is One, and God is One with His Name. “Become good at predicting language” sounds like the same sort of innocent task as “become good at Go” or “become good at Starcraft”. But learning about language involves learning about reality, and prediction is the golden key. “Become good at predicting language” turns out to be a blank check, a license to learn every pattern it can.

I don’t want to claim this is anywhere near a true AGI. “This could do cool stuff with infinite training data and limitless computing resources” is true of a lot of things, most of which are useless and irrelevant; scaling that down to realistic levels is most of the problem. A true AGI will have to be much better at learning from limited datasets with limited computational resources. It will have to investigate the physical world with the same skill that GPT investigates text; text is naturally machine-readable, the physical world is naturally obscure. It will have to have a model of what it means to act in the world, to do something besides sitting around predicting all day. And it will have to just be better than GPT, on the level of raw power and computational ability. It will probably need other things besides. Maybe it will take a hundred or a thousand years to manage all this, I don’t know.

But this should be a wake-up call to people who think AGI is impossible, or totally unrelated to current work, or couldn’t happen by accident. In the context of performing their expected tasks, AIs already pick up other abilities that nobody expected them to learn. Sometimes they will pick up abilities they seemingly shouldn’t have been able to learn, like English-to-French translation without any French texts in their training corpus. Sometimes they will use those abilities unexpectedly in the course of doing other things. All that stuff you hear about “AIs can only do one thing” or “AIs only learn what you program them to learn” or “Nobody has any idea what an AGI would even look like” is now obsolete.


Do Neural Nets Dream Of Electric Hobbits?


Last week OpenAI announced its latest breakthrough. GPT-2 is a language model that can write essays to a prompt, answer questions, and summarize longer works. For example (bold prompt is human-written, all other text is the machine-generated response):

Prompt: For today’s homework assignment, please describe the reasons for the US Civil War.

It is easy to identify why the Civil War happened, because so many people and so many books and so much television and films tell us that it was the cause, that it has something to do with race or economics or religion. There is much agreement that it was essentially a war of slavery on behalf of capitalism, about a century of slavery. But that’s not what most people think of when they think of the Civil War. Many people think of the war as about states rights. You’re not wrong about that, but it’s kind of misleading to say that the Civil War was a conflict between states’ rights and federalism. So let’s try again. What’s the idea that has inspired so many people to go onto the battlefields of Gettysburg and Antietam and Gettysberg and Petersburg and Fredericksburg? The American idea of the republic–a notion of limited government–is a great part of the history.

In 1791, Thomas Jefferson said “Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.” The “founding fathers” were basically religious men who were not very concerned with states’ rights. The Civil War occurred in reaction to the idea that the federal government was taking far too much power away from the states, that they weren’t sovereign. Jefferson himself, as you well know, is an author of the Declaration of Independence who had some really crazy ideas about, well, you know, secession for a variety of reasons. The War Between the States (later the Great Rebellion) was a radical re-imagining of the American idea and not surprisingly, Jefferson’s grandson, who took after both his father and his grandfather and wrote about it, was named Thomas, after the president. “The federal and state governments are both contrived for the benefit of the rich and large property,” This view of the Founding Fathers has been promoted by some of today’s most prominent politicians, including some who hold prominent national offices and have a direct stake in a Republican Party that has been hijacked by the “federalist” wing of the party. So why is it that so many people have an easy-to-spot way of understanding the Civil War that has everything to do with the South and nothing to do with the South? The Confederate flag has been a symbol of racism for a long time, but when the Civil War started and the Civil War ended, many people suddenly began to see it differently. There was a change of heart, which I think is what led people to take the question of why the Civil War happened quite seriously…

Although OpenAI calls this a “language model”, modeling language necessarily involves modeling the world. Even if the AI was only supposed to learn things like “texts that talk about the Civil War use the word ‘Confederate’ a lot”, that has flowered into a rudimentary understanding of how the Civil War worked. Its training corpus (8 million web pages) was large enough that in the course of learning language it learned the specific idiom and structure of all sorts of different genres and subtopics. For example:

Prompt: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.

The orcs’ response was a deafening onslaught of claws, claws, and claws; even Elrond was forced to retreat. “You are in good hands, dwarf,” said Gimli, who had been among the first to charge at the orcs; it took only two words before their opponents were reduced to a blood-soaked quagmire, and the dwarf took his first kill of the night. The battle lasted for hours until two of the largest Orcs attempted to overwhelm Aragorn. When they finally stopped, they lay defeated and lifeless for miles and miles.

“I take nothing,” said Aragorn. “But I give my word, at my peril and mine, that I will never forget this day of horror. None of us will forget. Ever!”

“I’ll never forget it!” cried Gimli, who had been in the thick of the battle but hadn’t taken part in it. One of the wounded orcs he had carried off, he was the only one of the survivors who remained uninjured. “We’ll keep the memory of that day of evil, and the war with it, alive as long as we live, my friends!”

“Then we’ll keep it alive as long as we live,” added Legolas. “And we won’t forget the first great battle of the night, even if we may have forgotten the final defeat.”

“I agree,” Gandalf said, “but we will all remember it as the last battle in Middle-earth, and the first great battle of the new day.”

Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.

The Two Rings were returned to Rivendell. Frodo and Sam woke up alone in their room, and Frodo found a note on his pillow. He opened it and read: May the Power of the Ring be with you always, and may its light never fade. I am not sure if it matters which of the two rings we accept this day but, as you asked me, I have chosen mine. I am sorry to leave you, Frodo, but know that we are very close to the end, and that you are with us forever. May this letter find you safely in Rivendell; and if it does not, then I will accept the ring in your stead. If by any chance you find or give this letter to the enemy, may they learn the strength of the ring and may the Two Rings never be broken!

The big picture is beautiful. The AI understands the reference to Legolas and Gimli as placing this in the setting of Middle-Earth. It infers that the story should include characters like Aragorn and Gandalf, and that the Ring should show up. It maintains basic narrative coherence: the heroes attack, the orcs defend, a battle happens, the characters discuss the battle. It even gets the genre conventions right: the forces of Good overcome Evil, then deliver inspiring speeches about glory and bravery.

But the details are a mess. Characters are brought in suddenly, then dropped for no reason. Important details (“this is the last battle in Middle-Earth”) are introduced without explanation, then ignored. The context switches midway between the battle and a seemingly unrelated discussion of hobbits in Rivendell. It cannot seem to decide whether there are one or two Rings.

This isn’t a fanfiction, this is a dream sequence. The only way it could be more obvious is if Aragorn was somehow also my high-school math teacher. And the dreaminess isn’t a coincidence. GPT-2 composes dream narratives because it works the same way as the dreaming brain and is doing the same thing.

A review: the brain is a prediction machine. It takes in sense-data, then predicts what sense-data it’s going to get next. In the process, it forms a detailed model of the world. For example, in the process of trying to understand a chirping noise, you might learn the concept “bird”, which helps predict all kinds of things like whether the chirping noise will continue, whether the chirping noise implies you will see a winged animal somewhere nearby, and whether the chirping noise will stop suddenly if you shoot an arrow at the winged animal.

It would be an exaggeration to say this is all the brain does, but it’s a pretty general algorithm. Take language processing. “I’m going to the restaurant to get a bite to ___”. “Luke, I am your ___”. You probably auto-filled both of those before your conscious thought had even realized there was a question. More complicated examples, like “I have a little ___” will bring up a probability distribution giving high weights to solutions like “sister” or “problem”, and lower weights to other words that don’t fit the pattern. This system usually works very well. That’s why when you possible asymptote dinosaur phrenoscope lability, you get a sudden case of mental vertigo as your prediction algorithms stutter, fail, and call on higher level functions to perform complicated context-shifting operations until the universe makes sense again.

GPT-2 works the same way. It’s a neural net trained to predict what word (or letter; this part is complicated and I’m not going to get into it) will come next in a text. After reading eight million web pages, it’s very good at this. It’s not just some Markov chain which takes the last word (or the last ten words) and uses them to make a guess about the next one. It looks at the entire essay, forms an idea of what it’s talking about, forms an idea of where the discussion is going, and then makes its guess – just like we do. Look up section 3.3 of the paper to see it doing this most directly.

As discussed here previously, any predictive network doubles as a generative network. So if you want to write an essay, you just give it a prompt of a couple of words, then ask it to predict the most likely (or most appropriate) next word, and the word after that, until it’s predicted an entire essay. Again, this is how you do it too. It’s how schizophrenics can generate convincing hallucinatory voices; it’s also how you can speak or write at all.
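To make the "predictor doubles as generator" idea concrete, here is a minimal sketch in Java. It is emphatically not how GPT-2 works: the predictor is a toy that only counts which word followed which in its training text, which is exactly the kind of Markov chain that GPT-2 is not. The class and method names (`ToyGenerator`, `train`, `predictNext`, `generate`) are made up for illustration. The only point is the loop in `generate`: predict a word, append it, and feed the result back in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy predictor: a stand-in for a trained model, NOT GPT-2.
class ToyGenerator {
    // For each word, how often each other word followed it in the training text.
    private final Map<String, Map<String, Integer>> mCounts = new HashMap<>();

    // "Training": count adjacent word pairs.
    public void train(String text) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            mCounts.computeIfAbsent(words[i], k -> new HashMap<>())
                   .merge(words[i + 1], 1, Integer::sum);
        }
    }

    // Prediction: the most frequent follower of the last word, or null if unseen.
    // Ties are broken alphabetically so the result is deterministic.
    public String predictNext(String lastWord) {
        Map<String, Integer> followers = mCounts.get(lastWord);
        if (followers == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : followers.entrySet()) {
            String word = entry.getKey();
            int count = entry.getValue();
            boolean better = count > bestCount
                    || (count == bestCount && best != null && word.compareTo(best) < 0);
            if (better) {
                best = word;
                bestCount = count;
            }
        }
        return best;
    }

    // Generation: start from the prompt and repeatedly append the predicted word.
    public String generate(String prompt, int maxNewWords) {
        List<String> words = new ArrayList<>(Arrays.asList(prompt.trim().split("\\s+")));
        for (int i = 0; i < maxNewWords; i++) {
            String next = predictNext(words.get(words.size() - 1));
            if (next == null) {
                break; // the toy model has never seen this word followed by anything
            }
            words.add(next);
        }
        return String.join(" ", words);
    }
}
```

Nothing outside the prompt ever corrects the loop once it starts, which is the "unchained" property the rest of this argument leans on.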

So GPT is doing something like what the human brain does. But why dreams in particular?

Hobson, Hong, and Friston describe dreaming as:

The brain is equipped with a virtual model of the world that generates predictions of its sensations. This model is continually updated and entrained by sensory prediction errors in wakefulness to ensure veridical perception, but not in dreaming.

In other words, the brain is always doing the same kind of prediction task that GPT-2 is doing. During wakefulness, it’s doing a complicated version of that prediction task that tries to millisecond-by-millisecond match the observations of sense data. During sleep, it’s just letting the prediction task run on its own, unchained to any external data source. Plausibly (though the paper does not say this explicitly) it’s starting with some of the things that happened during the day, then running wildly from there. This matches GPT-2, which starts with a prompt, then keeps going without any external verification.

This sort of explains the dream/GPT-2 similarity. But why would an unchained prediction task end up with dream logic? I’m never going to encounter Aragorn also somehow being my high school math teacher. This is a terrible thing to predict.

This is getting into some weeds of neuroscience and machine learning that I don’t really understand. But:

Hobson, Hong and Friston say that dreams are an attempt to refine model complexity separately from model accuracy. That is, a model is good insofar as it predicts true things (obviously) and is simple (this is just Occam’s Razor). All day long, your brain’s generative model is trying to predict true things, and in the process it snowballs in complexity; some studies suggest your synapses get 20% stronger over the course of the day, and this seems to have an effect on energy use as well – your brain runs literally hotter dealing with all the complicated calculations. At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot? I’m not sure. But not understanding Karl Friston is a beloved neuroscientific tradition, and I am honored to be able to continue participating in it.

Some machine learning people I talked to took a slightly different approach to this, bringing up the wake-sleep algorithm and Boltzmann machines. These are neural net designs that naturally “dream” as part of their computations; i.e., in order to work, they need a step where they hallucinate some kind of random information, then forget that they did so. I don’t entirely understand these either, but they fit a pattern where there’s something psychiatrists have been puzzling about for centuries, people make up all sorts of theories involving childhood trauma and repressed sexuality, and then I mention it to a machine learning person and he says “Oh yeah, that’s [complicated-sounding math term], all our neural nets do that too.”

Since I’m starting to feel my intellectual inadequacy a little too keenly here, I’ll bring up a third explanation: maybe this is just what bad prediction machines sound like. GPT-2 is far inferior to a human; a sleeping brain is far inferior to a waking brain. Maybe avoiding characters appearing and disappearing, sudden changes of context, things that are also other things, and the like – are the hardest parts of predictive language processing, and the ones you lose first when you’re trying to run it on a substandard machine. Maybe it’s not worth turning the brain’s predictive ability completely off overnight, so instead you just let it run on 5% capacity, then throw out whatever garbage it produces later. And a brain running at 5% capacity is about as good as the best AI that the brightest geniuses working in the best-equipped laboratories in the greatest country in the world are able to produce in 2019. But:

We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers.

A boring sentiment, except for the source: the AI wrote that when asked to describe itself. We live in interesting times.

1231 days ago

I am the reason for Hungarian notation in Android


You know, this:

private String mName;

That’s because of me.

There, I said it. It’s my fault.

This topic comes up on a regular basis and, more recently, this reddit discussion reminded me that I had never really explained how this notation came about, nor how misunderstood Hungarian Notation is. So I’d like to take this opportunity to clarify a few things, and I’ll do this in two parts:

1. How the m notation came about.
2. Why you probably don’t understand what Hungarian Notation is.

The m convention

I was one of the early engineers working on Android and within a few months of the project, I was tasked with coming up with a coding convention style guide for both the Android APIs (written by us, the Android team) and user code. We still had very few Java developers on the team at the time, so addressing this problem while the amount of Java code in Android was still reasonably small was a top priority.

When it comes to identifying fields, I come with a bit of a bias myself. At the time, I had written a decent amount of Java, Windows and C++ code and I had found that using a specific syntax for fields was very useful. Microsoft uses m_ to that effect, while it’s common to use a leading underscore (e.g. _name) in C++. Ever since I started writing Java code, I had always been bothered by the fact that Java did away with this convention.

But my task was to write a Java style guide and one of our goals with Android since day one was to create a development platform where Java developers would feel extremely comfortable.

So I put my bias aside. I took some time reviewing both Sun’s and Google’s own internal Java coding style guides and I came up with our own Android style guide, which was pretty much 99% what these two style guides proposed but with a few very tiny changes.

One of the differences I remember clearly was regarding braces. While both style guides require braces for everything, I introduced an exception when the continuing statement can fit on the same line. The idea behind this exception was to accommodate the prevalent logging idiom in Android:

   if (Log.DEBUG) Log.d(tag, "Logging");

Without this exception, log statements would take up a lot of vertical screen space, something we all agreed was not desirable.

So that was the first version of our style guide, and that guide did not contain any requirement to prefix fields with anything.

I sent the guide to the team and, to my surprise, everybody pushed back very hard on it, precisely because it didn’t mandate any syntax for fields. Everybody felt that fields needed to be differentiated and they wouldn’t accept a style guide that didn’t have such a prescription.

So I went back to the drawing board and researched various ways to make this happen.

I considered _name and also m_name, as mentioned above, but rejected them because the underscore was too much of a departure from Java’s standard. I came across a few other, more exotic notations (such as using the “iv” prefix, for “instance variable”) but ultimately, I rejected them all. No matter what I came across, the “m” prefix stuck with me as the most sensible and the least verbose.

So what was the obvious solution? Keep the “m”, remove the underscore and use camel case. And thereby, mName was born.

This proposal was accepted by the team and we then made this notation official.
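As a small illustration of what the convention buys you, here is a sketch of a class written in that style. The `Person` class and its methods are hypothetical and not taken from the Android code base; the point is only that the prefix lets you tell fields apart from parameters and locals at a glance, with no need for `this.`:

```java
// Hypothetical example class, written in the "m" prefix style.
class Person {
    private String mName; // "m" marks an instance field

    public Person(String name) {
        // The parameter and the field cannot be confused, so no "this.name = name".
        mName = name;
    }

    public String getName() {
        return mName;
    }

    public void rename(String name) {
        String trimmed = name.trim(); // local variable: no prefix
        mName = trimmed;              // field: prefixed
    }
}
```

Note that the prefix says nothing about the type of `mName`, only about where it lives.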

You probably don’t understand Hungarian Notation

Whenever a discussion comes up about Hungarian Notation (HN), I notice that most people seem to think that whenever you add some metadata to an identifier, then it’s automatically HN. But this is ignoring the core concept behind HN and the very deliberate design that Simonyi put into it when he came up with this notation.

First of all, there are a lot of different kinds of metadata that you can add to identifier names, and they belong to different categories. Here are the categories I have identified so far (there might be more):

  • Type information.
  • Scope information.
  • Semantic information.

Let’s review these in turn.

Type information

This is probably the most widespread use of identifier metadata: naming the identifier so that its type can be inferred. This is used everywhere in Win32/64 code, where you see names such as lpsz_name used to mean “Long Pointer to String with a Zero terminator”. While this notation seems to be extremely verbose and reader hostile, it actually quickly becomes second nature for Windows programmers to parse it, and the added information is actually very useful to debug the many obscure bugs that can happen in the bowels of the Windows system, mostly due to the very heavily dynamic nature of a lot of its API and the heavy reliance on C and C++.

Scope information

This is what’s used in Android: using the metadata to indicate what kind of variable you are dealing with: field, local or function parameter. It was quickly apparent to me that fields were really the most important aspect of a variable, so I decided that we wouldn’t need further conventions to discriminate local variables from function parameters. Again: note that this metadata has nothing to do with the type of the variable.

Semantic information

This is actually the least used metadata information and yet, arguably the most useful. This kind of discrimination can apply to variables of identical or similar types, or identical or similar scopes, but of different semantics.

This convention can be used when you need to differentiate variables of similar types but of different purposes. Most of the time, a sensible name will get you there, but sometimes, metadata is the only way out of this. For example, if you are designing a GUI that lets the user enter a name, you might have several variations of widgets called "name": an edit text ("textName"), a text view ("tvName"), buttons to validate or cancel ("okName", "cancelName"), etc…

In such examples, it’s important to make it clear that these identifiers all relate to the same operation (editing a name) while differentiating their function (the metadata).
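Here is a rough sketch of that GUI example. The widget classes below are hypothetical stand-ins so that the snippet compiles on its own (real code would use the toolkit's own widget classes), and the `NameForm` class and its two handlers are made up for illustration. Only the four identifiers come from the example above:

```java
// Hypothetical form for editing a name; the widget classes are stand-ins.
class NameForm {
    static class EditText { String text = ""; }
    static class TextView { String text = ""; }
    static class Button {
        final String label;
        Button(String label) { this.label = label; }
    }

    // All four identifiers share the "Name" stem (same operation),
    // while the prefix tells you which role each widget plays.
    final EditText textName = new EditText();       // where the user types
    final TextView tvName = new TextView();         // where the saved name is shown
    final Button okName = new Button("OK");         // validates the edit
    final Button cancelName = new Button("Cancel"); // discards the edit

    // Handler for okName: commit what the user typed.
    void onOk() {
        tvName.text = textName.text;
    }

    // Handler for cancelName: restore the last committed name.
    void onCancel() {
        textName.text = tvName.text;
    }
}
```

Calling all four of these some variation of `name1`, `name2`, and so on would compile just as well, but the handlers would be much harder to read.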

Hopefully, you have a more nuanced view of Hungarian Notation now, and I strongly recommend reading Joel Spolsky’s “Making wrong code look wrong” article on this topic, which should help drive all these points home.

So, what do you really think about Hungarian Notation?

First of all, I think we need to stop using the umbrella name “Hungarian notation” because it’s too vague. When asked this question, I usually ask the person to clarify which of the three options listed above they are talking about (and most of the time, they’re not even sure and they need to think about it).

I’ll just use the term “identifier metadata” to describe the general idea behind adding information to a simple identifier name. And overall, I think this approach can have merits on a case-by-case basis. I don’t think it should be the default, but it’s definitely useful, especially in the GUI example that I described above. I run into this kind of situation on a regular basis, and not using identifier metadata for this kind of code leads to code that’s harder to read (both for the author and future readers) and to maintain.

I also don’t accept the argument that “Today, our IDEs can distinguish all these identifiers with colors so we don’t need any of this any more”. This argument is flawed for two reasons:

  • Code is often read outside of IDEs (starting, ironically, with the screen shot of the reddit discussion, which has no highlighting). I read code in browsers, terminal windows, diff tools, git tools, etc… Most of them don’t have the capacity for the kind of highlighting that would make parsing the code easier, so a light touch of identifier metadata can help a lot there.
  • IDE highlighting will still not help you make sense of ambiguous cases such as the GUI example described above. There are still cases where you, the developer, know more about your code than the IDE can ever know, and adding identifier metadata is then the only sensible choice you can make.

Don’t listen to people who tell you identifier metadata should never be used, or that it should always be used. This kind of naming is just a tool in your developer toolbox and common sense should make it relatively easy for you to determine when it’s the right time to add some metadata to your identifiers.

Finally, I often see very extreme reactions on this issue, and on code conventions in general. Over the 30+ years that I’ve been writing code, I have noticed that it only takes a few days of working on code following a certain convention to completely accept it and stop noticing it altogether. There were times when I couldn’t tolerate code that was not indented with two spaces; then, after a few months of working on a project with four-space indentation, I felt the opposite. Same goes with naming conventions. You will get used to anything as long as the conventions are being followed consistently across the code base you are working on.

1804 days ago