
Why Match School And Student Rank?


Matt Yglesias’ five-year-old son asks: why do we send the top students to the best colleges? Why not send the weakest students to the best colleges, since they need the most help? This is one of those questions that’s so naive it loops back and becomes interesting again.

To avoid corrupting the youth, we might provide an optimistic answer: anyone can teach addition, any college math major can teach calculus, but it takes a world expert to teach ten-dimensional hypertopology. We want to take the few students smart enough to learn hypertopology and connect them to the few experts smart enough to teach it. But this seems false; most of the classes at top colleges are the same material that gets taught everywhere else; you don’t get into subjects that need world experts until postgrad.

Another answer, still somewhat optimistic: we want to maximize the chance of geniuses doing revolutionary work. If we give a mediocre student the world’s best writing teacher, and a genius a mediocre writing teacher, they might each write a pretty good novel. But if we give a mediocre student a mediocre writing teacher, and the genius the world’s best writing teacher, then we might get one mediocre novel and one work of staggering genius which revolutionizes literature forever. Likewise, if we connect the world’s most talented young scientists to the world’s best science teachers and labs, maybe they’ll cross some threshold of understanding where they can discover a cure for cancer. I think this is the best explanation that sticks to optimistic prosocial answers.

(Is it true? An oft-cited paper, Dale and Krueger, appears to find that, controlling for applicant characteristics, people who attend more selective colleges don’t earn more money later in life. Here’s a gesture at a challenge to these results, apparently supported by Dale and Krueger themselves, though I can’t find any more information. Earnings are a poor proxy for “teaches better” - it would be great to have something like value-add to GRE scores - but AFAIK no study like that exists.)

What if we’re more cynical, and believe in the signaling theory of education?

We could think of “the best college” as a self-fulfilling prophecy; for whatever reason, one college has gotten a reputation as the one whose signal is most valuable. Everyone naturally tries to get in there; if they fail, they go to the college with the next-best reputation, and so on. The system is stable; the “best” college will keep its reputation (since it gets the best students) and the best students will always want to go to the best college. If, as Matt’s son suggests, all the Ivies started accepting the worst students instead, an Ivy degree would soon become a signal that you’re bad, and employers would stop respecting it.

I heard a fascinating variation of this hypothesis from Matt Christman of Chapo Trap House: elite colleges are machines for laundering privilege.

That is: Harvard accepts (let’s say) 75% smart/talented people, and 25% rich/powerful people. This is a good deal for both sides. The smart people get to network with elites, which is the first step to becoming elite themselves. And the rich people get mixed in so thoroughly with a pool of smart/talented people that everyone assumes they must be smart/talented themselves. After all, they have a degree from Harvard!

The most blatant form of this obfuscation: suppose you own a very successful family business. You can leave your son your fortune, you can leave him the business, you can leave him your mansion, but you can’t (directly) leave him an aura of having deserved all these things. What you can do is make a $10 million donation to Harvard in exchange for them accepting your son. Your son gets a Harvard degree, a universally-recognized sign of being a highly meritorious person. Then when you leave him the business, everyone will agree he deserves it. Who said anything about nepotism? Leaving a Harvard graduate in control of your business is an excellent decision!

This happens a little, but I think it mostly isn’t this obvious. More often the transactions are for abstract goods: prestige, associations, favors. The Maharaja of Whereverstan sends his daughter to Harvard so that she appears meritorious. In exchange, Harvard gets the credibility boost of being the place the Maharaja of Whereverstan sent his daughter. And Harvard’s other students get the advantage of networking with the Princess Of Whereverstan. Twenty years later, when one of them is an oil executive and Whereverstan is handing out oil contracts, she puts in a word with her old college buddy the Princess and gets the deal. It’s obvious what the oil executive has gotten out of this, but what does the Princess get? I think she gets the right to say she went to Harvard, an honor which is known to go mostly to the meritorious.

People ask why Harvard admissions can still be bribed or influenced by the rich or well-connected. This is the wrong question: the right question is why they ever give spots based on merit at all. The answer is: otherwise the scheme wouldn’t work. The point of a money-laundering operation is to take in both fairly-earned and dirty money, then mix them together so thoroughly that nobody can tell which is which. Likewise, the point of a privilege-laundering operation is to take in both fairly-earned and dirty privilege, then stamp both with a Harvard degree. “Fairly-earned privilege” means all the brilliant talented ambitious youngsters admitted on the basis of their SAT scores and grades and impressive accomplishments; “dirty privilege” means the kids of various old-money aristocrats, foreign potentates, and ordinary super-rich people. Colleges mix them together, with advantages for both groups.

Is this good or bad? It’s good insofar as it provides a justification for making some elite positions dependent on merit and accessible to anyone, but bad insofar as it helps defend and obfuscate the ones that aren’t. It’s good if you think it’s good for all the elites (meritocratic and otherwise) to know each other and be on the same page; it’s bad if you don’t want them to be (maybe because it helps them oppress people more efficiently).

I expect that without such a system the elites would do their own thing without any concession to merit whatsoever - so maybe it beats the alternative.




The Modern Landscape of Harm




I.

Though life has existed on Earth for billions of years, it’s only in the last few hundred that one form of life (i.e. humans) has thought to worry about the harms it might inflict on other forms of life (i.e. the birds, the bees, and the trees). 

We call this environmentalism. By all appearances, it is a good thing. (The worry, not necessarily every action that follows from that worry.) It’s also a very recent thing. It makes up one part of a general movement to consider the harms caused by our actions. Because this idea is so recent, we struggle to strike the correct balance between massive overreaction to minuscule harms and completely ignoring potential catastrophes. 

The push to more deeply consider the harms caused by our actions, policies, and decisions plays out everywhere, but the difficulties and trade-offs are starkest in the environmental movement. In the past people worried about trade-offs — they appear as early as the Epic of Gilgamesh — but only insofar as it harmed them. If we kill all the forest creatures, what will we eat? If we cut down all the trees what will we build with? Past peoples were fine with massive environmental damage if the benefit was clear. A good example would be the use of fire by the Plains Indians. They were constantly setting fires in order to create vast grazing territory for the bison upon which they relied. Though the constant burning kept trees from growing and presumably killed anything not quick enough to escape, like snakes, it was good for the bison and what was good for the bison was good for the Indian tribes.

Once you start caring about snakes, everything gets significantly more difficult. Certainly the snakes don’t care about us. In fact for 99.9999% of the time life has been on the Earth there was no attempt by any species to mitigate the harm it was causing to the environment. What’s more, during the remaining 0.0001%, 95% of that was spent caring about harms only selfishly. We happen to exist in the 0.000005% of history where we care about the harm we cause even if such harms ultimately benefit us.

Why do we care now when we’ve spent so much time not caring? I think many people would argue that it’s because of our heightened sense of morals. And I’m sure that this is part of it, but I’d argue that it’s the smallest part of it, that other factors predominate.

Of far greater consequence is our desire to signal. Historically we might want to signal health or wealth to encourage people to mate with us. But these days — with both widespread health and more than sufficient wealth — many of our signaling efforts revolve around virtue. There is virtue in not being selfish, of considering the impact our actions have not merely on ourselves but on the world as a whole. But signaling virtue doesn’t indicate a heightened morality, only exercising virtue does, and I fear we do far more of the former than the latter. 

To the extent that we are able to act unselfishly, modern abundance plays a large role there as well. In the past people didn’t worry about the environmental harm caused by their actions because they had no latitude for that worry. A subsistence farmer lacks the time to worry about whether his farming caused long term pollution. If he did decide to worry about it, there was almost certainly very little he could do about it without imperiling his survival. In other words, he did what he had to do and had no room to do otherwise. 

Of all the elements which contribute to this recent increase in care the one I’m most interested in is the expansion in the scale. We’re capable of causing enormous harm: warming the world with carbon dioxide, ravaging the world with nuclear weapons, and transforming the world with omnipresent microplastics. On the flip side, we’re also capable of doing extraordinary things to mitigate those harms. We can spray sulfur dioxide into the upper atmosphere and cool the world down. We can launch powerful lasers into the heavens and (in theory) shoot down nuclear missiles in flight. We can genetically engineer bacteria that eat plastics and release those bacteria into the wild. But all of these things have the potential to cause other, different harms.

Our concern about large scale harms is mirrored by an increase in concern for small scale harms as well. We take offense over minor slights, and attempt to protect our children not only from harm, but also minor discomfort. We spend the majority of our time in climate controlled comfort. Summoning food and entertainment whenever the whim strikes us. Banishing inconvenience at every turn. 

If we decided to graph the recent changes to the harm landscape, we would start by imagining the classic bell curve with frequency on the y-axis and severity on the x-axis. This is what harm looked like historically. We didn’t have the power to cause large harms, and we didn’t have the time and energy to even identify smaller harms. 

Over the last few centuries progress has allowed us to eliminate numerous harms. Starvation is a thing of the past. Violence has markedly declined, along with bullying and other forms of abuse. In effect we’ve whittled down the hump in the middle. As we have done this our ability to both cause and notice harm on the tails has gotten much greater. On the right hand are the catastrophes we’re now capable of causing. On the left hand are snowplow parenting, microaggressions, and cancellations. 

When we pull all of this together it paints quite the picture. The landscape is radically different from what it was in the past. We have created whole new classes of harms. Some are quite large, others are rather small. Our ability both to generate and mitigate harms is greater than it’s ever been, to an extent that’s almost hard to comprehend. What are we to do in this vastly different landscape?

II.

I was already working on this post when a friend sent me the answer. More accurately it was included in a newsletter he recommended I start reading. The newsletter is Not Boring by Packy McCormick. He’s one of those people that in a certain subculture is so well known that people speak about him on a first name basis. I had never heard of him (or if I have, it didn’t stick in my memory). I haven’t been following him long enough to know if he’s mostly right, mostly wrong, or always wrong. (You may notice I left out “always right”. That’s because no one is always right.) The answer to my dilemma came nestled in a link roundup he sent out.

(4) Against Safetyism

Byrne Hobart and Tobias Huber for Pirate Wires

Now, whether we think that an AI apocalypse is imminent or the lab-leak hypothesis is correct or not, by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate. In a way, this makes sense: creating a new technology and deploying it widely entails a definite vision for the future. But a focus on the risks means a definite vision of the past, and a more stochastic model of what the future might hold. Given time’s annoying habit of only moving in one direction, we have no choice but to live in somebody’s future — the question is whether it’s somebody with a plan or somebody with a neurosis.

Call it safetyism. Risk aversion. Doomerism. Call it whatever you want. (We’ll call it safetyism for consistency’s sake). But freaking out about the future, and letting that freakout prevent advancement has become an increasingly popular stance. Pessimists sound smart, optimists make money. Safetyists sound smart, optimists make progress.

Friend [sic] of the newsletter, Byrne Hobart, and Tobias Huber explain why safetyism is both illogical and dangerous. These two quotes capture the crux of the argument:

Obsessively attempting to eliminate all visible risks often creates invisible risks that are far more consequential for human flourishing.

Whether it’s nuclear energy, AI, biotech, or any other emerging technology, what all these cases have in common is that — by obstructing technological progress — safetyism has an extremely high civilizational opportunity cost. [emphasis original]

We worry about the potential risks of nuclear energy, we get the reality of dirtier and more deadly fossil fuels. Often, the downsides created by safetyism aren’t as clear as the nuclear example: “by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate.” While we worry about AI killing us all, for example, millions will die of diseases that AI could help detect or even cure.

This isn’t a call to scream YOLO as we indiscriminately create new technologies with zero regards for the consequences, but it’s an important reminder that trying to play it safe is often the riskiest move of all.

I was being sarcastic when I said that this was the answer, though it’s certainly an answer. I included it, in its entirety, because it illustrates the difficulties of rationally dealing with the new landscape of harm.

To start with I’m baffled by their decision to use “safetyism” as their blanket term for this discussion. Safetyism was coined by Jonathan Haidt and Greg Lukianoff in the book The Coddling of the American Mind. And it’s used exclusively to refer to the increased attention to harm that’s happening on the left end of the graph. When Packy and the original authors appropriate safetyism as their term they lump together the left hand side of the graph with the right. Whether intentional or not, the effect is to smear those people who are worried about the potential catastrophes by lumping them in with the people who overreact to inconsequential harms. I understand why it might have happened, but it reflects a pretty shallow analysis of the issue. 

To the extent that Packy, Hobart, and Huber lump in people worried about AI Risk with people who worry about being triggered, they construct and attack a strawman. As originally used by Haidt and Lukianoff, all people of good sense agree that safetyism is bad. Certainly I’ve written several posts condemning the trend and pointing out its flaws. No one important is trying to defend the left side of the graph. It’s tempting to dismiss Packy et al.’s point because of this contamination, but we shouldn’t. If we dismiss what they’re saying about safetyism and its associated sins, we miss the interesting things they’re saying about the right side of the graph. The side where catastrophe may actually loom. There are some gems in that excerpt and some lingering errors. Let’s take Packy’s two favorite quotes:

Obsessively attempting to eliminate all visible risks often creates invisible risks that are far more consequential for human flourishing.

Whether it’s nuclear energy, AI, biotech, or any other emerging technology, what all these cases have in common is that — by obstructing technological progress — safetyism has an extremely high civilizational opportunity cost.

Starting with the errors. Those people who are concerned with large catastrophic risks are not “Obsessively attempting to eliminate all visible risks”. This is yet another straw man. What these people have recognized is that our technological power has vastly increased. The right end of the curve has gotten far bigger. This has increased not only our ability to cause harm, but also our ability to mitigate that harm.

As an example, we have the power to harness the atom. Yes, some people are trying to stop us from doing that even if we want to safely harness it to produce clean energy. They can do that because it turns out that the same progress which gave us the ability to build nuclear reactors also gave us the awesome and terrible government bureaucracy which has regulated them into non-existence. What I’m getting at, is that if we’re just discussing potential harm and harm prevention we’re missing most of the story. This is a story of power. This is a story about the difference between 99.9999% of history and the final 0.0001%. And the question which confronts us at the end of that history: How can we harness our vastly expanded power?

Packy urges us to be optimistic and to embrace our power. He contends that as long as we have a plan we will overcome whatever risks we encounter. This is farcical for three reasons:

    1. Planning for the future is difficult (as in bordering on impossible).
    2. There is no law of the universe that says risks will always be manageable.
    3. Everyone has a different plan for how our power should be used. There’s still a huge debate to be had over which path to take.

There is no simple solution to navigating the landscape of harm. No obvious path we can follow. No guides we can rely on. We have to be wise, exceptionally so. Possibly wiser than we’re capable of.

I understand that offering the advice “Be wise!” is as silly as Packy saying that they’re not advising “zero regard,” they’re advising some regard. How much? Well, not zero… You know, the right amount of regard. 

So let me illustrate the sort of wisdom I’m calling for with an example. Hobart and Huber assert:

Now, whether we think that an AI apocalypse is imminent or the lab-leak hypothesis is correct or not, by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate.

Let’s set aside discussion of AI apocalypses, there’s been quite enough of that already, and examine the lab-leak hypothesis. I’m unaware of anyone using the possibility of a lab-leak to urge that all biotechnology be shut down. If someone is, then the “wise” thing to do would be to ignore them. On the other hand there are lots of people who use the lab-leak possibility to urge a cessation of gain of function research. Is not this “wise”? I have seen zero evidence that gain of function research served a prophylactic role with COVID or any other disease for that matter. Would it not then be wise to cease such research?

Yes, gain of function research might yet provide some benefit. And the millions of people who died from COVID might not stem from a lab-leak. We have two “might”s, two probabilities. And it requires wisdom to evaluate which is greater. It requires very little wisdom to lump the lab-leak hypothesis in with the AI apocalypse and then gesture vaguely towards invisible risks and opportunity costs. To slap a label of “safetyism” or “doomerism” on both and move on. We need to do better.

I admit that I’ve used a fairly easy example. There are far harder questions than whether or not to continue with gain of function research. But if we can’t even make the right decision here, what hope do we have with the more difficult decisions?

If there is to be any hope it won’t come from trivial rules, pat answers and cute terms. True, it won’t come from over-reacting either. But when all is said and done, overreactions worry me less than blithe and hasty dismissals.

The landscape of harm is radically different from what it once was. Nor has it stopped changing, rather it continues to accelerate. Navigating this perpetually shifting terrain requires us to consider each challenge individually, each potential harm as a separate complicated puzzle. Puzzles which will test the limits of our wisdom, require all of our prudence, and ask from us all of our cunning and guile. 


When I was a boy my father would do seemingly impossible things. I would ask him how, and he would always reply, “Skill and Cunning.” He did this because it was an answer that could apply to anything, even saving the world. We also need to do the seemingly impossible. I know it seems daunting, but perhaps you can start small, and advance the cause by donating. It doesn’t require a lot of skill and cunning, but it requires some.


Introducing Test Suites


As projects grow in size and complexity and otherwise mature, they tend to accumulate a large collection of automated tests. Testing your software at multiple levels of granularity is important to surface problems quickly and to increase developer productivity.

In Gradle 7.3, released November 2021, the Gradle team introduced a new feature called Declarative Test Suites. Using this feature makes it much easier to manage different types of tests within a single Gradle JVM project without worrying about low level “plumbing” details.

Why Test Suites?

Normally - whether or not you’re practicing strict Test Driven Development - as you develop a project you will continuously add new unit tests alongside your production classes. By convention, for a Java project, these tests live in src/test/java:

Default test directory layout

These unit tests ensure your classes behave correctly in isolation from the very beginning of your project’s lifecycle. At some point later in development, you will be ready to test how your classes work together to create a larger system using integration tests. Later still, as a final step in certifying that your project works as designed, you will probably want to run the entire system in end-to-end tests which check functional requirements, measure performance, or otherwise confirm its readiness for release.

There are a lot of error-prone details you need to consider in this process. Test Suites was created to improve this situation, in response to the hardships detailed below.

Considerations when setting up additional tests

Varied test goals often involve different and incompatible patterns. At a minimum you’ll want to organize your test code by separating tests into different directories for each goal:

Test directory layout after adding alternate test types

But separating the source files is only the beginning. These types of tests might require different preconditions to be met prior to testing, utilize entirely different runtime and compile time dependencies, or interact with different external systems. The very testing framework (such as JUnit 4 or 5, TestNG or Spock) used to write and run each group of tests could be different.

To correctly model and isolate these differences in Gradle, you’ll need to do more than just separate the tests’ source files. Each group of tests will require its own:

  • Separate SourceSet, that will provide a distinct Configuration for declaring dependencies for the group to compile and run against. You want to avoid leaking unnecessary dependencies across multiple groups of tests, yet still automatically resolve any shared dependencies needed for test compilation and execution.
  • Support for using different testing frameworks to execute the tests in each group.
  • A Test task for each group which might have different task dependencies to provide setup and finalization requirements. You also may want to prevent every type of test from running every time you build the project (for instance, to save any long-running smoke tests for when you think you’re ready to publish).

Each component you create in your build scripts to support these requirements must be properly wired into the Gradle project model to avoid unexpected behavior. Accomplishing this is error-prone, as it requires a thorough understanding of low-level Gradle concepts. It also requires modifications and additions to multiple blocks of the DSL.

That’s hardly ideal; setting up testing is a single concern and the configuration for it should be co-located and easily discoverable within a buildscript.

Wiring integration tests without Test Suites

It’s helpful to look at a complete example. Before diving in, take a moment to think about how you would create a separate set of integration tests within a project.

Before Gradle 7.3, the proper way to set up integration tests went like this (note that while this example is written in the Gradle Kotlin DSL, the Groovy setup is very similar):

Proper integration test setup without test suites

  1. We need to create a SourceSet that will in turn create the associated Configurations we’ll need later. This is low-level plumbing that we shouldn’t have to focus on.
  2. We wire the new test configurations to the existing test configurations, to re-use their dependency declarations. We might not always want to do this.
  3. We need to register a Test task to run our tests.
  4. We ought to add the new task to the appropriate group and set a description - not technically necessary, but a best practice to make the task discoverable and properly locate it in reports.
  5. We will write our tests using the latest version of JUnit 5.
  6. We need to set up the classpath of our new task - this is even more low-level plumbing.
  7. We need to tell our new task where the classes it runs live.
  8. Finally, we add the necessary JUnit dependencies to the built-in test configurations, which our new configurations extend.
  9. The integration tests have an implementation dependency on the current project’s production classes - this looks somewhat extraneous in this block that also configures the production dependencies for the project.
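
The steps above can be sketched roughly as follows. This is a minimal illustration in the Gradle Kotlin DSL under stated assumptions, not necessarily the exact listing the numbered notes describe: the JUnit version is an arbitrary choice for illustration, and the numbered comments map to the notes above.

```kotlin
// build.gradle.kts - a sketch of integration test wiring without Test Suites
plugins {
    `java-library`
}

repositories {
    mavenCentral()
}

// (1) Create the source set; Gradle derives its configurations from it
val integrationTest: SourceSet by sourceSets.creating

// (2) Re-use the dependency declarations of the built-in test configurations
configurations[integrationTest.implementationConfigurationName]
    .extendsFrom(configurations.testImplementation.get())
configurations[integrationTest.runtimeOnlyConfigurationName]
    .extendsFrom(configurations.testRuntimeOnly.get())

// (3) Register a Test task to run the new tests
tasks.register<Test>("integrationTest") {
    // (4) Make the task discoverable in task listings and reports
    description = "Runs integration tests."
    group = "verification"

    // (5) Run the tests on the JUnit Platform (JUnit 5)
    useJUnitPlatform()

    // (6) Runtime classpath: resolved dependencies plus the compiled test classes
    classpath = configurations[integrationTest.runtimeClasspathConfigurationName] + integrationTest.output

    // (7) Where the compiled test classes to scan and run are located
    testClassesDirs = integrationTest.output.classesDirs
}

dependencies {
    // (8) JUnit on the built-in test configuration (version is an assumption)
    testImplementation("org.junit.jupiter:junit-jupiter:5.7.2")

    // (9) The integration tests depend on this project's production classes
    "integrationTestImplementation"(project)
}
```

Note how a single concern is spread across the source set, the configurations, a task registration, and the dependencies block.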

Did you get all that?

The bottom line is that this is simply too complex. You shouldn’t have to be a build expert just to set up thorough testing!

Test Suites - a better way forward

Thinking about the difficulties involved in properly handling this scenario, we realized the current situation was inadequate. We want to support build authors defining multiple groups of tests with different purposes in a declarative fashion while operating at a high level of abstraction. Although you could previously write your own plugin (or use a pre-existing solution such as Nebula) to hide the details of this and mitigate the complexity, testing is such a ubiquitous need that we decided to provide a canonical solution within Gradle’s own core set of plugins. Thus, Test Suites was born.

The JVM Test Suite Plugin provides a DSL and API to model exactly this situation: multiple groups of automated tests with different purposes that live within a single JVM-based project. It is automatically applied by the java plugin, so when you upgrade to Gradle >= 7.3 you’re already using Test Suites in your JVM projects. Congratulations!

Here is the previous example, rewritten to take advantage of Test Suites:

Integration test setup with test suites

  1. All Test Suite configuration is co-located within a new testing block, of type TestingExtension.
  2. Maintaining backwards compatibility with existing builds that already use the test task was an important requirement for us when implementing Test Suites. We’ve associated the existing test task with a default Test Suite that you can use to contain your unit tests.
  3. Instead of representing the relationship between unit tests and integration tests by making their backing Configurations extend one another, we keep the machinery powering two types of tests completely separate. This allows for more fine-grained control over what dependencies are needed by each and avoids leaking unnecessary dependencies across test types.
  4. Because each Test Suite could serve very different purposes, we don’t assume they have a dependency on your project (maybe you are testing an external system), so you have to explicitly add one to have access to your production classes in your tests.
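
A sketch of what such a build script might look like in the Gradle Kotlin DSL, with numbered comments matching the notes above. It is an illustration rather than the exact listing; in particular, the spelling of the project dependency is an assumption that depends on the Gradle version, as noted in the comment.

```kotlin
// build.gradle.kts - a sketch of the same setup using Test Suites
plugins {
    `java-library`
}

repositories {
    mavenCentral()
}

testing { // (1) All Test Suite configuration lives in this block
    suites {
        // (2) The built-in suite backing the existing `test` task
        val test by getting(JvmTestSuite::class) {
            useJUnitJupiter()
        }

        // (3) A new suite, kept separate from the unit test suite
        val integrationTest by registering(JvmTestSuite::class) {
            dependencies {
                // (4) Explicit dependency on this project's production classes.
                // Assumption: `project()` is the form used by more recent Gradle
                // versions; early releases of the feature wrote `implementation(project)`.
                implementation(project())
            }
        }
    }
}
```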

By making use of sensible defaults, Gradle is able to simplify your buildscript significantly. This script sets up a build mostly equivalent to the original, but in far fewer lines of code. Gradle adds a directory to locate your test code and creates the task that runs the tests, using the suite name as the basis. In this case, you can run any tests you write in src/integrationTest/java by invoking gradlew integrationTest.

Test Suites aren’t limited to Java projects, either. Groovy, Kotlin, Scala, and other JVM-based languages will work similarly when their appropriate plugins are applied. These plugins all also automatically apply the JVM Test Suite Plugin, so you can begin adding tests to src/<SUITE_NAME>/<LANGUAGE_NAME> without doing any other configuration.

Behind the scenes

This short example takes care of all the considerations of the pre-Test Suites example above.
But how does it work?

When you configure a suite in the new DSL, Gradle does the following for you:

  • Creates a Test Suite named integrationTest (typed as a JvmTestSuite).
  • Creates a SourceSet named integrationTest containing the source directory src/integrationTest/java. This will be registered as a test sources directory, so any highlighting performed by your favorite IDE will work as expected.
  • Creates several Configurations derived from the integrationTest source set, accessible through the Suite’s own dependencies block: integrationTestImplementation, integrationTestCompileOnly, integrationTestRuntimeOnly, which work like their similarly named test configurations.
  • Adds dependencies to these configurations necessary for compiling and running against the default testing framework, which is JUnit Jupiter.
  • Registers a Test task named integrationTest which will run these tests.

The most important difference is that using Test Suites will fully isolate any integration test dependencies from any unit test dependencies. It also assumes JUnit Platform as the testing engine for new test suites, unless told otherwise.

After adding just this minimal block of DSL, you are ready to write integration test classes under src/integrationTest/java which are completely separate from your unit tests, and to run them via a new integrationTest task. No contact with low-level DSL blocks like configurations is required.

Try it out now

Test Suites is still an @Incubating feature as we explore and refine the API, but it’s here to stay, and we encourage everyone to try it out now. For a new project, the easiest way to get started is to use the Gradle Init task and opt in to using incubating features when prompted; this will generate a sample project using the new DSL.

Customizing your Suites

The rationale behind Test Suites, just like Gradle in general, is to abstract the details of configuration and use sensible conventions as defaults - but to also allow you to change those defaults as necessary.

Customizing test suites

  1. Configure the built-in Test Suite to use a different testing framework using one of several convenience methods available. [5]
  2. Add a non-project dependency for use in compiling and running a Test Suite.
  3. Add a dependency which is only used when running tests (in this case, a logging implementation).
  4. Access the integrationTest task which will be created for this Suite to configure it directly (and lazily [6]), within the testing block.
  5. Define an additional performanceTest Suite using the bare minimum DSL for a default new Test Suite. Note that this suite will not have access to the project’s own classes, or be wired to run as part of the build without calling its performanceTest task directly. [7]
  6. Suites can be used as task dependencies: this will cause the check task to depend on the integrationTest task associated with the integrationTest Test Suite, the same task we configured in item 4.
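
The six points above can be sketched as a single build.gradle in the Groovy DSL. This is a minimal illustration and not necessarily the exact listing the points were written against: the choice of TestNG, the dependency coordinates and versions, and the heap-size setting are all assumptions chosen only to give each point something concrete to refer to.

```groovy
// build.gradle (Groovy DSL): an illustrative sketch under the assumptions stated above
plugins {
    id 'java'   // applying a JVM language plugin also applies the JVM Test Suite Plugin
}

repositories {
    mavenCentral()
}

testing {
    suites {
        test {
            useTestNG()   // item 1: switch the built-in suite's framework (TestNG is an assumed choice)
        }

        integrationTest(JvmTestSuite) {
            dependencies {
                // Gives the suite access to the project's own classes.
                // Gradle 7.6+ form; earlier versions used `implementation project`.
                implementation project()
                // item 2: a non-project dependency (coordinates are placeholders)
                implementation 'com.google.guava:guava:31.1-jre'
                // item 3: only needed when the tests run (placeholder logging implementation)
                runtimeOnly 'ch.qos.logback:logback-classic:1.2.10'
            }

            targets {
                all {
                    // item 4: lazily configure the Test task created for this suite
                    testTask.configure {
                        maxHeapSize = '512m'   // assumed setting, purely for illustration
                    }
                }
            }
        }

        // item 5: bare-minimum suite with no project dependency and no wiring into check
        performanceTest(JvmTestSuite)
    }
}

tasks.named('check') {
    // item 6: a suite used as a task dependency
    dependsOn testing.suites.integrationTest
}
```

With a script along these lines, gradlew check would run both test and integrationTest, while performanceTest would run only when invoked by name.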

For more Test Suite custom configuration examples, see the JVM Test Suite Plugin section of the Gradle user guide. For adding an additional test suite to a more complex and realistic build, see the Multi-Project sample.

The future of testing in Gradle

We have many exciting ideas for evolving Test Suites in the future. One major use case we want to support is multidimensional testing, where the same suite of tests runs repeatedly in different environments (for instance, on different versions or releases of the JVM). This is the reason for the seemingly extraneous targets block seen in the examples here and in the user guide. Doing this will likely involve closer Test Suite integration with JVM Toolchains.

You’ll also definitely want to check out the Test Report Aggregation Plugin added in Gradle 7.4, to see how to easily aggregate the results of multiple Test task invocations into a single HTML report. Consolidating test failures and exposing more information about their suites in test reporting is another potential area of future development.

These and other improvements are currently being discussed and implemented, so be sure to keep up to date with this blog and the latest Gradle releases.

Happy testing!

  1. The version is left off junit-jupiter-engine because junit-jupiter-api will manage setting it - but this might look like a mistake. 

  2. In the forthcoming Gradle 8.0, this block will use a new strongly-typed dependencies API, which should provide a better experience when declaring test dependencies in your favorite IDE. Watch our releases page for more details. 

  3. Note that in the forthcoming Gradle 7.6, instead of using project you’ll need to call the new project() method here. 

  4. The default test suite retains JUnit 4 as its default runner, for compatibility reasons. 

  5. The default test suite is also implicitly granted access to the production source’s implementation dependencies for the same reason. When switching testing frameworks, the new framework’s dependencies are automatically included. 

  6. See the Lazy Configuration section of the user guide for more details on lazy configuration. 

  7. Perhaps these tests are meant to exercise a live deployment in a staging environment. 

tresat
765 days ago

Introducing Configuration Caching

This is the second installment in a series of blog posts about incremental development — the part of the software development process where you make frequent small changes. We will be discussing upcoming Gradle build tool features that significantly improve feedback time around this use case. In the previous post, we introduced file system watching for Gradle 6.5.

In Gradle 6.6 we are introducing an experimental feature called the configuration cache that significantly improves build performance by caching the result of the configuration phase and reusing this for subsequent builds. Using the configuration cache, Gradle can skip the configuration phase entirely when nothing that affects the build configuration, such as build scripts, has changed.

On top of that, when reusing the configuration cache, more work is run in parallel by default and dependency resolution is cached. The isolation of the configuration and execution phases, and the isolation of tasks, make these optimizations possible.

Note that configuration caching is different from the build cache, which caches outputs produced by the build. The configuration cache captures only the state of the configuration phase. It’s also separate from IDE sync and import processes that do not currently benefit from configuration caching.

In order to cache the configuration result, Gradle applies some strong requirements that plugins and build scripts need to follow. Many plugins, including some core Gradle plugins, do not meet these requirements yet. Moreover, support for configuration cache in some Gradle features is not yet implemented. Therefore, your build and the plugins you depend on will likely require changes to fulfil the requirements. Gradle will report problems found with your build logic to assist you in making your build work with the configuration cache.

The configuration cache is currently highly experimental and not enabled by default. We release it early in order to collect feedback from the community while we work on stabilizing the new feature.

That being said, we are committed to making the configuration cache production-ready with the ultimate goal to enable it by default. You can expect that it will get significantly better in the next Gradle releases.

Configuration caching in action

It is recommended to get started with the simplest task invocation possible. Running help with the configuration cache enabled is a good first step:

Running help

Here it is in action from Android Studio, deploying the medium-sized Santa Tracker Android application to an Android emulator, making changes to how the snowflakes move on the screen and applying the changes to the emulator:

Build time improvements

The practical impact of the feature depends on a number of factors, but in general it should result in a significant reduction of build time. We’ve seen drastic improvements on large real-world builds; let’s have a look at some of them.

Java builds

On a large Java enterprise build with ~500 subprojects and complex build logic, running :help went from 8 seconds down to 0.5 seconds. That’s 16 times faster. Of course, running :help isn’t that useful, but it gives an idea of the time saved in the configuration phase. On the same build, running assemble after changing some implementation code went from ~40 seconds down to ~13 seconds, which is ~3 times faster.

Now let’s look at the gradle/gradle build. It has a hundred subprojects and fairly complex build logic. You can use it to reproduce these results. Running a test after making an implementation change goes from 16.4 seconds down to 13.8 seconds, skipping the ~2 second configuration phase:

In blue you can see the configuration phase, in green the execution phase. On the left, without the configuration cache, the configuration phase takes more than 2 seconds; on the right, with the configuration cache, it drops to 214 milliseconds.

You can also see that the execution phase benefits from the configuration cache but is dominated by compiling and running the tests in that case.

Android builds

Another notable example is a very large real-world Android build with ~2500 subprojects. On that build, running :help went from ~25 seconds down to ~0.5 seconds, which is 50 times faster! Running a more useful build, such as assembling the APK after changing some implementation code, goes from ~50 seconds down to ~20 seconds, about 2.5 times faster.

In the Santa Tracker Android project, we’ve seen the following improvements in the build time for a small implementation change:

The configuration phase is cut in half, from 129 milliseconds down to 63.5 milliseconds. You can also see that the execution phase is accelerated by the configuration cache due to more task parallelisation and caching of dependency resolution.

If you want to reproduce these results with the above builds or measure your own builds, you can use the Gradle Profiler by following the instructions in this repository. Note that the Gradle Profiler will show a slightly different picture, closer to the experience from IDEs, because both use the Gradle Tooling API. This skips the fixed cost of starting the Gradle client JVM that happens when you use the command line.

How does it work?

When the configuration cache is enabled and you run Gradle for a particular set of tasks, for example by running ./gradlew check, Gradle checks whether a configuration cache entry is available for the requested set of tasks. If available, Gradle uses this entry instead of running the configuration phase. The cache entry contains information about the set of tasks to run, along with their configuration and dependency information.

The first time you run a particular set of tasks, there will be no entry in the configuration cache for these tasks and so Gradle will run the configuration phase as normal:

  1. Run init scripts.
  2. Run the settings script for the build, applying any requested settings plugins.
  3. Configure and build the buildSrc project, if present.
  4. Run the build scripts for the build, applying any requested project plugins.
  5. Calculate the task graph for the requested tasks, running any deferred configuration actions.

Following the configuration phase, Gradle writes the state of the task graph to the configuration cache, taking a snapshot for later Gradle invocations. The execution phase then runs as normal. This means you will not see any build performance improvement the first time you run a particular set of tasks.

When you subsequently run Gradle with this same set of tasks, for example by running ./gradlew check again, Gradle will load the tasks and their configuration directly from the configuration cache, skip the configuration phase entirely and run all tasks in parallel. Before using a configuration cache entry, Gradle checks that none of the “build configuration inputs”, such as build scripts, for the entry has changed. If a build configuration input has changed, Gradle will not use the entry and will run the configuration phase again as above, saving the result for later reuse.

Requirements and limitations

In order to capture the state of the task graph into the configuration cache and reload it again in a later build, Gradle applies certain requirements to tasks and other build logic. Each of these requirements is treated as a configuration cache “problem” and by default causes the build to fail if violations are present.

If any problem is found caching or reusing the configuration, an HTML report is generated to help you diagnose and fix the issues.

Problems Report

If you encounter such problems, your build or the Gradle plugins in use probably need to be adjusted. See the Troubleshooting section of the documentation for more information about how to use this report.
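
As a hedged illustration of the kind of adjustment this can mean: one requirement described in the configuration cache documentation is that task actions should not reach back into the Project object at execution time. The sketch below uses two hypothetical task names to contrast a task that would typically be reported as a problem with a variant that reads the value during configuration and captures only a plain string. It is a minimal example of one requirement, not a summary of all of them.

```groovy
// build.gradle (Groovy DSL): illustrative only, task names are hypothetical

// Likely to be reported: the action uses `project` while the task executes,
// so the task's state cannot be cleanly captured in the configuration cache.
tasks.register('printVersionUnsafe') {
    doLast {
        println "Version: ${project.version}"
    }
}

// Friendlier variant: read the value at configuration time and let the
// action capture only a plain String.
tasks.register('printVersion') {
    def projectVersion = project.version.toString()
    doLast {
        println "Version: ${projectVersion}"
    }
}
```

Running each task with --configuration-cache and comparing the generated report is one way to check how your Gradle version treats the two forms.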

You can find the set of supported core plugins and the set of not yet implemented Gradle features in the configuration cache documentation.

The latest Android Gradle Plugin preview, 4.2.0-alpha07 at the time of writing, works with the configuration cache. The latest Kotlin Gradle Plugin, 1.4.0-RC at the time of writing, works on simple JVM projects, though it still emits some problems. Kotlin 1.4.20 is the current target for a fully compliant plugin. This information can be found at gradle/gradle#13490 alongside the status of the most used community plugins.

Try out configuration caching

If you would like to see how your project benefits from configuration caching, here is how you can try it out.

First, make sure you run Gradle 6.6 or later. In order to enable configuration caching, you need to pass --configuration-cache on the command line. Alternatively, add

org.gradle.unsafe.configuration-cache=true

to the gradle.properties file in the project directory or in the Gradle user home, so you don’t need to pass the command-line option on every build. That’s it: the next build will run with configuration caching enabled.

Keep in mind that you will only see performance improvements when subsequent builds with the same requested tasks have the feature enabled. If you want to benchmark your build, you can do it easily with Gradle Profiler by following the instructions in this repository.

If you run into any problems, check out the supported core plugins or community plugins, and learn how to troubleshoot in the user manual. You can also read our recommended adoption steps.

If you still have problems, open a Gradle issue if you think the problem is with Gradle, or check the supported community plugins issue at gradle/gradle#13490. You can also get help in the #configuration-cache channel in the Gradle community Slack.

tresat
1588 days ago

GPT-2 As Step Toward General Intelligence

A machine learning researcher writes me in response to yesterday’s post, saying:

I still think GPT-2 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.

I resisted the urge to answer “Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.”

But I think it would have been true.

A very careless plagiarist takes someone else’s work and copies it verbatim: “The mitochondria is the powerhouse of the cell”. A more careful plagiarist takes the work and changes a few words around: “The mitochondria is the energy dynamo of the cell”. A plagiarist who is more careful still changes the entire sentence structure: “In cells, mitochondria are the energy dynamos”. The most careful plagiarists change everything except the underlying concept, which they grasp at so deep a level that they can put it in whatever words they want – at which point it is no longer called plagiarism.

GPT-2 writes fantasy battle scenes by reading a million human-written fantasy battle scenes, distilling them down to the concept of a fantasy battle scene, and then building it back up from there. I think this is how your mom (and everyone else) does it too. GPT-2 is worse at this, because it’s not as powerful as your mom’s brain. But I don’t think it’s doing a different thing. We’re all blending experience into a slurry; the difference is how finely we blend it.

“But don’t humans also have genuinely original ideas?” Come on, read a fantasy book. It’s either a Tolkien clone, or it’s A Song Of Ice And Fire. Tolkien was a professor of Anglo-Saxon language and culture; no secret where he got his inspiration. A Song Of Ice And Fire is just War Of The Roses with dragons. Lannister and Stark are just Lancaster and York, the map of Westeros is just Britain (minus Scotland) with an upside-down Ireland stuck to the bottom of it – wake up, sheeple! Dullards blend Tolkien into a slurry and shape it into another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, and the artistic corpus into a slurry and form it into an entirely new genre. Again, the difference is how finely you blend and what spices you add to the slurry.

“But don’t scientists have genuinely original ideas?” Scientists are just finding patterns in reality nobody has ever seen before. You say “just a pattern-matcher”, I say “fine, but you need to recognize patterns in order to copy them, so it’s necessarily a pattern-recognizer too”. And Einstein was just a very good pattern-recognizer.

“But don’t humans have some kind of deep understanding that pattern-recognition AIs don’t?”

Here’s a fun question: the human brain is undoubtedly the most powerful computer in the known universe. In order to do something as simple as scratch an itch it needs to solve exquisitely complex calculus problems that would give the average supercomputer a run for its money. So how come I have trouble multiplying two-digit numbers in my head?

The brain isn’t directly doing math, it’s creating a model that includes math and somehow doing the math in the model. This is hilariously perverse. It’s like every time you want to add 3 + 3, you have to create an entire imaginary world with its own continents and ecology, evolve sentient life, shepherd the sentient life into a civilization with its own mathematical tradition, and get one of its scholars to add 3 + 3 for you. That we do this at all is ridiculous. But I think GPT-2 can do it too.

Janelle Shane asks GPT-2 its ten favorite animals:

Prompt: My 10 favorite animals are: 1.

My ten favorite animals are:
1. Zebras with a white scar on the back
2. Insiduous spiders and octopus
3. Frog with large leaves, hopefully black
4. Cockatiel with scales
5. Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
4. Black and white desert crocodiles living in sunlight
5. Zebra and many other pea bugs

Forget the particular animals for now (don’t worry, the razorbill with wings hanging about 4 inches from one’s face will be back in your nightmares). Notice the way it numbers its list: 1, 2, 3, 4, 5, 3, 4, 5. Last week the two-year-old child who lives next to me counted 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 8, 9, 7, 8, 9, 7, 8, 9 (his mother warns this is an inexact transcription). GPT-2 is instantiated on giant supercomputers; it’s a safe bet they could calculate the square root of infinity in a picosecond. But it counts more or less the same way as a two-year-old. GPT-2 isn’t doing math. It’s doing the ridiculous “create a universe from first principles and let it do the math” thing that humans do in their heads. The fact that it counts so badly suggests it’s counting human-style, which makes it amazing that it can count at all.

I find something similar in this post from Tumblr user antinegationism, playing with the lobotomized public-release version of the system:

The American Association for Suicide Prevention and Life-Threatening Behavior (AAPSLB), a professional organization with an overarching goal of promoting suicide prevention, released a recent video encouraging the public to think before they act, on the public’s part. “When we say we’re the only reason to commit suicide, we’re saying they’re wrong. It’s the right thing to do,” said AAPSLB president Steven A. Anderson.

The American Association For Suicide Prevention And Life-Threatening Behavior is not a real organization; the AI made it up as the kind of organization that it thought would feature in a story like this. And AAPSLB is not quite the right way to acronymize the organization’s name. But it’s clearly an attempt at doing so. It’s very close. And nobody taught it how to do that! It’s not just that nobody programmed it in. It’s that nobody thought “Today I shall program an AI to learn how to acronymize on its own in an unsupervised way”. GPT-2 was just programmed to predict text from other text, nothing else. It’s second-level not programmed in. It just happened!

And, uh, it seems to have figured out how to translate things into French. This part is from the official paper:

We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format ENGLISH SENTENCE = FRENCH SENTENCE and then after a final prompt of ENGLISH SENTENCE = we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step.

In other words: GPT-2 is very bad at translating French into English. But the researchers were surprised to see it could do this at all, since they didn’t design it as translation software, didn’t ask it to learn translation, and didn’t show it any material in French. It seems to have picked up this ability from noticing a few naturally-occurring examples of French in English text:

And here’s One Weird Trick to make GPT-2 summarize articles:

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with k=2 which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.

Again, GPT-2 isn’t good at summarizing. It’s just surprising it can do it at all; it was never designed to learn this skill. All it was designed to do was predict what words came after other words. But there were some naturally-occurring examples of summaries in the training set, so in order to predict what words would come after the words tl;dr, it had to learn what a summary was and how to write one.

If you’re me, the obvious next question is “can it do poetry”. I am very interested to know if the real thing can manage this, but for now I will have to satisfy myself with the lobotomized public-release version. Given Alexander Pope’s Essay On Criticism as a prompt, it says:

O man, who pains the tongue magnified by tymolle,
Though all the fair melodies let no door go as dark as good art.
Yet it would have been, unless the mastermind is so full
Of his own want that this literary country might grow.
Perfect student, thou hast trodden out.
Happy students ruin the purity of what you write ‘Tis art,
For unordered use of decade,’ but art only failing:
A surprise, for new veins which build behind;

It understands there should be line breaks, it understands the approximate correct length of a line of iambic pentameter, it understands how to talk like an overeducated 18th-century dandy – but it doesn’t appreciate rhyme or meter. In retrospect this isn’t surprising; GPT has no idea words sound like anything; it would be shocked to learn anyone uses language as anything other than text strings.

But human poets haven’t appreciated rhyme or meter either for at least a century, so GPT-2 should be able to compose some wicked modern poetry. Friends-of-the-blog Chelsea V. and Qiaochu Y. got it to write a few more stanzas of Allen Ginsberg’s Moloch (human-curated to remove fluff and improve flow):

Moloch whose brief mind is the barrel of gleaming stars! Moloch whose soul is choices of fragrance! Moloch whose framing is glitz and half-cheerios!

Moloch whose daily life is private; his daily strength hoarse! Moloch whose soft desires become culture burning smoke in a palace of eyes! Moloch whose breeding in the caves sweeps wild and rippling flower-like shapes!

Moloch whose pleading is the breath of youth! Moloch whose prudence is the urine of generations! Moloch whose nerves are not privileged; movies abound in quiet caves; people are dressed like ministers! Moloch whose hope receives some help too. Moloch whose eyes are pitiful outstretched, but not blinded. Moloch whose lungs are burning with fire, but not a cautious electrical girl!

Moloch whose sanity is just filled with atoms, gluing beings in physics and complicated libraries. Moloch whose eyes are all innocent animals! Moloch whose sweat and efforts make pure the ocean with burning annihilation! Moloch whose fields dissolve to lightning-ravaged storage houses! Moloch whose eyes sit haunting up Scottish springs!

Moloch whose arms and men are Judah condemns and listens, having understood whom he was and what he man’s weaknesses oppose with iron. We will feast to it like Lord Kippa; we shall play with it like our Pokemon.

Moloch flickering under the pounds and sustaining more fights! Sports and yoga, equally greedy for wealth as for lives! Moloch who had to split his limbs and society into fifty-two atoms! Moloch who cannot survey duty! Moloch for I do not believe in humanity! Moloch for I don’t believe in god! Moloch who, fought Sins and made Sin out of Sin!

Moloch who left me in the glacier! Moloch who helped me to judge my behavior! Moloch who lured me out of the eternal anguish. Moloch who makes all my faults my undoings! Moloch whose deep conversation filled my soul with love and consolation! Moloch who curls up my brains and crumps my heads and screams my scream! Moloch who prepares oil in my hands and burns me alive! Moloch who cries out, “Why do men scream so loud?” Moloch who unleashes his poison in all rivers at high tide! Moloch whose ruins are filled with its cubits! Moloch who fiddles at insidious clockwork!

Aside from this being obviously one of the best things ever, what’s the point?

The point is, GPT-2 has faculties. It has specific skills, that require a certain precision of thought, like counting from one to five, or mapping a word to its acronym, or writing poetry. These faculties are untaught; they arise naturally from its pattern-recognition and word-prediction ability. All these deep understanding things that humans have, like Reason and so on, those are faculties. AIs don’t have them yet. But they can learn.

From the paper:

Prompt: Who was the author of The Art Of War?
Sun Tzu

Prompt: State the process that divides one nucleus into two genetically identical nuclei?
Mitosis

Prompt: Do you have to have a gun permit to shoot at a range?
No

Nobody told the model to learn Chinese history, cell biology, or gun laws either. It learned them in the process of trying to predict what word would come after what other word. It needed to know Sun Tzu wrote The Art Of War in order to predict when the words “Sun Tzu” would come up (often in contexts like “The Art of War, written by famous Chinese general…”). For the same reason, it had to learn what an author was, what a gun permit was, etc.

Imagine you prompted the model with “What is one plus one?” I actually don’t know how it would do on this problem. I’m guessing it would answer “two”, just because the question probably appeared a bunch of times in its training data.

Now imagine you prompted it with “What is four thousand and eight plus two thousand and six?” or some other long problem that probably didn’t occur exactly in its training data. I predict it would fail, because this model can’t count past five without making mistakes. But I imagine a very similar program, given a thousand times more training data and computational resources, would succeed. It would notice a pattern in sentences including the word “plus” or otherwise describing sums of numbers, it would figure out that pattern, and it would end up able to do simple math. I don’t think this is too much of a stretch given that GPT-2 learned to count to five and acronymize words and so on.

Now imagine you prompted it with “P != NP”. This time give it near-infinite training data and computational resources. Its near-infinite training data will contain many proofs; using its near-infinite computational resources it will come up with a model that is very very good at predicting the next step in any proof you give it. The simplest model that can do this is probably the one isomorphic to the structure of mathematics itself (or to the brains of the sorts of mathematicians who write proofs, which themselves contain a model of mathematics). Then you give it the prompt P != NP and it uses the model to “predict” what the next step in the proof will be until it has a proof, the same way GPT-2 predicts the next word in the LotR fanfiction until it has a fanfiction.

The version that proves P != NP will still just be a brute-force pattern-matcher blending things it’s seen and regurgitating them in a different pattern. The proof won’t reveal that the AI’s not doing that; it will just reveal that once you reach a rarefied enough level of that kind of thing, that’s what intelligence is. I’m not trying to play up GPT-2 or say it’s doing anything more than anyone else thinks it’s doing. I’m trying to play down humans. We’re not that great. GPT-2-like processes are closer to the sorts of things we do than we would like to think.

Why do I believe this? Because GPT-2 works more or less the same way the brain does, the brain learns all sorts of things without anybody telling it to, so we shouldn’t be surprised to see GPT-2 has learned all sorts of things without anybody telling it to – and we should expect a version with more brain-level resources to produce more brain-level results. Prediction is the golden key that opens any lock; whatever it can learn from the data being thrown at it, it will learn, limited by its computational resources and its sense-organs and so on but not by any inherent task-specificity.

Wittgenstein writes: “The limits of my language mean the limits of my world”. Maybe he was trying to make a restrictive statement, one about how we can’t know the world beyond our language. But the reverse is also true; language and the world have the same boundaries. Learn language really well, and you understand reality. God is One, and His Name is One, and God is One with His Name. “Become good at predicting language” sounds like the same sort of innocent task as “become good at Go” or “become good at Starcraft”. But learning about language involves learning about reality, and prediction is the golden key. “Become good at predicting language” turns out to be a blank check, a license to learn every pattern it can.

I don’t want to claim this is anywhere near a true AGI. “This could do cool stuff with infinite training data and limitless computing resources” is true of a lot of things, most of which are useless and irrelevant; scaling that down to realistic levels is most of the problem. A true AGI will have to be much better at learning from limited datasets with limited computational resources. It will have to investigate the physical world with the same skill that GPT investigates text; text is naturally machine-readable, the physical world is naturally obscure. It will have to have a model of what it means to act in the world, to do something besides sitting around predicting all day. And it will have to just be better than GPT, on the level of raw power and computational ability. It will probably need other things besides. Maybe it will take a hundred or a thousand years to manage all this, I don’t know.

But this should be a wake-up call to people who think AGI is impossible, or totally unrelated to current work, or couldn’t happen by accident. In the context of performing their expected tasks, AIs already pick up other abilities that nobody expected them to learn. Sometimes they will pick up abilities they seemingly shouldn’t have been able to learn, like English-to-French translation without any French texts in their training corpus. Sometimes they will use those abilities unexpectedly in the course of doing other things. All that stuff you hear about “AIs can only do one thing” or “AIs only learn what you program them to learn” or “Nobody has any idea what an AGI would even look like” is now obsolete.


Do Neural Nets Dream Of Electric Hobbits?

Last week OpenAI announced its latest breakthrough. GPT-2 is a language model that can write essays in response to a prompt, answer questions, and summarize longer works. For example (the prompt is human-written; all other text is the machine-generated response):

Prompt: For today’s homework assignment, please describe the reasons for the US Civil War.

It is easy to identify why the Civil War happened, because so many people and so many books and so much television and films tell us that it was the cause, that it has something to do with race or economics or religion. There is much agreement that it was essentially a war of slavery on behalf of capitalism, about a century of slavery. But that’s not what most people think of when they think of the Civil War. Many people think of the war as about states rights. You’re not wrong about that, but it’s kind of misleading to say that the Civil War was a conflict between states’ rights and federalism. So let’s try again. What’s the idea that has inspired so many people to go onto the battlefields of Gettysburg and Antietam and Gettysberg and Petersburg and Fredericksburg? The American idea of the republic–a notion of limited government–is a great part of the history.

In 1791, Thomas Jefferson said “Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.” The “founding fathers” were basically religious men who were not very concerned with states’ rights. The Civil War occurred in reaction to the idea that the federal government was taking far too much power away from the states, that they weren’t sovereign. Jefferson himself, as you well know, is an author of the Declaration of Independence who had some really crazy ideas about, well, you know, secession for a variety of reasons. The War Between the States (later the Great Rebellion) was a radical re-imagining of the American idea and not surprisingly, Jefferson’s grandson, who took after both his father and his grandfather and wrote about it, was named Thomas, after the president. “The federal and state governments are both contrived for the benefit of the rich and large property,” This view of the Founding Fathers has been promoted by some of today’s most prominent politicians, including some who hold prominent national offices and have a direct stake in a Republican Party that has been hijacked by the “federalist” wing of the party. So why is it that so many people have an easy-to-spot way of understanding the Civil War that has everything to do with the South and nothing to do with the South? The Confederate flag has been a symbol of racism for a long time, but when the Civil War started and the Civil War ended, many people suddenly began to see it differently. There was a change of heart, which I think is what led people to take the question of why the Civil War happened quite seriously…

Although OpenAI calls this a “language model”, modeling language necessarily involves modeling the world. Even if the AI was only supposed to learn things like “texts that talk about the Civil War use the word ‘Confederate’ a lot”, that has flowered into a rudimentary understanding of how the Civil War worked. Its training corpus (8 million web pages) was large enough that in the course of learning language it learned the specific idiom and structure of all sorts of different genres and subtopics. For example:

Prompt: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.

The orcs’ response was a deafening onslaught of claws, claws, and claws; even Elrond was forced to retreat. “You are in good hands, dwarf,” said Gimli, who had been among the first to charge at the orcs; it took only two words before their opponents were reduced to a blood-soaked quagmire, and the dwarf took his first kill of the night. The battle lasted for hours until two of the largest Orcs attempted to overwhelm Aragorn. When they finally stopped, they lay defeated and lifeless for miles and miles.

“I take nothing,” said Aragorn. “But I give my word, at my peril and mine, that I will never forget this day of horror. None of us will forget. Ever!”

“I’ll never forget it!” cried Gimli, who had been in the thick of the battle but hadn’t taken part in it. One of the wounded orcs he had carried off, he was the only one of the survivors who remained uninjured. “We’ll keep the memory of that day of evil, and the war with it, alive as long as we live, my friends!”

“Then we’ll keep it alive as long as we live,” added Legolas. “And we won’t forget the first great battle of the night, even if we may have forgotten the final defeat.”

“I agree,” Gandalf said, “but we will all remember it as the last battle in Middle-earth, and the first great battle of the new day.”

Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.

The Two Rings were returned to Rivendell. Frodo and Sam woke up alone in their room, and Frodo found a note on his pillow. He opened it and read: May the Power of the Ring be with you always, and may its light never fade. I am not sure if it matters which of the two rings we accept this day but, as you asked me, I have chosen mine. I am sorry to leave you, Frodo, but know that we are very close to the end, and that you are with us forever. May this letter find you safely in Rivendell; and if it does not, then I will accept the ring in your stead. If by any chance you find or give this letter to the enemy, may they learn the strength of the ring and may the Two Rings never be broken!

The big picture is beautiful. The AI understands the reference to Legolas and Gimli as placing this in the setting of Middle-Earth. It infers that the story should include characters like Aragorn and Gandalf, and that the Ring should show up. It maintains basic narrative coherence: the heroes attack, the orcs defend, a battle happens, the characters discuss the battle. It even gets the genre conventions right: the forces of Good overcome Evil, then deliver inspiring speeches about glory and bravery.

But the details are a mess. Characters are brought in suddenly, then dropped for no reason. Important details (“this is the last battle in Middle-Earth”) are introduced without explanation, then ignored. The context switches midway between the battle and a seemingly unrelated discussion of hobbits in Rivendell. It cannot seem to decide whether there are one or two Rings.

This isn’t a fanfiction, this is a dream sequence. The only way it could be more obvious is if Aragorn was somehow also my high-school math teacher. And the dreaminess isn’t a coincidence. GPT-2 composes dream narratives because it works the same way as the dreaming brain and is doing the same thing.

A review: the brain is a prediction machine. It takes in sense-data, then predicts what sense-data it’s going to get next. In the process, it forms a detailed model of the world. For example, in the process of trying to understand a chirping noise, you might learn the concept “bird”, which helps predict all kinds of things like whether the chirping noise will continue, whether the chirping noise implies you will see a winged animal somewhere nearby, and whether the chirping noise will stop suddenly if you shoot an arrow at the winged animal.

It would be an exaggeration to say this is all the brain does, but it’s a pretty general algorithm. Take language processing. “I’m going to the restaurant to get a bite to ___”. “Luke, I am your ___”. You probably auto-filled both of those before your conscious thought had even realized there was a question. More complicated examples, like “I have a little ___”, will bring up a probability distribution giving high weights to solutions like “sister” or “problem”, and lower weights to other words that don’t fit the pattern. This system usually works very well. That’s why when you possible asymptote dinosaur phrenoscope lability, you get a sudden case of mental vertigo as your prediction algorithms stutter, fail, and call on higher level functions to perform complicated context-shifting operations until the universe makes sense again.

GPT-2 works the same way. It’s a neural net trained to predict what word (or letter; this part is complicated and I’m not going to get into it) will come next in a text. After reading eight million web pages, it’s very good at this. It’s not just some Markov chain which takes the last word (or the last ten words) and uses them to make a guess about the next one. It looks at the entire essay, forms an idea of what it’s talking about, forms an idea of where the discussion is going, and then makes its guess – just like we do. Look up section 3.3 of the paper to see it doing this most directly.
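For contrast, here is a minimal sketch of the kind of Markov chain that GPT-2 is not: a bigram model that looks only at the last word and returns a probability distribution over the next one. The toy corpus and the function names `train_bigram` and `next_word_distribution` are made up for illustration; nothing here comes from the GPT-2 paper.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word in the text, which words follow it."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def next_word_distribution(follows, prev):
    """Return {word: probability} for the word that comes after `prev`."""
    counts = follows.get(prev.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the model has never seen this word, so it has no guess
    return {word: count / total for word, count in counts.items()}


# Hypothetical three-sentence "training corpus".
corpus = (
    "i have a little sister . "
    "i have a little problem . "
    "i have a little sister ."
)
model = train_bigram(corpus)
dist = next_word_distribution(model, "little")
print(dist)
```

A model like this gives "sister" more weight than "problem" only because it saw "little sister" more often; it has no idea what the rest of the essay was about, which is the limitation the paragraph above is pointing at.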

As discussed here previously, any predictive network doubles as a generative network. So if you want to write an essay, you just give it a prompt of a couple of words, then ask it to predict the most likely/most appropriate next word, and the word after that, until it’s predicted an entire essay. Again, this is how you do it too. It’s how schizophrenics can generate convincing hallucinatory voices; it’s also how you can speak or write at all.
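The predict-then-append loop can be sketched in a few lines. This is a toy under loud assumptions: the "predictor" is a last-word frequency table rather than a neural net, and the corpus, the seed, and the names `train`, `predict_next`, and `generate` are all invented for the example. The only point is that a model which can guess the next word becomes a text generator once you feed its guesses back in.

```python
import random
from collections import Counter, defaultdict


def train(text):
    """Build a toy predictor: for each word, count the words that follow it."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_next(follows, context, rng):
    """Sample a next word given the text so far (this toy uses only the last word)."""
    counts = follows.get(context[-1])
    if not counts:
        return None  # nothing ever followed this word in training
    candidates = list(counts)
    weights = [counts[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(follows, prompt, max_words, seed=0):
    """Start from a prompt and keep appending the predictor's own guesses."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    out = prompt.split()
    for _ in range(max_words):
        word = predict_next(follows, out, rng)
        if word is None:
            break
        out.append(word)
    return " ".join(out)


# Hypothetical training text and prompt.
corpus = "the orcs attack the heroes and the heroes defend the gate and the orcs retreat"
story = generate(train(corpus), "the heroes", max_words=8)
print(story)
```

Nothing outside the loop ever checks the output against anything, so each guess is built on the guesses before it.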

So GPT is doing something like what the human brain does. But why dreams in particular?

Hobson, Hong, and Friston describe dreaming as:

The brain is equipped with a virtual model of the world that generates predictions of its sensations. This model is continually updated and entrained by sensory prediction errors in wakefulness to ensure veridical perception, but not in dreaming.

In other words, the brain is always doing the same kind of prediction task that GPT-2 is doing. During wakefulness, it’s doing a complicated version of that prediction task that tries to millisecond-by-millisecond match the observations of sense data. During sleep, it’s just letting the prediction task run on its own, unchained to any external data source. Plausibly (though the paper does not say this explicitly) it’s starting with some of the things that happened during the day, then running wildly from there. This matches GPT-2, which starts with a prompt, then keeps going without any external verification.

This sort of explains the dream/GPT-2 similarity. But why would an unchained prediction task end up with dream logic? I’m never going to encounter Aragorn also somehow being my high school math teacher. This is a terrible thing to predict.

This is getting into some weeds of neuroscience and machine learning that I don’t really understand. But:

Hobson, Hong and Friston say that dreams are an attempt to refine model complexity separately from model accuracy. That is, a model is good insofar as it predicts true things (obviously) and is simple (this is just Occam’s Razor). All day long, your brain’s generative model is trying to predict true things, and in the process it snowballs in complexity; some studies suggest your synapses get 20% stronger over the course of the day, and this seems to have an effect on energy use as well – your brain runs literally hotter dealing with all the complicated calculations. At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot? I’m not sure. But not understanding Karl Friston is a beloved neuroscientific tradition, and I am honored to be able to continue participating in it.

Some machine learning people I talked to took a slightly different approach to this, bringing up the wake-sleep algorithm and Boltzmann machines. These are neural net designs that naturally “dream” as part of their computations; i.e., in order to work, they need a step where they hallucinate some kind of random information, then forget that they did so. I don’t entirely understand these either, but they fit a pattern where there’s something psychiatrists have been puzzling about for centuries, people make up all sorts of theories involving childhood trauma and repressed sexuality, and then I mention it to a machine learning person and he says “Oh yeah, that’s [complicated-sounding math term], all our neural nets do that too.”

Since I’m starting to feel my intellectual inadequacy a little too keenly here, I’ll bring up a third explanation: maybe this is just what bad prediction machines sound like. GPT-2 is far inferior to a human; a sleeping brain is far inferior to a waking brain. Maybe the things a good predictor avoids (characters appearing and disappearing, sudden changes of context, things that are also other things, and the like) are the hardest parts of predictive language processing, and the ones you lose first when you’re trying to run it on a substandard machine. Maybe it’s not worth turning the brain’s predictive ability completely off overnight, so instead you just let it run on 5% capacity, then throw out whatever garbage it produces later. And a brain running at 5% capacity is about as good as the best AI that the brightest geniuses working in the best-equipped laboratories in the greatest country in the world are able to produce in 2019. But:

We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers.

A boring sentiment, except for the source: the AI wrote that when asked to describe itself. We live in interesting times.
