tresat's blurblog

Where the Magic Doesn’t Happen by Andy Crouch
Thursday January 2^nd, 2025 at 9:17 AM

After Babel

Introduction from Jon Haidt:

In my first book, The Happiness Hypothesis, I found that ancient wisdom about human relationships, consciousness, and flourishing holds up well in light of modern psychology. But as the digital age drowns us in exponentially increasing rates of new content—most of which is trivial and ephemeral—it is becoming clear that almost everything more than a few years old gets buried by incoming content. This is a serious problem for the continuity of any civilization if most writing and ideas propagate laterally (from peer-to-peer) and very little propagates longitudinally, from generation-to-generation. Our godlike technology may be cutting us off from the accumulated and hard-won wisdom of humanity.

But there are still communities that maintain ties to ancient wisdom, communities in which adults share the work of morally forming the next generation, not just their own children. The clearest examples are religious communities in which home, school, and house of worship are the three main institutions that, when well coordinated, will root children in moral traditions and protect them from anomie. We have already covered the ways that Jewish and Christian communities are doing this, beginning by making their schools, churches and synagogues phone-free zones.

Today we bring you one of the deepest thinkers about society, technology, and religion, Andy Crouch. Andy has been an editor at Christianity Today, and he is currently a partner at Praxis, an organization that fosters mission-driven Christian entrepreneurship. He is also the author of several books including The Tech-Wise Family and The Life We’re Looking For: Reclaiming Relationship in a Technological World. I first met Andy in 2022 when we were brought together via the Trinity Forum. We discussed the biblical story of the Tower of Babel, and the role of technology in bringing us into a second post-Babel era. We spoke together again in 2024 at a Veritas Forum event at NYU, titled Hey Siri, How Do I Find Myself? A Conversation on Spirituality & Technology.

In these talks and in his books, Andy has developed a way of speaking about technology as “magic” that exposes in an instant the damage done to the moral development (or “formation”) of children when homes, schools, and churches become saturated by smartphones and other technologies. I think his argument is so profound, and so urgent for everyone involved in raising or educating children, that I invited Andy to write up a version of it just for readers of After Babel.

– Jon

Where the Magic Doesn’t Happen

Kids using phones during Church. Zach created using ChatGPT.

It’s amazing how often modern people talk about magic.

To be modern, almost by definition, is to live without putting much stock in a supernatural “beyond” to the world. And yet, nearly every time a new technology is introduced, its promoters reach back to the ancient idea of magic to capture its significance.

It was true of Steve Jobs at the dawn of the smartphone—Apple’s advertising for many years was saturated with promises of a “magical” experience. And now it is true of the next wave of tech. Sam Altmann of OpenAI opens a recent essay: “In the next couple of decades, we will be able to do things that would have seemed like magic to our grandparents.” Once you start looking for the technology-magic connection, you’ll find it nearly everywhere—especially in the promotion phase, before the technology actually arrives.

Even more surprising is how often we still talk about a specific magical tradition: the practice of alchemy. For centuries, alchemists sought to transmute all metals into gold, to escape the conditions of mortality, and perhaps even to create new forms of life that would answer to our command—all summed up in the quest for the substance known as “the Philosopher’s Stone.”

Now, if to be modern is to largely disbelieve in magic, surely to be modern is to know that the alchemists’ quest failed. If we think of alchemy at all, we think of it in contrast with a proper science like chemistry. The alchemists were wrong about the natural world—the chemists, after much trial and error, were right.1

So why is alchemy—a dead-end approach to the natural world—such a recurring theme in our own time? It is there, after all, in the best-selling novels in history, which had the Philosopher’s Stone right in the title of the first book—except in the US, where the publisher apparently judged that readers would miss the reference and changed the title to Harry Potter and the Sorcerer’s Stone. It’s there in Paolo Coelho’s mega-selling novel The Alchemist (it seems unlikely that a novel called The Chemist would have had the same success). It’s there, more subtly, in the title of Israeli historian Yuval Harari’s “brief history of tomorrow,” Homo Deus, which captures the deepest quest of the alchemists: to finally rise above our creaturely status and command the world like gods.

And then you read venture capitalist Marc Andreesen’s “Techno-Optimist Manifesto” from 2023 and find alchemy there not as metaphor, but as a straightforward statement of belief: “We believe Artificial Intelligence is our alchemy, our Philosopher’s Stone – we are literally making sand think.”

Here is what I think is going on. Alchemy failed as science, but it succeeded as a dream. Magic doesn’t “work,” in the sense that science works, but it does work as a dream. And technology is, after all, applied science. Applied to what? To a dream that was there long before science, the dream of magic.

Share

Think of magic, for the moment, as the quest for instant, effortless power—the ability to get things done without taking time and without requiring labor or toil. In the absence of magic (or technology), getting anything done requires some amount of time, sometimes a great deal of time. But what if you could get results without waiting?

Likewise, in the absence of magic (or technology), getting anything done requires effort, sometimes substantial effort. But what if you could get results with just the wave of a wand, or swipe of a finger, essentially effort-free? (Actually, waving a wand might still be too much effort. Tesla’s release notes for their updated “summon” technology promise, “It’s like magic, but with more tech and less wand-waving.”)

Until the dawn of the technological age, when we invented reliable machines and the engines to power them, work was invariably constrained by both time and effort. All work, you might say, proceeded at the speed of digestion. Human beings and domesticated animals cooperated to turn organic processes into energy that powered nearly all our work in the world (though for a few purposes we learned to draw on wind or water power).

About a century ago, however, we turned a corner. For millennia humanity had pursued what the Greeks called techne—the art, skill, and craft of traditional tools. Once machines arrived, and once cybernetics and computation made them reliable, portable, and powerful in more and more domains, we crossed into a new story—the story of devices. It is no coincidence that at this point we adopted a new word for human ingenuity—no longer tools and the art of toolmaking, but technology, the art of device-making. Devices made possible for ordinary people what magicians had been seeking all along: to get results almost instantly, almost effortlessly, with almost no skill required from the user.2 In the technological era we no longer primarily acquire tools—we buy devices, and then upgrade them, in the quest to shrink that “almost” one step further.

The dream of magic, which eluded generations of alchemists to their great perplexity and frustration, has come true for us.

This sounds like it should be good news—and if your grandparents would be amazed, as Altman implies, to see what we will soon be able to do (the best magic, by the way, is always ahead in the future), there’s no doubt the alchemists would be thrilled. There is also no doubt that technology has relieved us of all kinds of tedium, which is why it seems so futile to imagine rolling back technology in most respects.

But there are some places, and some times, where we should absolutely resist the magic of technology, and where, if it has managed to insinuate itself, we should move as quickly as possible to limit it. Because there is one thing magic is absolutely terrible for, and that is the formation of healthy, thriving human beings—or to use a better and deeper word, the formation of persons at their best.

The reason magic is so bad at, and so bad for, the formation of persons is simple. Persons are not formed instantly, and persons are not formed effortlessly. And persons do have to be formed. We arrive in the world less prepared for survival than any other creature, far more in need of shaping and development. All the more so if we go beyond the mere requirements of survival and aim for something like flourishing. Flourishing, for human beings, absolutely requires formation.

What does it take to become a reasonably mature, reasonably wise, reasonably loving person? Inescapably, a great deal of time. Not just the years of cognitive and social development from infancy through adolescence into early adulthood—roughly 25 years from birth to the maturation of the prefrontal cortex. But also years of friendship, long hours of conversation, even the pause between hearing and speaking that marks the truly personal moment of really listening. In her 2011 book Alone Together, Sherry Turkle writes of the seven-minute mark at which conversations take a turn—the point when the usual opening gambits, pleasantries about weather or sports, have run out, there is a palpable pause, and someone has to take a risk. It takes seven minutes for a conversation, a real one, to even begin.

It is at the seven-minute pause, Turkle observed in her lab, that many people take out their phones, implicitly signaling to each other that the conversation need not go any further or deeper, an exit ramp before the unpredictable and vulnerable words beyond the silence. That, of course, was more than a decade ago. What are the chances that conversations last even that long these days? When you have magic in your pocket, why wait to see what will happen with a person?

Why Match School And Student Rank? by Scott Alexander
Sunday August 6^th, 2023 at 6:16 PM

Astral Codex Ten

Matt Yglesias’ five-year old son asks: why do we send the top students to the best colleges? Why not send the weakest students to the best colleges, since they need the most help? This is one of those questions that’s so naive it loops back and becomes interesting again.

To avoid corrupting the youth, we might provide an optimistic answer: anyone can teach addition, any college math major can teach calculus, but it takes a world expert to teach ten-dimensional hypertopology. We want to take the few students smart enough to learn hypertopology and connect them to the few experts smart enough to teach it. But this seems false; most of the classes at top colleges are the same material that gets taught everywhere else; you don’t get into subjects that need world experts until postgrad.

Another answer, still somewhat optimistic: we want to maximize the chance of geniuses doing revolutionary work. If we give a mediocre student the world’s best writing teacher, and a genius a mediocre writing teacher, they might each write a pretty good novel. But if we give a mediocre student a mediocre writing teacher, and the genius the world’s best writing teacher, then we might get one mediocre novel and one work of staggering genius which revolutionizes literature forever. Likewise, if we connect the world’s most talented young scientists to the world’s best science teachers and labs, maybe they’ll cross some threshold of understanding where they can discover a cure for cancer. I think this is the best explanation that sticks to optimistic prosocial answers.

(Is is true? An oft-cited paper, Dale and Krueger, appears to find that, controlling for applicant characteristics, people who attend more selective college don’t earn more money later in life. Here’s a gesture at a challenge to these results, apparently supported by Dale and Krueger themselves, though I can’t find any more information. Earnings are a poor proxy for “teaches better” - it would be great to have something like value-add to GRE scores - but AFAIK no study like that exists.)

What if we’re more cynical, and believe in the signaling theory of education?

We could think of “the best college” as a self-fulfilling prophecy; for whatever reason, one college has gotten a reputation as the one whose signal is most valuable. Everyone naturally tries to get in there; if they fail, they go to the college with the next-best reputation, and so on. The system is stable; the “best” college will keep its reputation (since it gets the best students) and the best students will always want to go to the best college. If, as Matt’s son suggests, all the Ivies started accepting the worst students instead, an Ivy degree would soon become a signal that you’re bad, and employers would stop respecting it.

I heard a fascinating variation of this hypothesis from Matt Christman of Chapo Trap House: elite colleges are machines for laundering privilege.

That is: Harvard accepts (let’s say) 75% smart/talented people, and 25% rich/powerful people. This is a good deal for both sides. The smart people get to network with elites, which is the first step to becoming elite themselves. And the rich people get mixed in so thoroughly with a pool of smart/talented people that everyone assumes they must be smart/talented themselves. After all, they have a degree from Harvard!

The most blatant form of this obfuscation: suppose you own a very successful family business. You can leave your son your fortune, you can leave him the business, you can leave him your mansion, but you can’t (directly) leave him an aura of having deserved all these things. What you can do is make a $10 million donation to Harvard in exchange for them accepting your son. Your son gets a Harvard degree, a universally-recognized sign of being a highly meritorious person. Then when you leave him the business, everyone will agree he deserves it. Who said anything about nepotism? Leaving a Harvard graduate in control of your business is an excellent decision!

This happens a little, but I think it mostly isn’t this obvious. More often the transactions are for abstract goods: prestige, associations, favors. The Maharaja of Whereverstan sends his daughter to Harvard so that she appears meritorious. In exchange, Harvard gets the credibility boost of being the place the Maharaja of Whereverstan sent his daughter. And Harvard’s other students get the advantage of networking with the Princess Of Whereverstan. Twenty years later, when one of them is an oil executive and Whereverstan is handing out oil contracts, she puts in a word with her old college buddy the Princess and gets the deal. It’s obvious what the oil executive has gotten out of this, but what does the Princess get? I think she gets the right to say she went to Harvard, an honor which is known to go mostly to the meritorious.

People ask why Harvard admissions can still be bribed or influenced by the rich or well-connected. This is the wrong question: the right question is why they ever give spots based on merit at all. The answer is: otherwise the scheme wouldn’t work. The point of a money-laundering operation is to take in both fairly-earned and dirty money, then mix them together so thoroughly that nobody can tell which is which. Likewise, the point of a privilege-laundering operation is to take in both fairly-earned and dirty privilege, then stamp both with a Harvard degree. “Fairly-earned privilege” means all the brilliant talented ambitious youngsters admitted on the basis of their SAT scores and grades and impressive accomplishments; “dirty privilege” means the kids of various old-money aristocrats, foreign potentates, and ordinary super-rich people. Colleges mix them together, with advantages for both groups.

Is this good or bad? It’s good insofar as it provides a justification for making some elite positions dependent on merit and accessible to anyone, but bad insofar as it helps defend and obfuscate the ones that aren’t. It’s good if you think it’s good for all the elites (meritocratic and otherwise) to know each other and be on the same page; it’s bad if you don’t want them to be (maybe because it helps them oppress people more efficiently).

I expect that without such a system the elites would do their own thing without any concession to merit whatsoever - so maybe it beats the alternative.

Read the whole story

tresat

699 days ago

reply

The Modern Landscape of Harm by Jeremiah
Monday May 15^th, 2023 at 7:04 AM

We Are Not Saved

If you prefer to listen rather than read, this blog is available as a podcast here. Or if you want to listen to just this post:

Or download the MP3

I.

Though life has existed on Earth for billions of years, it’s only in the last few hundred that one form of life (i.e. humans) has thought to worry about the harms it might inflict on other forms of life (i.e. the birds, the bees, and the trees).

We call this environmentalism. By all appearances, it is a good thing. (The worry, not necessarily every action that follows from that worry.) It’s also a very recent thing. It makes up one part of a general movement to consider the harms caused by our actions. Because this idea is so recent, we struggle to strike the correct balance between massive overreaction to minuscule harms and completely ignoring potential catastrophes.

The push to more deeply consider the harms caused by our actions, policies, and decisions plays out everywhere, but the difficulties and trade-offs are starkest in the environmental movement. In the past people worried about trade-offs — they appear as early as the Epic of Gilgamesh — but only insofar as it harmed them. If we kill all the forest creatures, what will we eat? If we cut down all the trees what will we build with? Past peoples were fine with massive environmental damage if the benefit was clear. A good example would be the use of fire by the Plains Indians. They were constantly setting fires in order to create vast grazing territory for the bison upon which they relied. Though the constant burning kept trees from growing and presumably killed anything not quick enough to escape, like snakes, it was good for the bison and what was good for the bison was good for the Indian tribes.

Once you start caring about snakes, everything gets significantly more difficult. Certainly the snakes don’t care about us. In fact for 99.9999% of the time life has been on the Earth there was no attempt by any species to mitigate the harm it was causing to the environment. What’s more, during the remaining 0.0001%, 95% of that was spent caring about harms only selfishly. We happen to exist in the 0.000005% of history where we care about the harm we cause even if such harms ultimately benefit us.

Why do we care now when we’ve spent so much time not caring? I think many people would argue that it’s because of our heightened sense of morals. And I’m sure that this is part of it, but I’d argue that it’s the smallest part of it, that other factors predominate.

Of far greater consequence is our desire to signal. Historically we might want to signal health or wealth to encourage people to mate with us. But these days — with both widespread health and more than sufficient wealth — many of our signaling efforts revolve around virtue. There is virtue in not being selfish, of considering the impact our actions have not merely on ourselves but on the world as a whole. But signaling virtue doesn’t indicate a heightened morality, only exercising virtue does, and I fear we do far more of the former than the latter.

To the extent that we are able to act unselfishly, modern abundance plays a large role there as well. In the past people didn’t worry about the environmental harm caused by their actions because they had no latitude for that worry. A subsistence farmer lacks the time to worry about whether his farming caused long term pollution. If he did decide to worry about it, there was almost certainly very little he could do about it without imperiling his survival. In other words, he did what he had to do and had no room to do otherwise.

Of all the elements which contribute to this recent increase in care the one I’m most interested in is the expansion in the scale. We’re capable of causing enormous harm: warming the world with carbon dioxide, ravaging the world with nuclear weapons, and transforming the world with omnipresent microplastics. On the flip side, we’re also capable of doing extraordinary things to mitigate those harms. We can spray sulfur dioxide into the upper atmosphere and cool the world down. We can launch powerful lasers into the heavens and (in theory) shoot down nuclear missiles in flight. We can genetically engineer bacteria that eat plastics and release those bacteria into the wild. But all of these things have the potential to cause other, different harms.

Our concern about large scale harms is mirrored by an increase in concern for small scale harms as well. We take offense over minor slights, and attempt to protect our children not only from harm, but also minor discomfort. We spend the majority of our time in climate controlled comfort. Summoning food and entertainment whenever the whim strikes us. Banishing inconvenience at every turn.

If we decided to graph the recent changes to the harm landscape. We would start by imagining the classic bell curve with frequency on the y-axis and severity on the x-axis. This is what harm looked like historically. We didn’t have the power to cause large harms, and we didn’t have the time and energy to even identify smaller harms.

Over the last few centuries progress has allowed us to eliminate numerous harms. Starvation is a thing of the past. Violence has markedly declined, along with bullying and other forms of abuse. In effect we’ve whittled down the hump in the middle. As we have done this our ability to both cause and notice harm on the tails has gotten much greater. On the right hand are the catastrophes we’re now capable of causing. On the left hand is snowplow parenting, microaggressions, and cancellations.

When we pull all of this together it paints quite the picture. The landscape is radically different from what it was in the past. We have created whole new classes of harms. Some are quite large, others are rather small. Our ability both to generate and mitigate harms is greater than it’s ever been, to an extent that’s almost hard to comprehend. What are we to do in this vastly different landscape?

II.

I was already working on this post when a friend sent me the answer. More accurately it was included in a newsletter he recommended I start reading. The newsletter is Not Boring by Packy McCormick. He’s one of those people that in a certain subculture is so well known that people speak about him on a first name basis. I had never heard of him (or if I have, it didn’t stick in my memory). I haven’t been following him long enough to know if he’s mostly right, mostly wrong, or always wrong. (You may notice I left out “always right”. That’s because no one is always right.) The answer to my dilemma came nestled in a link roundup he sent out.

(4) Against Safetyism

Byrne Hobart and Tobias Huber for Pirate Wires

Now, whether we think that an AI apocalypse is imminent or the lab-leak hypothesis is correct or not, by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate. In a way, this makes sense: creating a new technology and deploying it widely entails a definite vision for the future. But a focus on the risks means a definite vision of the past, and a more stochastic model of what the future might hold. Given time’s annoying habit of only moving in one direction, we have no choice but to live in somebody’s future — the question is whether it’s somebody with a plan or somebody with a neurosis.

Call it safetyism. Risk aversion. Doomerism. Call it whatever you want. (We’ll call it safetyism for consistency’s sake). But freaking out about the future, and letting that freakout prevent advancement has become an increasingly popular stance. Pessimists sound smart, optimists make money. Safetyists sound smart, optimists make progress.

Friend [sic] of the newsletter, Byrne Hobart, and Tobias Huber explain why safetyism is both illogical and dangerous. These two quotes capture the crux of the argument:

Obsessively attempting to eliminate all visible risks often creates invisible risks that are far more consequential for human flourishing.

Whether it’s nuclear energy, AI, biotech, or any other emerging technology, what all these cases have in common is that — by obstructing technological progress — safetyism has an extremely high civilizational opportunity cost. [emphasis original]

We worry about the potential risks of nuclear energy, we get the reality of dirtier and more deadly fossil fuels. Often, the downsides created by safetyism aren’t as clear as the nuclear example: “by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate.” While we worry about AI killing us all, for example, millions will die of diseases that AI could help detect or even cure.

This isn’t a call to scream YOLO as we indiscriminately create new technologies with zero regards for the consequences, but it’s an important reminder that trying to play it safe is often the riskiest move of all.

I was being sarcastic when I said that this was the answer, though it’s certainly an answer. I included it, in its entirety, because it illustrates the difficulties of rationally dealing with the new landscape of harm.

To start with I’m baffled by their decision to use “safetyism” as their blanket term for this discussion. Safetyism was coined by Jonathen Haidt and Greg Lukianoff in the book The Coddling of the American Mind. And it’s used exclusively to refer to the increased attention to harm that’s happening on the left end of the graph. When Packy and the original authors appropriate safetyism as their term they lump together the left hand side of the graph with the right. Whether intentional or not, the effect is to smear those people who are worried about the potential catastrophes by lumping them in with the people who overreact to inconsequential harms. I understand why it might have happened, but it reflects a pretty shallow analysis of the issue.

To the extent that Packy, Hobart, and Huber lump in people worried about AI Risk with people who worry about being triggered, they construct and attack a strawman. As originally used by Haidt and Lukianoff, all people of good sense agree that safetyism is bad. Certainly I’ve written several posts condemning the trend and pointing out its flaws. No one important is trying to defend the left side of the graph. It’s tempting to dismiss Packy, et. al.’s point because of this contamination, but we shouldn’t. If we dismiss what they’re saying about safetyism and its associated sins, we miss the interesting things they’re saying about the right side of the graph. The side where catastrophe may actually loom. There’s some gems in that excerpt and some lingering errors. Let’s take Packy’s two favorite quotes:

Obsessively attempting to eliminate all visible risks often creates invisible risks that are far more consequential for human flourishing.

Whether it’s nuclear energy, AI, biotech, or any other emerging technology, what all these cases have in common is that — by obstructing technological progress — safetyism has an extremely high civilizational opportunity cost.

Starting with the errors. Those people who are concerned with large catastrophic risks are not “Obsessively attempting to eliminate all visible risks”. This is yet another straw man. What these people have recognized is that our technological power has vastly increased. The right end of the curve has gotten far bigger. This has increased not only our ability to cause harm, but also our ability to mitigate that harm.

As an example, we have the power to harness the atom. Yes, some people are trying to stop us from doing that even if we want to safely harness it to produce clean energy. They can do that because it turns out that the same progress which gave us the ability to build nuclear reactors also gave us the awesome and terrible government bureaucracy which has regulated them into non-existence. What I’m getting at, is that if we’re just discussing potential harm and harm prevention we’re missing most of the story. This is a story of power. This is a story about the difference between 99.9999% of history and the final 0.0001%. And the question which confronts us at the end of that history: How can we harness our vastly expanded power?

Packy urges us to be optimistic and to embrace our power. He contends that as long as we have a plan we will overcome whatever risks we encounter. This is farcical for three reasons:

1. Planning for the future is difficult (as in bordering on impossible).
2. There is no law of the universe that says risks will always be manageable
3. Everyone has a different plan for how our power should be used. There’s still a huge debate to be had over which path to take.

There is no simple solution to navigating the landscape of harm. No obvious path we can follow. No guides we can rely on. We have to be wise, exceptionally so. Possibly wiser than we’re capable of.

I understand that offering the advice “Be wise!” is as silly as Packy saying, that they’re not advising “zero regard” they’re advising some regard. How much? Well not zero… You know the right amount of regard.

So let me illustrate the sort of wisdom I’m calling for with an example. Hobert and Huber assert:

Now, whether we think that an AI apocalypse is imminent or the lab-leak hypothesis is correct or not, by mitigating or suppressing visible risks, safetyism is often creating invisible or hidden risks that are far more consequential or impactful than the risks it attempts to mitigate.

Let’s set aside discussion of AI apocalypses, there’s been quite enough of that already, and examine the lab-leak hypothesis. I’m unaware of anyone using the possibility of a lab-leak to urge that all biotechnology be shut down. If someone is, then the “wise” thing to do would be to ignore them. On the other hand there are lots of people who use the lab-leak possibility to urge a cessation of gain of function research. Is not this “wise”? I have seen zero evidence that gain of function research served a prophylactic role with COVID or any other disease for that matter. Would it not then be wise to cess such research?

Yes, gain of function research might yet provide some benefit. And the millions of people who died from COVID might not stem from a lab-leak. We have two “might”s, two probabilities. And it requires wisdom to evaluate which is greater. It requires very little wisdom to lump the lab-leak hypothesis in with the AI apocalypse and then gesture vaguely towards invisible risks and opportunity costs. To slap a label of “safetyism” or “doomerism” on both and move on. We need to do better.

I admit that I’ve used a fairly easy example. There are far harder questions than whether or not to continue with gain of function research. But if we can’t even make the right decision here, what hope do we have with the more difficult decisions?

If there is to be any hope it won’t come from trivial rules, pat answers and cute terms. True, it won’t come from over-reacting either. But when all is said and done, overreactions worry me less than blithe and hasty dismissals.

The landscape of harm is radically different from what it once was. Nor has it stopped changing, rather it continues to accelerate. Navigating this perpetually shifting terrain requires us to consider each challenge individually, each potential harm as a separate complicated puzzle. Puzzles which will test the limits of our wisdom, require all of our prudence, and ask from us all of our cunning and guile.

When I was a boy my father would do seemingly impossible things. I would ask him how, and he would always reply, “Skill and Cunning.” He did this because it was an answer that could apply to anything, even saving the world. We also need to do the seemingly impossible. I know it seems daunting, but perhaps you can start small, and advance the cause by donating. It doesn’t require a lot of skill and cunning, but it requires some.

Read the whole story

tresat

782 days ago

reply

Introducing Test Suites by Tom Tresansky
Thursday November 17^th, 2022 at 11:50 AM

The Gradle Blog

As projects grow in size and complexity and otherwise mature, they tend to accumulate a large collection of automated tests. Testing your software at multiple levels of granularity is important to surface problems quickly and to increase developer productivity.

In Gradle 7.3, released November 2021, the Gradle team introduced a new feature called Declarative Test Suites. Using this feature makes it much easier to manage different types of tests within a single Gradle JVM project without worrying about low level “plumbing” details.

Why Test Suites?

Normally - whether or not you’re practicing strict Test Driven Development - as you develop a project you will continuously add new unit tests alongside your production classes. By convention, for a Java project, these tests live in src/test/java:

These unit tests ensure your classes behave correctly in isolation from the very beginning of your project’s lifecycle. At some point later in development, you will be ready to test how your classes work together to create a larger system using integration tests. Later still, as a final step in certifying that your project works as designed, you will probably want to run the entire system in end-to-end tests which check functional requirements, measure performance, or otherwise confirm its readiness for release.

There are a lot of error-prone details you need to consider in this process. Test Suites was created to improve this situation, in response to the hardships detailed below.

Considerations when setting up additional tests

Varied test goals often involve different and incompatible patterns. At a minimum you’ll want to organize your test code by separating tests into different directories for each goal:

But separating the source files is only the beginning. These types of tests might require different preconditions to be met prior to testing, utilize entirely different runtime and compile time dependencies, or interact with different external systems. The very testing framework (such as JUnit 4 or 5, TestNG or Spock) used to write and run each group of tests could be different.

To correctly model and isolate these differences in Gradle, you’ll need to do more than just separate the tests’ source files. Each group of tests will require its own:

Separate SourceSet, that will provide a distinct Configuration for declaring dependencies for the group to compile and run against. You want to avoid leaking unnecessary dependencies across multiple groups of tests, yet still automatically resolve any shared dependencies needed for test compilation and execution.
Support for using different testing frameworks to execute the tests in each group.
A Test task for each group which might have different task dependencies to provide setup and finalization requirements. You also may want to prevent every type of test from running every time you build the project (for instance, to save any long-running smoke tests for when you think you’re ready to publish).

Each component you create in your build scripts to support these requirements must be properly wired into the Gradle project model to avoid unexpected behavior. Accomplishing this is error-prone, as it requires a thorough understanding of low-level Gradle concepts. It also requires modifications and additions to multiple blocks of the DSL.

That’s hardly ideal; setting up testing is a single concern and the configuration for it should be co-located and easily discoverable within a buildscript.

Wiring integration tests without Test Suites

It’s helpful to look at a complete example. Before diving in, take a moment to think about how you would create a separate set of integration tests within a project.

Before Gradle 7.2, the proper way to set up integration tests went like this (note that while this example is written in the Gradle Kotlin DSL, the Groovy setup is very similar):

We need to create a SourceSet that will in turn create the associated Configurations we’ll need later. This is low-level plumbing that we shouldn’t have to focus on.
We wire the new test configurations to the test existing configurations, to re-use their dependency declarations. We might not always want to do this.
We need to register a Test task to run our tests.
We ought to add the new task to the appropriate group and set a description - not technically necessary, but a best practice to make the task discoverable and properly locate it in reports.
We will write our tests using the latest version of JUnit 5.
We need to set up the classpath of our new task - this even more low-level plumbing.
We need to tell our new task where the classes it runs live.
Finally, we add the necessary JUnit dependencies to the built-in test configurations, which our new configurations extend.¹
The integration tests have an implementation dependency on the current project’s production classes - this looks somewhat extraneous in this block that also configures the production dependencies for the project.

Did you get all that?

The bottom line is that this is simply too complex. You shouldn’t have to be a build expert just to set up thorough testing!

Test Suites - a better way forward

Thinking about the difficulties involved in properly handling this scenario, we realized the current situation was inadequate. We want to support build authors defining multiple groups of tests with different purposes in a declarative fashion while operating at a high level of abstraction. Although you could previously write your own plugin (or use a pre-existing solution such as Nebula) to hide the details of this and mitigate the complexity, testing is such an ubiquitous need that we decided to provide a canonical solution within Gradle’s own core set of plugins. Thus, Test Suites was born.

The JVM Test Suite Plugin provides a DSL and API to model exactly this situation: multiple groups of automated tests with different purposes that live within a single JVM-based project. It is automatically applied by the java plugin, so when you upgrade to Gradle >= 7.3 you’re already using Test Suites in your JVM projects. Congratulations!

Here is the previous example, rewritten to take advantage of Test Suites:

All Test Suite configuration is co-located within a new testing block, of type TestingExtension.
Maintaining backwards compatibility with existing builds that already use the test task was an important requirement for us when implementing Test Suites. We’ve associated the existing test task with a default Test Suite that you can use to contain your unit tests.
Instead of representing the relationship between unit tests and integration tests by making their backing Configurations extend one another, we keep the machinery powering two types of tests completely separate. This allows for more fine-grained control over what dependencies are needed by each and avoids leaking unnecessary dependencies across test types.²
Because each Test Suite could serve very different purposes, we don’t assume they have a dependency on your project (maybe you are testing an external system), so you have to explicitly add one to have access to your production classes in your tests.³

By making use of sensible defaults, Gradle is able to simplify your buildscript significantly. This script manages to set up a mostly equivalent build as the original but in far fewer lines of code. Gradle adds a directory to locate your test code and creates the task that run the tests using the suite name as the basis. In this case, you can run any tests you write located in src/integrationTestJava by invoking gradlew integrationTest.

Test Suites aren’t limited to Java projects, either. Groovy, Kotlin, Scala, and other JVM-based languages will work similarly when their appropriate plugins are applied. These plugins all also automatically apply the JVM Test Suite Plugin, so you can begin adding tests to src/<SUITE_NAME>/<LANGUAGE_NAME> without doing any other configuration.

Behind the scenes

This short example takes care of all the considerations of the pre-Test Suites example above.
But how does it work?

When you configure a suite in the new DSL, Gradle does the following for you:

Creates a Test Suite named integrationTest (typed as a JvmTestSuite).
Creates an SourceSet named integrationTest containing the source directory src/java/integrationTest. This will be registered as a test sources directory, so any highlighting performed by your favorite IDE will work as expected.
Creates several Configurations derived from the integrationTest source set, accessible through the Suite’s own dependencies block: integrationTestImplementation, integrationTestCompileOnly, integrationTestRuntimeOnly, which work like their similarly named test configurations.
Adds dependencies to these configurations necessary for compiling and running against the default testing framework, which is JUnit Jupiter.
Registers a Test task named integrationTest which will run these tests. The most important difference is that using Test Suites will fully isolate any integration test dependencies from any unit test dependencies. It also assumes JUnit Platform as the testing engine for new test suites, unless told otherwise.⁴

After adding just this minimal block of DSL, you are ready to write integration test classes under src/integrationTest/java which are completely separate from your unit tests, and to run them via a new integrationTest task. No contact with low-level DSL blocks like configurations is required.

Try it out now

Test Suites is still an @Incubating feature as we explore and refine the API, but it’s here to stay, and we encourage everyone to try it out now. For a new project, the easiest way to get started is to use the Gradle Init task and opt-in to using incubating features when prompted; this will generate a sample project using the new DSL.

Customizing your Suites

The rationale behind Test Suites, just like Gradle in general, is to abstract the details of configuration and use sensible conventions as defaults - but to also allow you to change those defaults as necessary.

Configure the built-in Test Suite to use a different testing framework using one of several convenience methods available.⁵
Add a non-project dependency for use in compiling and running a Test Suite.
Add a dependency which is only used when running tests (in this case, a logging implementation).
Access the integrationTest task which will be created for this Suite to configure it directly (and lazily⁶), within the testing block.
Define an additional performanceTest Suite using the bare minimum DSL for a default new Test Suite. Note that this suite will not have access to the project’s own classes, or be wired to run as part of the build without calling its performanceTest task directly.⁷
Suites can be used as task dependencies - this will cause the check task to depend on the integrationTest task associated with the integrationTest Test Suite - the same task we configured in <4>.

For more Test Suite custom configuration examples, see the JVM Test Suite Plugin section of the Gradle user guide. For adding an additional test suite to a more complex and realistic build, see the Multi-Project sample.

The future of testing in Gradle

We have many exciting ideas for evolving Test Suites in the future. One major use case we want to support is multidimensional testing, where the same suite of tests runs repeatedly in different environments (for instance, on different versions or releases of the JVM). This is the reason for the seemingly extraneous targets block seen in the examples here and in the user guide. Doing this will likely involve closer Test Suite integration with JVM Toolchains.

You’ll also definitely want to check out the Test Report Aggregation Plugin added in Gradle 7.4, to see how to easily aggregate the results of multiple Test task invocations into a single HTML report. Consolidating test failures and exposing more information about their suites in test reporting is another potential area of future development.

These and other improvements are currently being discussed and implemented, so be sure to keep up to date with this blog and the latest Gradle releases.

Happy testing!

The version is left off junit-jupiter-engine because junit-jupiter-api will manage setting it - but this might look like a mistake. ↩
In the forthcoming Gradle 8.0, this block will use a new strongly-typed dependencies API, which should provide a better experience when declaring test dependencies in your favorite IDE. Watch our releases page for more details. ↩
Note that in the forthcoming Gradle 7.6, instead of using project you’ll need to call the new project() method here. ↩
The default test suite retains JUnit 4 as its default runner, for compatibility reasons. ↩
The default test suite is also implicitly granted access to the production source’s implementation dependencies for the same reason. When switching testing frameworks, the new framework’s dependencies are automatically included. ↩
See the Lazy Configuration section of the user guide for more details on lazy configuration. ↩
Perhaps these tests are meant to exercise a live deployment in a staging environment. ↩

Read the whole story

tresat

961 days ago

reply

Introducing Configuration Caching by Paul Merlin
Sunday August 16^th, 2020 at 8:15 PM

The Gradle Blog

This is the second installment in a series of blog posts about incremental development — the part of the software development process where you make frequent small changes. We will be discussing upcoming Gradle build tool features that significantly improve feedback time around this use case. In the previous post, we introduced file system watching for Gradle 6.5.

In Gradle 6.6 we are introducing an experimental feature called the configuration cache that significantly improves build performance by caching the result of the configuration phase and reusing this for subsequent builds. Using the configuration cache, Gradle can skip the configuration phase entirely when nothing that affects the build configuration, such as build scripts, has changed.

On top of that, when reusing the configuration cache, more work is run in parallel by default and dependency resolution is cached. The isolation of the configuration and execution phases, and the isolation of tasks, make these optimizations possible.

Note that configuration caching is different from the build cache, which caches outputs produced by the build. The configuration cache captures only the state of the configuration phase. It’s also separate from IDE sync and import processes that do not currently benefit from configuration caching.

In order to cache the configuration result, Gradle applies some strong requirements that plugins and build scripts need to follow. Many plugins, including some core Gradle plugins, do not meet these requirements yet. Moreover, support for configuration cache in some Gradle features is not yet implemented. Therefore, your build and the plugins you depend on will likely require changes to fulfil the requirements. Gradle will report problems found with your build logic to assist you in making your build work with the configuration cache.

The configuration cache is currently highly experimental and not enabled by default. We release it early in order to collect feedback from the community while we work on stabilizing the new feature.

That being said, we are committed to making the configuration cache production-ready with the ultimate goal to enable it by default. You can expect that it will get significantly better in the next Gradle releases.

Configuration caching in action

It is recommended to get started with the simplest task invocation possible. Running help with the configuration cache enabled is a good first step:

Running help

Here it is in action from Android Studio, deploying the middle sized Santa Tracker Android application to an Android emulator, making changes to how the snowflakes move on the screen and applying the changes to the emulator:

Build time improvements

The practical impact of the feature depends on a number of factors, but in general it should result in a significant reduction of build time. We’ve seen drastic improvements on large real world builds, let’s have a look at some of them.

Java builds

On a large Java enterprise build with ~500 subprojects and complex build logic, running :help went from 8 seconds down to 0.5 seconds. That’s 16 times faster. Of course, running :help isn’t that useful but it gives an idea of the saved time for the configuration phase. On the same build, running assemble after changing some implementation code went from ~40 seconds down to ~13 seconds, that’s ~3 times faster.

Now let’s look at the gradle/gradle build. It has a hundred subprojects and a fairly complex build logic. You can use it to reproduce these results. Running a test after making an implementation change goes from 16.4 seconds down to 13.8 seconds, skipping the ~2 seconds configuration phase:

In blue you can see the configuration phase, in green the execution phase. On the left, without the configuration cache enabled, configuration phase takes more than 2 seconds and goes down to 214 milliseconds with the configuration cache on the right.

You can also see that the execution phase benefits from the configuration cache but is dominated by compiling and running the tests in that case.

Android builds

Another notable example is a very large real world Android build with ~2500 subprojects. On that build, running :help went from ~25 seconds down to ~0.5 seconds, that’s 50 times faster! Running a more useful build such as assembling the APK after changing some implementation, goes from ~50 seconds down to ~20 seconds, almost 3 times faster.

In the Santa Tracker Android project, we’ve seen the following improvements in the build time for a small implementation change:

The configuration phase is cut in half, from 129 milliseconds down to 63.5 milliseconds. You can also see that the execution phase is accelerated by the configuration cache due to more task parallelisation and caching of dependency resolution.

If you want to reproduce with the above builds or measure your own builds you can use the Gradle Profiler by following the instructions in this repository. Note that the Gradle Profiler will show a slightly different picture, closer to the experience from IDEs, because both use the Gradle Tooling API. This skips the fixed cost of starting the Gradle client JVM that happens when you use the command line.

How does it work?

When the configuration cache is enabled and you run Gradle for a particular set of tasks, for example by running ./gradlew check, Gradle checks whether a configuration cache entry is available for the requested set of tasks. If available, Gradle uses this entry instead of running the configuration phase. The cache entry contains information about the set of tasks to run, along with their configuration and dependency information.

The first time you run a particular set of tasks, there will be no entry in the configuration cache for these tasks and so Gradle will run the configuration phase as normal:

Run init scripts.
Run the settings script for the build, applying any requested settings plugins.
Configure and build the buildSrc project, if present.
Run the builds scripts for the build, applying any requested project plugins.
Calculate the task graph for the requested tasks, running any deferred configuration actions.

Following the configuration phase, Gradle writes the state of the task graph to the configuration cache, taking a snapshot for later Gradle invocations. The execution phase then runs as normal. This means you will not see any build performance improvement the first time you run a particular set of tasks.

When you subsequently run Gradle with this same set of tasks, for example by running ./gradlew check again, Gradle will load the tasks and their configuration directly from the configuration cache, skip the configuration phase entirely and run all tasks in parallel. Before using a configuration cache entry, Gradle checks that none of the “build configuration inputs”, such as build scripts, for the entry has changed. If a build configuration input has changed, Gradle will not use the entry and will run the configuration phase again as above, saving the result for later reuse.

Requirements and limitations

In order to capture the state of the task graph into the configuration cache and reload it again in a later build, Gradle applies certain requirements to tasks and other build logic. Each of these requirements is treated as a configuration cache “problem” and by default causes the build to fail if violations are present.

If any problem is found caching or reusing the configuration, an HTML report is generated to help you diagnose and fix the issues.

Problems Report

If you encounter such problems, your build or the Gradle plugins in use probably need to be adjusted. See the Troubleshooting section of the documentation for more information about how to use this report.

You can find the set of supported core plugins and the set of not yet implemented Gradle features in the configuration cache documentation.

The latest Android Gradle Plugin preview, 4.2.0-alpha07 at the time of writing, works with the configuration cache. The latest Kotlin Gradle Plugin, 1.4.0-RC at the time of writing, works on simple JVM projects emitting some problems. Kotlin 1.4.20 is the current target for a fully compliant plugin. This information can be found at gradle/gradle#13490 alongside the status of the most used community plugins.

Try out configuration caching

If you would like to see how your project benefits from configuration caching, here is how you can try it out.

First, make sure you run Gradle 6.6 or later. In order to enable configuration caching, you need to pass --configuration-cache on the command line. Alternatively, add

org.gradle.unsafe.configuration-cache=true

to the gradle.properties file in the project directory or in the Gradle user home, so you don’t need to pass the command-line option on every build. That’s it: the next build will run with configuration caching enabled.

Keep in mind that you will only see performance improvements when subsequent builds with the same requested tasks have the feature enabled. If you want to benchmark your build, you can do it easily with Gradle Profiler by following the instructions in this repository.

If you run into any problems, check out the supported core plugins or community plugins, learn how to troubleshoot in the user manual. You can also read our recommended adoption steps.

If you still have problems open a Gradle issue if you think the problem is with Gradle, or check the supported community plugins issue at gradle/gradle#13490. You can also get help in the #configuration-cache channel in the Gradle community Slack.

Read the whole story

tresat

1784 days ago

reply

GPT-2 As Step Toward General Intelligence by Scott Alexander
Tuesday March 5^th, 2019 at 9:29 PM

Slate Star Codex

A machine learning researcher writes me in response to yesterday’s post, saying:

I still think GPT-2 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.

I resisted the urge to answer “Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.”

But I think it would have been true.

A very careless plagiarist takes someone else’s work and copies it verbatim: “The mitochondria is the powerhouse of the cell”. A more careful plagiarist takes the work and changes a few words around: “The mitochondria is the energy dynamo of the cell”. A plagiarist who is more careful still changes the entire sentence structure: “In cells, mitochondria are the energy dynamos”. The most careful plagiarists change everything except the underlying concept, which they grasp at so deep a level that they can put it in whatever words they want – at which point it is no longer called plagiarism.

GPT-2 writes fantasy battle scenes by reading a million human-written fantasy battle scenes, distilling them down to the concept of a fantasy battle scene, and then building it back up from there. I think this is how your mom (and everyone else) does it too. GPT-2 is worse at this, because it’s not as powerful as your mom’s brain. But I don’t think it’s doing a different thing. We’re all blending experience into a slurry; the difference is how finely we blend it.

“But don’t humans also have genuinely original ideas?” Come on, read a fantasy book. It’s either a Tolkien clone, or it’s A Song Of Ice And Fire. Tolkien was a professor of Anglo-Saxon language and culture; no secret where he got his inspiration. A Song Of Ice And Fire is just War Of The Roses with dragons. Lannister and Stark are just Lancaster and York, the map of Westeros is just Britain (minus Scotland) with an upside down-Ireland stuck to the bottom of it – wake up, sheeple! Dullards blend Tolkien into a slurry and shape it into another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, and the artistic corpus into a slurry and form it into an entirely new genre. Again, the difference is how finely you blend and what spices you add to the slurry.

“But don’t scientists have geniunely original ideas?” Scientists are just finding patterns in reality nobody has ever seen before. You say “just a pattern-matcher”, I say “fine, but you need to recognize patterns in order to copy them, so it’s necessarily a pattern-recognizer too”. And Einstein was just a very good pattern-recognizer.

“But don’t humans have some kind of deep understanding that pattern-recognition AIs don’t?”

Here’s a fun question: the human brain is undoubtedly the most powerful computer in the known universe. In order to do something as simple as scratch an itch it needs to solve exquisitely complex calculus problems that would give the average supercomputer a run for its money. So how come I have trouble multiplying two-digit numbers in my head?

The brain isn’t directly doing math, it’s creating a model that includes math and somehow doing the math in the model. This is hilariously perverse. It’s like every time you want to add 3 + 3, you have to create an entire imaginary world with its own continents and ecology, evolve sentient life, shepherd the sentient life into a civilization with its own mathematical tradition, and get one of its scholars to add 3 + 3 for you. That we do this at all is ridiculous. But I think GPT-2 can do it too.

Janelle Shane asks GPT-2 its ten favorite animals:

Prompt: My 10 favorite animals are: 1.

My ten favorite animals are:
1. Zebras with a white scar on the back
2. Insiduous spiders and octopus
3. Frog with large leaves, hopefully black
4. Cockatiel with scales
5. Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
3. Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
4. Black and white desert crocodiles living in sunlight
5. Zebra and many other pea bugs

Forget the particular animals for now (don’t worry, the razorbill with wings hanging about 4 inches from one’s face will be back in your nightmares). Notice the way it numbers its list: 1, 2, 3, 4, 5, 3, 4, 5. Last week the two-year-old child who lives next to me counted 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 8, 9, 7, 8, 9, 7, 8, 9 (his mother warns this is an inexact transcription). GPT-2 is instantiated on giant supercomputers; it’s a safe bet they could calculate the square root of infinity in a picosecond. But it counts more or less the same way as a two-year old. GPT-2 isn’t doing math. It’s doing the ridiculous “create a universe from first principles and let it do the math” thing that humans do in their heads. The fact that it counts so badly suggests it’s counting human-style, which makes it amazing that it can count at all.

I find something similar in this post from Tumblr user antinegationism, playing with the lobotomized public-release version of the system:

The American Association for Suicide Prevention and Life-Threatening Behavior (AAPSLB), a professional organization with an overarching goal of promoting suicide prevention, released a recent video encouraging the public to think before they act, on the public’s part. “When we say we’re the only reason to commit suicide, we’re saying they’re wrong. It’s the right thing to do,” said AAPSLB president Steven A. Anderson.

The American Association For Suicide Prevention And Life-Threatening Behavior is not a real organization; the AI made it up as the kind of organization that it thought would feature in a story like this. And AAPSLB is not quite the right way to acronymize the organization’s name. But it’s clearly an attempt at doing so. It’s very close. And nobody taught it how to do that! It’s not just that nobody programmed it in. It’s that nobody thought “Today I shall program an AI to learn how to acronymize on its own in an unsupervised way”. GPT-2 was just programmed to predict text from other text, nothing else. It’s second-level not programmed in. It just happened!

And, uh, it seems to have figured out how to translate things into French. This part is from the official paper:

We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format ENGLISH SENTENCE = FRENCH SENTENCE and then after a final prompt of ENGLISH SENTENCE = we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lampleet al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach(Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step.

In other words: GPT-2 is very bad at translating French into English. But the researchers were surprised to see it could do this at all, since they didn’t design it as translation software, didn’t ask it to learn translation, and didn’t show it any material in French. It seems to have picked up this ability from noticing a few naturally-occurring examples of French in English text:

And here’s One Weird Trick to make GPT-2 summarize articles:

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with k=2 which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only beginto approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points onthe aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.

Again, GPT-2 isn’t good at summarizing. It’s just surprising it can do it at all; it was never designed to learn this skill. All it was designed to do was predict what words came after other words. But there were some naturally-occurring examples of summaries in the training set, so in order to predict what words would come after the words tl;dr, it had to learn what a summary was and how to write one.

If you’re me, the obvious next question is “can it do poetry”. I am very interested to know if the real thing can manage this, but for now I will have to satisfy myself with the lobotomized public-release version. Given Alexander Pope’s Essay On Criticism as a prompt, it says:

O man, who pains the tongue magnified by tymolle,
Though all the fair melodies let no door go as dark as good art.
Yet it would have been, unless the mastermind is so full
Of his own want that this literary country might grow.
Perfect student, thou hast trodden out.
Happy students ruin the purity of what you write ‘Tis art,
For unordered use of decade,’ but art only failing:
A surprise, for new veins which build behind;

It understands there should be line breaks, it understands the approximate correct length of a line of iambic pentameter, it understands how to talk like an overeducated 18th-century dandy – but it doesn’t appreciate rhyme or meter. In retrospect this isn’t surprising; GPT has no idea words sound like anything; it would be shocked to learn anyone uses language as anything other than text strings.

But human poets haven’t appreciated rhyme or meter either for at least a century, so GPT-2 should be able to compose some wicked modern poetry. Friends-of-the-blog Chelsea V. and Qiaochu Y. got it to write a few more stanzas of Allen Ginsberg’s Moloch (human-curated to remove fluff and improve flow):

Moloch whose brief mind is the barrel of gleaming stars! Moloch whose soul is choices of fragrance! Moloch whose framing is glitz and half-cheerios!

Moloch whose daily life is private; his daily strength hoarse! Moloch whose soft desires become culture burning smoke in a palace of eyes! Moloch whose breeding in the caves sweeps wild and rippling flower-like shapes!

Moloch whose pleading is the breath of youth! Moloch whose prudence is the urine of generations! Moloch whose nerves are not privileged; movies abound in quiet caves; people are dressed like ministers! Moloch whose hope receives some help too. Moloch whose eyes are pitiful outstretched, but not blinded. Moloch whose lungs are burning with fire, but not a cautious electrical girl!

Moloch whose sanity is just filled with atoms, gluing beings in physics and complicated libraries. Moloch whose eyes are all innocent animals! Moloch whose sweat and efforts make pure the ocean with burning annihilation! Moloch whose fields dissolve to lightning-ravaged storage houses! Moloch whose eyes sit haunting up Scottish springs!

Moloch whose arms and men are Judah condemns and listens, having understood whom he was and what he man’s weaknesses oppose with iron. We will feast to it like Lord Kippa; we shall play with it like our Pokemon.

Moloch flickering under the pounds and sustaining more fights! Sports and yoga, equally greedy for wealth as for lives! Moloch who had to split his limbs and society into fifty-two atoms! Moloch who cannot survey duty! Moloch for I do not believe in humanity! Moloch for I don’t believe in god! Moloch who, fought Sins and made Sin out of Sin!

Moloch who left me in the glacier! Moloch who helped me to judge my behavior! Moloch who lured me out of the eternal anguish. Moloch who makes all my faults my undoings! Moloch whose deep conversation filled my soul with love and consolation! Moloch who curls up my brains and crumps my heads and screams my scream! Moloch who prepares oil in my hands and burns me alive! Moloch who cries out, “Why do men scream so loud?” Moloch who unleashes his poison in all rivers at high tide! Moloch whose ruins are filled with its cubits! Moloch who fiddles at insidious clockwork!

Aside from this being obviously one of the best things ever, what’s the point?

The point is, GPT-2 has faculties. It has specific skills, that require a certain precision of thought, like counting from one to five, or mapping a word to its acronym, or writing poetry. These faculties are untaught; they arise naturally from its pattern-recognition and word-prediction ability. All these deep understanding things that humans have, like Reason and so on, those are faculties. AIs don’t have them yet. But they can learn.

From the paper:

Prompt: Who was the author of The Art Of War?
Sun Tzu

Prompt: State the process that divides one nucleus into two genetically identical nuclei?
Mitosis

Prompt: Do you have to have a gun permit to shoot at a range?
No

Nobody told the model to learn Chinese history, cell biology, or gun laws either. It learned them in the process of trying to predict what word would come after what other word. It needed to know Sun Tzu wrote The Art Of War in order to predict when the words “Sun Tzu” would come up (often in contexts like “The Art of War, written by famous Chinese general…). For the same reason, it had to learn what an author was, what a gun permit was, etc.

Imagine you prompted the model with “What is one plus one?” I actually don’t know how it would do on this problem. I’m guessing it would answer “two”, just because the question probably appeared a bunch of times in its training data.

Now imagine you prompted it with “What is four thousand and eight plus two thousand and six?” or some other long problem that probably didn’t occur exactly in its training data. I predict it would fail, because this model can’t count past five without making mistakes. But I imagine a very similar program, given a thousand times more training data and computational resources, would succeed. It would notice a pattern in sentences including the word “plus” or otherwise describing sums of numbers, it would figure out that pattern, and it would end up able to do simple math. I don’t think this is too much of a stretch given that GPT-2 learned to count to five and acronymize words and so on.

Now imagine you prompted it with “P != NP”. This time give it near-infinite training data and computational resources. Its near-infinite training data will contain many proofs; using its near-infinite computational resources it will come up with a model that is very very good at predicting the next step in any proof you give it. The simplest model that can do this is probably the one isomorphic to the structure of mathematics itself (or to the brains of the sorts of mathematicians who write proofs, which themselves contain a model of mathematics). Then you give it the prompt P != NP and it uses the model to “predict” what the next step in the proof will be until it has a proof, the same way GPT-2 predicts the next word in the LotR fanfiction until it has a fanfiction.

The version that proves P != NP will still just be a brute-force pattern-matcher blending things it’s seen and regurgitating them in a different pattern. The proof won’t reveal that the AI’s not doing that; it will just reveal that once you reach a rarefied enough level of that kind of thing, that’s what intelligence is. I’m not trying to play up GPT-2 or say it’s doing anything more than anyone else thinks it’s doing. I’m trying to play down humans. We’re not that great. GPT-2-like processes are closer to the sorts of things we do than we would like to think.

Why do I believe this? Because GPT-2 works more or less the same way the brain does, the brain learns all sorts of things without anybody telling it to, so we shouldn’t be surprised to see GPT-2 has learned all sorts of things without anybody telling it to – and we should expect a version with more brain-level resources to produce more brain-level results. Prediction is the golden key that opens any lock; whatever it can learn from the data being thrown at it, it will learn, limited by its computational resources and its sense-organs and so on but not by any inherent task-specificity.

Wittgenstein writes: “The limits of my language mean the limits of my world”. Maybe he was trying to make a restrictive statement, one about how we can’t know the world beyond our language. But the reverse is also true; language and the world have the same boundaries. Learn language really well, and you understand reality. God is One, and His Name is One, and God is One with His Name. “Become good at predicting language” sounds like the same sort of innocent task as “become good at Go” or “become good at Starcraft”. But learning about language involves learning about reality, and prediction is the golden key. “Become good at predicting language” turns out to be a blank check, a license to learn every pattern it can.

I don’t want to claim this is anywhere near a true AGI. “This could do cool stuff with infinite training data and limitless computing resources” is true of a lot of things, most of which are useless and irrelevant; scaling that down to realistic levels is most of the problem. A true AGI will have to be much better at learning from limited datasets with limited computational resources. It will have to investigate the physical world with the same skill that GPT investigates text; text is naturally machine-readable, the physical world is naturally obscure. It will have to have a model of what it means to act in the world, to do something besides sitting around predicting all day. And it will have to just be better than GPT, on the level of raw power and computational ability. It will probably need other things besides. Maybe it will take a hundred or a thousand years to manage all this, I don’t know.

But this should be a wake-up call to people who think AGI is impossible, or totally unrelated to current work, or couldn’t happen by accident. In the context of performing their expected tasks, AIs already pick up other abilities that nobody expected them to learn. Sometimes they will pick up abilities they seemingly shouldn’t have been able to learn, like English-to-French translation without any French texts in their training corpus. Sometimes they will use those abilities unexpectedly in the course of doing other things. All that stuff you hear about “AIs can only do one thing” or “AIs only learn what you program them to learn” or “Nobody has any idea what an AGI would even look like” are now obsolete.

Read the whole story

tresat

2314 days ago

reply

Where the Magic Doesn’t Happen by Andy Crouch Thursday January 2nd, 2025 at 9:17 AM

Where the Magic Doesn’t Happen

Why Match School And Student Rank? by Scott Alexander Sunday August 6th, 2023 at 6:16 PM

The Modern Landscape of Harm by Jeremiah Monday May 15th, 2023 at 7:04 AM

Introducing Test Suites by Tom Tresansky Thursday November 17th, 2022 at 11:50 AM

Why Test Suites?

Considerations when setting up additional tests

Wiring integration tests without Test Suites

Test Suites - a better way forward

Behind the scenes

Try it out now

Customizing your Suites

The future of testing in Gradle

Introducing Configuration Caching by Paul Merlin Sunday August 16th, 2020 at 8:15 PM

Configuration caching in action

Build time improvements

Java builds

Android builds

How does it work?

Requirements and limitations

Try out configuration caching

GPT-2 As Step Toward General Intelligence by Scott Alexander Tuesday March 5th, 2019 at 9:29 PM

Where the Magic Doesn’t Happen by Andy Crouch
Thursday January 2^nd, 2025 at 9:17 AM

Why Match School And Student Rank? by Scott Alexander
Sunday August 6^th, 2023 at 6:16 PM

The Modern Landscape of Harm by Jeremiah
Monday May 15^th, 2023 at 7:04 AM

Introducing Test Suites by Tom Tresansky
Thursday November 17^th, 2022 at 11:50 AM

Introducing Configuration Caching by Paul Merlin
Sunday August 16^th, 2020 at 8:15 PM

GPT-2 As Step Toward General Intelligence by Scott Alexander
Tuesday March 5^th, 2019 at 9:29 PM