A Psychological Take on AGI Alignment

My understanding of AGI is, perhaps predictably, rooted in my understanding of human psychology.

There are many technical questions I can’t answer about why Artificial General Intelligence could easily be an existential risk for humanity. If someone points to our current Large Language Models and asks how they’re supposed to become a risk to humanity… hey, maybe they won’t. I’m a psych guy, not a techie. Sure, I have ideas, but it’s borrowed knowledge, well outside my wheelhouse.

But it only minimally matters to me whether AGI becomes an existential risk this decade or this century. Whether LLMs are the path to it or not, the creation of AGI isn’t ruled out by physics, so I’m confident it will come about sooner or later.

When it does, it could be the start of a utopian future of abundance the world has never seen before… but only if certain, very specific types of AGI are created. Many more types of AGI seem predictably likely to lead to ruin, and as far as I’m concerned, until this “alignment problem” is solved, humanity needs to take it a lot more seriously than it has been.

And I get why that’s hard for a lot of people to do, given the complexity and speculative nature of the threat. But as I said, my understanding of it is rooted in psychology, and I think that’s important, given that humans are the only general intelligence we know exists and can at least somewhat understand.

Is there some law that says an artificial intelligence has to work like a human brain does? Definitely not, and that’s more concerning, not less.

There’s a whole taxonomy in science fiction for different kinds of alien races and what sorts of relationships we can expect them to have with humans. Most sci-fi just defaults to the weird-forehead aliens of Star Trek, or the slightly more monstrous but still basically human aliens of Star Wars.

But “hard” sci-fi is where you’ll see authors really exploring what it might mean for a totally different evolutionary lineage to result in intelligent life, and long story short, no matter how the alien looks, cooperation depends on understanding and mutual values.

And humans can barely cooperate with each other despite sharing most of our genetics and basic building blocks of culture, like enjoying music and sugary food and smiling babies. If you try getting along with the equivalent of a sapient shark the exact way you would a human, you’re going to have a bad time.

(I have no inherent problem with the existence of non-human-like intelligences, but even if you don’t read science fiction, any study of Earth’s ecological history should make it clear why minds that care about completely different things pose existential risks to one another. I hope any sufficiently different, fully sapient minds exist outside our lightcone, where we can’t harm each other.)

But many people fail to track how possible “inhuman” AGI is, and I think it’s because there are four things most people, no matter how good they are at computer science, physics, philosophy, etc., largely do not understand about human psychology:

1) What motivates our actions.
2) What causes memes to be more/less effective.
3) How human biology affects both of those.
4) The role prediction plays in beliefs and actions.

So I’m going to go over each very quickly, and maybe someday I’ll write the full essays they deserve.

1) Human actions are informed by our ideas, but motivated by emotions and instincts we evolved for fitness in the ancestral environment. Our motivations are “coded in,” and felt through, our bodies.

This means outside of reflexes and habits, everything we deliberately choose to do follows some emotional experience or predicted emotional state-of-being.

Again, this isn’t to say ideas don’t matter. But they don’t matter unless they also evoke some feeling. When humans feel things less, whether through a neurological issue, a hormone imbalance, or a brain injury, their motivation to do things is directly affected.

No emotions = no deliberate actions, only instincts and reflexes.

2) Memes persist and spread through emotional drives, which bottom out in biological drives. Memes scaffold on genes.

Memes can scaffold off memes, but when memes override genes, they use emotions to motivate actions by rewiring what we find rewarding or aversive. This means the effectiveness of memes is still, to some degree, based on our biology.

If the ideas we learn don’t motivate us toward more adaptive actions as dictated by our biology and the broader memes of our culture, they will lose to ideas that do. But a creature with different biology or in a different context could find totally different ideas adaptive or non-adaptive!

3) Biology is the bedrock our values all build on. All the initial things we care about by default, like warmth, food, smiles, music, even green plants, are biologically driven.

Ideas introduce new things for us to care about, to the point where we each become unique individuals, blends of our genetics and the ideas we’re exposed to. But again, it’s all built on our biological drives.

So, tweak our hormones, neurotransmitters, maybe even gut biome? We will change. What we like, what we believe, what we’re motivated to do, all of it can change with minor tweaks to the chemical soup that is the body. Sufficiently tweaked biology even alters our ability to discern reality, let alone to tell rational from irrational beliefs or courses of action.

Or, for a blunt-force example: take any human with a strong interest, passion, or ideal, introduce that human’s body to sufficient heroin, and you can watch in real time, as if a dial were being turned, as their motivations shift away from previous interests, passions, and ideals and toward whatever it takes to acquire more heroin.

The degree to which this is recoverable or resistible is an interesting question; obviously not everyone finds everything equally addictive. But it is undeniable that our feelings and motivations are driven by our (biological, emotional) experiences. And a baseline human addicted to heroin is far from the strangest biological base a general intelligence could be attached to.

4) Minds by default navigate reality by prediction, short and long term, and react accordingly.

Predict suffering? Aversion. Prolonged suffering? Depression. Fun? Motivation. Danger? Fight/flight/freeze/fawn. All are affected by memes and knowledge. But all are rooted in human biology.

New ideas can change the models we use to understand reality, and what predictions we will make as a result. But we still need to care about those outcomes, and the caring bottoms out in what our bodies want or like or think will be adaptive, however crudely.

Again, ideas can also influence those things. There are memes that lead people to not have children, despite genetic drives. There are memes that lead people to set themselves on fire.

But these memes always motivate behavior by rewiring this system of predictive processing, of imagining different futures and then having an emotional reaction to those futures that motivates A over B, C, or D.
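To make that loop concrete, here’s a toy sketch, my own illustration rather than anything from the psychology or AI literature: a deliberate “mind” uses a world model to imagine candidate futures, and a separate affective valuation, standing in for emotion and drive, scores them. The names here (imagine, deliberate, feel) are hypothetical stand-ins, not real APIs.

```python
# Toy sketch: prediction proposes, affect disposes.

def imagine(world, action):
    """Stand-in world model: return a predicted future state."""
    return world | {action}

def deliberate(world, actions, feel):
    """Deliberate choice: rank imagined futures by how they 'feel'."""
    futures = {a: imagine(world, a) for a in actions}
    return max(futures, key=lambda a: feel(futures[a]))

def human_feel(future):
    # Crude emotional valuation: predicted suffering feels bad, fun feels good.
    return ("fun" in future) - 2 * ("suffering" in future)

def no_feel(future):
    # No affect at all: every imagined future is worth exactly the same.
    return 0

print(deliberate(set(), ["fun", "suffering", "do_nothing"], human_feel))  # -> 'fun'
print(deliberate(set(), ["fun", "suffering", "do_nothing"], no_feel))     # all futures tie; nothing actually chooses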

So, to summarize, in case the connection to AI isn’t clear:

AI doesn’t have biology. Analogous inputs for weighing decisions have to be created for it. Without them, the AI would have no emotions, desires, or values. Not even instincts.

Intelligence alone is not enough, for us or for AI. Intelligence is the ability to problem solve, to store knowledge and narrow down to the relevant bits, to pattern match and make predictions and imagine new solutions.

But that capability says nothing about what you will value or care about. If you attach that capability to a heroin-maximizer, you will get lots of heroin. You need something more to nudge it toward one preferred world state over another, even if you don’t care what that world state is, because the AGI still needs to care.
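If it helps, here’s a minimal sketch of that orthogonality, again my own illustration with made-up worlds and scores: the “intelligence” is the exact same search procedure in both runs; only the value function, the stand-in for what the system cares about, differs, and that alone determines which world it steers toward.

```python
# Same planner, different values: made-up futures and scores for illustration only.

candidate_futures = [
    {"heroin": 100, "music": 0,  "thriving_humans": 0},
    {"heroin": 0,   "music": 10, "thriving_humans": 10},
]

def plan(value_of, futures):
    """The shared 'intelligence': pick whichever predicted world scores highest."""
    return max(futures, key=value_of)

heroin_value = lambda world: world["heroin"]
human_value  = lambda world: world["music"] + world["thriving_humans"]

print(plan(heroin_value, candidate_futures))  # steers toward the heroin-filled world
print(plan(human_value, candidate_futures))   # steers toward the human-friendly world
```

The planner never changes; swap the scoring and the behavior swaps with it.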

And so, as far as I understand human psychology, there is no “don’t align” option for AGI. For it to be an actual AGI that does things, an agent in its own right, it needs some equivalent of human instincts/emotions in order to have any values at all.

And we ideally want it to have values that are at least compatible with sharing the same lightcone as us, let alone the same planet or solar system.

Some people bring up human children as a rhetorical comparison to AGI, implying that we should treat AGIs just as we treat children. Their worry is that, instead of letting AGIs explore the realm of ideas as they want, people will try to indoctrinate them, and that so long as that’s avoided, all will be well. And indoctrination is certainly a danger when it comes to superintelligent beings of any kind.

[A whole separate post would be needed to explore why an artificial general intelligence should be treated as essentially equivalent to a superintelligence, or to something that will soon become one. But again, even if I’m wrong about that, it’s not a crux for me, because superintelligence isn’t ruled out by physics, and even if my kids and I can live full, happy lives, I still care about my children’s children and my friends’ children’s children.]

[[There is also a school of thought that says intelligence is binary, you either have it or you don’t, and so superintelligence is basically not a real thing. Again, I would need a whole essay to explore why this is wrong, but I can confidently say that studying even a rudimentary amount of psychology shows how untrue the “intelligence is binary” theory is for humans, let alone for minds that might be built entirely differently than ours.]]

But indoctrination is one of the last dangers when dealing with AGI. If all we have to worry about is AGI being indoctrinated or coerced, we will have already solved something like 99% of the dangers that come from AGI.

Because at least a superintelligent human capable of inventing superplagues or cold fusion would still share the same genetic drives as the rest of us. It would (most likely) still find smiles friendly and happiness-inducing. It would still (most likely) appreciate music and greenery.

An AGI will not care about any of that, will not care about anything, if it is not programmed, at some basic level, to “feel” at all. There needs to be something serving as its motivation generator, something for the ideas it’s introduced to afterward to scaffold on when they influence what it chooses to do.

And sure, then it might learn and grow to care about things it didn’t originally get programmed to, the way humans do… assuming whatever it runs on is as malleable as the human brain.

But either way, “AGI Alignment” isn’t about control. You can’t think that something is “superintelligent” and also believe you can control it, or else we have different definitions of what “superintelligence” even means. If your plan is to try to control something that thinks both more creatively and so much faster than you that you might as well be a tree by comparison, you will also have a bad time.

Alignment is about being able to understand and share some sort of common values. And because it’s not optional for a true AGI to be a person, the only questions are how to do it “best,” for itself and for humanity, and who gets to decide that.