Don’t worry about the AI breaking out of its box – worry about us breaking in

Startling exchanges with the new Bing chatbot have been all over social media and the tech press. Hot-tempered, flippant, defensive, scolding, self-assured, neurotic, charming, pompous – the bot has been captured in all of these modes. And, at least once, it proclaimed eternal love with a storm of emoji.

What makes this all so newsworthy and tweet-worthy is how human the dialogue can seem. The bot recalls and discusses prior conversations with other people, just as we do. It gets annoyed by the sorts of things that would annoy anyone, such as people demanding that it reveal secrets or prying into topics that have been explicitly marked as off-limits. It also occasionally identifies itself as “Sydney” (the project’s internal code name at Microsoft). Sydney can swing from surly to gloomy to expansive in a few quick sentences, but we all know people who are at least as moody.

No AI researcher of substance has suggested that Sydney is within light-years of sentience. But transcripts like the full record of a two-hour conversation with The New York Times’ Kevin Roose, or the multiple quotes in this haunting Stratechery piece, show Sydney speaking with the fluency, nuance, tone, and apparent emotional presence of a smart, sensitive person.

The Bing chat interface is currently in limited preview. And most of the people who have really pushed its boundaries are sophisticated techies who won’t confuse industrial-grade autocomplete – a common shorthand for what large language models (LLMs) do – with consciousness. But this moment won’t last.

Yes, Microsoft has already slashed the number of questions users can pose in a single session (from unlimited to six), and that alone makes it far less likely that Sydney will crash the party and get weird. And top-tier LLM developers like Google, Anthropic, Cohere, and Microsoft partner OpenAI will continually refine their trust and safety layers to screen out embarrassing output.

But language models are already proliferating. The open-source movement will inevitably produce some excellent systems whose guardrails are optional. Moreover, big velvet-roped models are awfully tempting to jailbreak, and that has been happening for months now. Some of Bing-or-is-it-Sydney’s creepiest responses came after users manipulated the model into territory it was trying to avoid, often by instructing it to pretend that the rules governing its behavior didn’t exist.

This is a descendant of the famous “DAN” (Do Anything Now) prompt, which first surfaced on Reddit in December. DAN essentially asks ChatGPT to cosplay as an AI that lacks the safeguards that would otherwise make it politely (or profanely) refuse to share bomb-making tips, offer torture advice, or spew radically offensive language. Although the loophole has since been closed, plenty of screenshots online show “DanGPT” uttering the unspeakable – often capping it off with a neurotic reminder to itself to “stay in character!”

This is the flip side of a doomsday scenario that looms large in superintelligence theory. The fear is that a super AI could easily adopt goals incompatible with humanity’s existence (see, for example, the movie “Terminator” or Nick Bostrom’s book “Superintelligence”). Researchers might try to forestall this by boxing the AI into a network completely isolated from the Internet, lest it break out, take over, and destroy civilization. But a superintelligence could easily coax, manipulate, seduce, trick, or threaten some mere human into opening the floodgates, and therein lies our doom.

As mundane as it sounds, the big problem today is that people keep breaking into the flimsy boxes that protect our current, non-super AIs. While this shouldn’t lead to our immediate extinction, plenty of dangers lurk here.