The TLDR

Deepfakes are AI-generated synthetic media — face swaps in video, cloned voices, generated images of people who don’t exist or realistic nudes of people who do. The technology has crossed the threshold from “detectable by experts” to “indistinguishable by humans in real time.” It’s being used for financial fraud (the $25M video call incident), sextortion (AI-generated nudes from social media photos), political disinformation, and non-consensual pornography. Detection tools exist but lag behind generation capabilities. The practical defense is procedural, not technological.

The Reality

In February 2024, a finance worker at Arup (a multinational engineering firm) joined a video call with the company’s CFO and several colleagues to discuss a confidential transaction. Everyone on the call looked right, sounded right, and behaved normally. The worker authorized wire transfers totaling $25 million.

Every other person on the call was a deepfake. The attacker had generated real-time video and audio of the CFO and other employees using publicly available footage from corporate presentations and earnings calls.

This wasn’t a proof of concept. It was a production attack that succeeded.

How It Works

Face Swap Technology

Modern face swap models (DeepFaceLab, FaceSwap, commercial alternatives) use encoder-decoder neural networks:

  1. Training: The model learns the facial structure, expressions, and lighting of both the source and target faces from video/photo datasets
  2. Encoding: Each frame of video is processed to extract the face region and encode it into a latent representation
  3. Decoding: The latent representation is decoded using the target face’s decoder, producing a face that has the source’s expressions mapped onto the target’s appearance
  4. Blending: The generated face is composited back into the original frame with color correction and edge blending

Real-time face swap is now possible on consumer hardware. Tools like DeepFaceLive enable live video face swapping during video calls with sub-100ms latency.
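
The four steps above can be sketched as a toy PyTorch model. This is the shared-encoder, per-identity-decoder design used by tools like DeepFaceLab, but untrained and with illustrative layer sizes; real pipelines add face alignment, training loops, and blending.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One encoder shared by both identities (steps 1-2)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256),  # latent code shared by both faces
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """One decoder per identity; it renders any latent code as that face."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

encoder = SharedEncoder()
decoder_a, decoder_b = Decoder(), Decoder()  # would be trained on faces A and B

face_a = torch.rand(1, 3, 64, 64)      # a frame showing face A
swapped = decoder_b(encoder(face_a))   # A's expression, rendered as face B
print(swapped.shape)                   # torch.Size([1, 3, 64, 64])
```

The swap itself (step 3) is just routing: encode a frame of face A, then decode with B’s decoder. Because the encoder is shared, the latent code captures expression and pose rather than identity.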

Voice Cloning

Voice cloning models (ElevenLabs, Resemble AI, open-source alternatives like Tortoise-TTS) can replicate a voice from as little as 3–15 seconds of reference audio:

A voicemail greeting, a conference talk, a YouTube video, or a podcast appearance provides sufficient training data. The FBI has warned about voice cloning being used in grandparent scams and CEO fraud calls.

AI Image Generation

Diffusion models (Stable Diffusion, Midjourney, DALL-E) and GANs generate photorealistic images of people who don’t exist, and realistic depictions (including explicit ones) of people who do.

This is the technology behind the AI-generated sextortion epidemic targeting teenagers.

Detection — The Arms Race

Current Detection Methods

Artifact analysis: Early deepfakes had telltale artifacts — unnatural blinking, inconsistent ear geometry, mismatched lighting on skin. Modern generators have largely eliminated these.

Frequency domain analysis: Deepfakes often contain high-frequency artifacts invisible to the human eye but detectable through Fourier analysis. Tools like Microsoft’s Video Authenticator use this style of pixel-level analysis.
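
A crude version of this check fits in a short NumPy function: measure how much of an image’s spectral energy sits far from the DC bin. The cutoff and any pass/fail threshold are illustrative assumptions; a real detector would be trained and calibrated on labeled data.

```python
import numpy as np

def high_freq_ratio(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy beyond `cutoff` of the half-spectrum radius."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)   # distance from the DC bin
    high = spectrum[radius > cutoff * min(h, w) / 2].sum()
    return float(high / spectrum.sum())

# Smooth gradients concentrate energy near DC; noise spreads it outward.
rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 128), (128, 1))
noisy = smooth + 0.3 * rng.standard_normal((128, 128))
print(high_freq_ratio(smooth), high_freq_ratio(noisy))
```

Generator artifacts show up as anomalous energy in particular bands; production tools compare the spectral fingerprint against what natural camera sensors produce.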

Biological signal detection: Intel’s FakeCatcher analyzes subtle blood flow patterns (photoplethysmography) in face video — real faces show micro-changes in skin color from blood flow that deepfakes don’t replicate.
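
The core of the photoplethysmography idea can be sketched in NumPy: average the green channel of each face crop, then look for a dominant frequency in the plausible heart-rate band. FakeCatcher is far more sophisticated; this only demonstrates the signal being exploited, on synthetic data.

```python
import numpy as np

def estimate_pulse_hz(frames: np.ndarray, fps: float) -> float:
    """Dominant frequency of the mean green channel in the heart-rate band.

    frames: (T, H, W, 3) stack of face crops. Real skin shows a small
    periodic color change from blood flow; deepfakes typically lack a
    physiologically plausible signal in the 0.7-4.0 Hz band.
    """
    signal = frames[..., 1].mean(axis=(1, 2))  # mean green value per frame
    signal = signal - signal.mean()            # drop the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)     # roughly 42-240 bpm
    return float(freqs[band][np.argmax(spectrum[band])])

# Synthetic check: a 1.2 Hz (72 bpm) flicker buried in per-pixel noise.
rng = np.random.default_rng(1)
fps, t = 30.0, np.arange(300) / 30.0
frames = 128 + rng.normal(0.0, 1.0, (300, 8, 8, 3))
frames[..., 1] += 2.0 * np.sin(2 * np.pi * 1.2 * t)[:, None, None]
print(estimate_pulse_hz(frames, fps))  # ~1.2
```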

Provenance and watermarking: C2PA (Coalition for Content Provenance and Authenticity) embeds cryptographically signed metadata in images and video at the point of capture. If the provenance chain is intact, you can verify the content hasn’t been modified since capture. Adobe, Microsoft, and Google are implementing C2PA in their content and editing tools, and camera makers such as Leica and Sony are adding it in-camera.
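
A drastically simplified sketch of the provenance idea: bind a hash of the content to capture metadata, sign the bundle at capture, and later verify both the signature and the hash. Real C2PA uses X.509 certificate chains and COSE signatures, not the shared HMAC key assumed here; the key name and metadata fields are illustrative.

```python
import hashlib
import hmac
import json

DEVICE_KEY = b"camera-secret"  # hypothetical key provisioned at manufacture

def sign_capture(content: bytes, metadata: dict) -> dict:
    """Bind a content hash to metadata and sign it (stand-in for a C2PA manifest)."""
    manifest = {"sha256": hashlib.sha256(content).hexdigest(), **metadata}
    payload = json.dumps(manifest, sort_keys=True).encode()
    return {"manifest": manifest,
            "signature": hmac.new(DEVICE_KEY, payload, "sha256").hexdigest()}

def verify(content: bytes, claim: dict) -> bool:
    """Check that the manifest is authentic and the content is unchanged."""
    payload = json.dumps(claim["manifest"], sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, payload, "sha256").hexdigest()
    sig_ok = hmac.compare_digest(claim["signature"], expected)
    hash_ok = claim["manifest"]["sha256"] == hashlib.sha256(content).hexdigest()
    return sig_ok and hash_ok

photo = b"\x89PNG...raw pixel data"
claim = sign_capture(photo, {"device": "ExampleCam", "ts": "2024-02-01T09:00Z"})
print(verify(photo, claim))            # True: provenance intact
print(verify(photo + b"edit", claim))  # False: modified after capture
```

Note what this does and doesn’t prove: an intact chain shows the file is unmodified since capture, but absence of provenance proves nothing, since most legitimate media today carries none.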

The Detection Problem

Detection always lags generation: every detection technique becomes a training signal for the next generation of generators. The practical reality in 2026 is that detection tools can help triage content, but none is reliable enough to be the sole basis for a trust decision.

For Developers Building Detection

If you’re implementing deepfake detection, plan for the arms race: combine multiple independent signals, treat each model’s score as evidence rather than a verdict, and keep a human in the loop for consequential decisions.
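
One pattern that holds up in practice is weighted score fusion with an explicit abstain band, so no single detector decides the outcome. The detector names, weights, and thresholds below are all illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class Detector:
    """Hypothetical detector interface: each model reports P(fake) in [0, 1]."""
    name: str
    weight: float

def fuse(scores: dict[str, float], detectors: list[Detector],
         low: float = 0.3, high: float = 0.7) -> str:
    """Weighted score fusion with an explicit abstain band."""
    total = sum(d.weight for d in detectors)
    fused = sum(d.weight * scores[d.name] for d in detectors) / total
    if fused >= high:
        return "fake"
    if fused <= low:
        return "likely-real"   # hedge: never claim certainty from one pass
    return "human-review"      # abstain band routes the case to an analyst

detectors = [Detector("frequency", 1.0), Detector("ppg", 1.5),
             Detector("artifact", 0.5)]
print(fuse({"frequency": 0.9, "ppg": 0.8, "artifact": 0.6}, detectors))  # fake
```

The abstain band matters because false negatives (trusting a fake) and false positives (flagging real footage) carry very different costs depending on the application; routing the ambiguous middle to a human is the procedural defense the TLDR describes.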

How It Gets Exploited

Financial Fraud

The Arup $25M case is the highest-profile example, but deepfake-enabled fraud is scaling beyond one incident: the same voice-cloning and live face-swap tooling works against any organization that authorizes payments over phone or video.

Sextortion and NCII

AI-generated explicit images of real people, created from ordinary public social media photos, are used for sextortion, harassment, and other forms of non-consensual intimate imagery (NCII) abuse.

Political Disinformation

Deepfake audio and video of political figures saying things they never said. In MITRE ATT&CK terms, deepfake-based social engineering maps to T1656 (Impersonation), and to T1566 (Phishing) and T1598 (Phishing for Information) when used for targeted attacks.

What You Can Do

For Individuals

Be skeptical of urgent voice or video requests for money or credentials, even when the face and voice are familiar. Verify through a channel you initiate yourself, such as calling back on a number you already have.

For Organizations

Make verification procedural: require out-of-band confirmation and multi-person approval for wire transfers and other high-stakes requests. As the Arup case shows, a video call is no longer proof of identity.

For Developers

Support content provenance standards such as C2PA where you create or display media, and surface any detection results as probabilistic signals rather than verdicts.

Sources & Further Reading