The TLDR
Deepfakes are AI-generated synthetic media — face swaps in video, cloned voices, generated images of people who don’t exist or realistic nudes of people who do. The technology has crossed the threshold from “detectable by experts” to “indistinguishable by humans in real-time.” It’s being used for financial fraud (the $25M video call incident), sextortion (AI-generated nudes from social media photos), political disinformation, and non-consensual pornography. Detection tools exist but lag behind generation capabilities. The practical defense is procedural, not technological.
The Reality
In February 2024, a finance worker at Arup (a multinational engineering firm) joined a video call with the company’s CFO and several colleagues to discuss a confidential transaction. Everyone on the call looked right, sounded right, and behaved normally. The worker authorized wire transfers totaling $25 million.
Every person on the call was a deepfake. The attacker had generated real-time video and audio of the CFO and other employees using publicly available footage from corporate presentations and earnings calls.
This wasn’t a proof of concept. It was a production attack that succeeded.
How It Works
Face Swap Technology
Modern face swap models (DeepFaceLab, FaceSwap, commercial alternatives) use encoder-decoder neural networks:
- Training: The model learns the facial structure, expressions, and lighting of both the source and target faces from video/photo datasets
- Encoding: Each frame of video is processed to extract the face region and encode it into a latent representation
- Decoding: The latent representation is decoded using the target face’s decoder, producing a face that has the source’s expressions mapped onto the target’s appearance
- Blending: The generated face is composited back into the original frame with color correction and edge blending
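The shared-encoder / per-identity-decoder idea behind the pipeline above can be sketched in a few lines. This is a toy illustration with random, untrained weights (dimensions, layer shapes, and names are invented for clarity), not a working face swapper — real tools train these weights for days on thousands of frames:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a flattened 8x8 grayscale face patch, small latent space.
FACE_DIM, LATENT_DIM = 64, 16

# Shared encoder: both identities pass through the SAME weights, so the
# latent vector captures pose/expression/lighting rather than identity.
W_enc = rng.standard_normal((LATENT_DIM, FACE_DIM)) * 0.1

# Per-identity decoders: each learns to render one person's appearance.
W_dec_source = rng.standard_normal((FACE_DIM, LATENT_DIM)) * 0.1
W_dec_target = rng.standard_normal((FACE_DIM, LATENT_DIM)) * 0.1

def encode(face):
    """Map a face patch to its latent (expression/pose) representation."""
    return np.tanh(W_enc @ face)

def decode(latent, W_dec):
    """Render a face from a latent vector with one identity's decoder."""
    return W_dec @ latent

# The swap: encode a SOURCE frame, decode with the TARGET's decoder --
# the output has the target's appearance wearing the source's expression.
source_frame = rng.standard_normal(FACE_DIM)
latent = encode(source_frame)
swapped_face = decode(latent, W_dec_target)
```

The key design point is the asymmetry: one encoder, two decoders. Because the encoder never learns identity-specific features, swapping decoders swaps identity while preserving everything else in the frame.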
Real-time face swap is now possible on consumer hardware. Tools like DeepFaceLive enable live video face swapping during video calls with sub-100ms latency.
Voice Cloning
Voice cloning models (ElevenLabs, Resemble AI, open-source alternatives like Tortoise-TTS) can replicate a voice from as little as 3–15 seconds of reference audio:
- Text-to-speech cloning: Generate speech in a cloned voice from text input
- Voice conversion: Transform one speaker’s live speech into another’s voice in real-time
- Emotional modeling: Modern systems reproduce not just timbre but emotional inflection, pausing patterns, and speaking rhythm
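The reason a few seconds of audio suffices is that cloning systems first compress the reference into a fixed-size speaker embedding. The sketch below is a crude stand-in for that step (real systems use trained neural speaker encoders, not raw averaged spectra); all names and parameters here are illustrative:

```python
import numpy as np

SAMPLE_RATE = 16_000

def speaker_embedding(audio, frame_len=400, hop=160, n_bands=32):
    """Crude stand-in for a speaker encoder: average log-magnitude
    spectra across frames into one fixed-size 'voiceprint' vector."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]
    spectra = [np.abs(np.fft.rfft(f))[:n_bands] for f in frames]
    return np.log1p(np.mean(spectra, axis=0))

# Five seconds of (synthetic, for this demo) audio collapses to one
# small vector -- which is why a voicemail greeting is enough material.
rng = np.random.default_rng(1)
reference = rng.standard_normal(5 * SAMPLE_RATE)
emb = speaker_embedding(reference)
```

The takeaway: the embedding's size is fixed regardless of how much audio you supply, so the marginal value of more reference audio saturates quickly. That is the mechanism behind the "3–15 seconds" figure.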
A voicemail greeting, a conference talk, a YouTube video, or a podcast appearance provides sufficient training data. The FBI has warned about voice cloning being used in grandparent scams and CEO fraud calls.
AI Image Generation
Diffusion models (Stable Diffusion, Midjourney, DALL-E) and GANs generate photorealistic images of:
- People who don’t exist: Used for fake profiles in social engineering and romance scams
- People who do exist in fabricated scenarios: Using techniques like DreamBooth or LoRA fine-tuning, an attacker can generate realistic images of a real person in any setting — including explicit scenarios — from a handful of public photos
This is the technology behind the AI-generated sextortion epidemic targeting teenagers.
Detection — The Arms Race
Current Detection Methods
Artifact analysis: Early deepfakes had telltale artifacts — unnatural blinking, inconsistent ear geometry, mismatched lighting on skin. Modern generators have largely eliminated these.
Frequency domain analysis: Deepfakes often contain high-frequency artifacts invisible to the human eye but detectable through Fourier analysis. Tools like Microsoft’s Video Authenticator and Intel’s FakeCatcher use this approach.
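A minimal sketch of the frequency-domain idea, assuming a grayscale image as a 2D array (this is a simplified heuristic, not the proprietary method used by Video Authenticator or FakeCatcher): compute the 2D FFT and measure how much energy sits above a radial frequency cutoff.

```python
import numpy as np

def high_freq_energy_ratio(image, cutoff=0.25):
    """Fraction of spectral energy beyond a radial frequency cutoff.
    Generator upsampling layers often leave atypical high-frequency
    energy that this ratio can surface."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    return spectrum[radius > cutoff].sum() / spectrum.sum()

# A smooth natural-looking gradient vs. a noise-heavy synthetic-style patch.
smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
noisy = smooth + 0.5 * np.random.default_rng(2).standard_normal((64, 64))
```

In practice a classifier is trained on such spectral features rather than thresholding a single ratio, but the signal being exploited is the same.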
Biological signal detection: Intel’s FakeCatcher analyzes subtle blood flow patterns (photoplethysmography) in face video — real faces show micro-changes in skin color from blood flow that deepfakes don’t replicate.
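The photoplethysmography signal can be illustrated with a toy example. Assuming you have already extracted the mean green-channel value of the face region per frame (the green channel is most sensitive to blood volume changes), a live face shows a spectral peak in the heart-rate band; this sketch is illustrative, not Intel's actual pipeline:

```python
import numpy as np

FPS = 30

def pulse_band_power(green_means, fps=FPS, low=0.7, high=3.0):
    """Fraction of signal power in the 0.7-3.0 Hz band (42-180 bpm).
    Live faces show a pulse peak here; synthetic faces typically don't."""
    signal = green_means - np.mean(green_means)
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1 / fps)
    band = (freqs >= low) & (freqs <= high)
    return power[band].sum() / power[1:].sum()  # skip the DC bin

t = np.arange(10 * FPS) / FPS
live = 0.01 * np.sin(2 * np.pi * 1.2 * t)                       # ~72 bpm pulse
fake = 0.01 * np.random.default_rng(3).standard_normal(len(t))  # no pulse
```

Ten seconds of video at 30 fps is enough for the band-power ratio to separate the two signals cleanly in this toy setup; real footage adds motion and lighting noise that the production systems must filter out first.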
Provenance and watermarking: C2PA (Coalition for Content Provenance and Authenticity) embeds cryptographic metadata in images and video at the point of capture. If the provenance chain is intact, you can verify the content hasn’t been modified. Google, Adobe, and Microsoft are implementing C2PA in their cameras and editing tools.
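The provenance principle can be shown in miniature. Note the heavy simplification: real C2PA uses X.509 certificate chains and COSE signatures embedded in a manifest, not a shared HMAC key — this sketch only demonstrates the core property that any post-capture modification invalidates the signature:

```python
import hashlib
import hmac

# Stand-in for a device signing key; real C2PA uses certificate-backed
# public-key signatures, not a shared secret.
CAPTURE_KEY = b"device-secret"

def sign_at_capture(media_bytes):
    """Bind a signature to the media's hash at the point of capture."""
    digest = hashlib.sha256(media_bytes).digest()
    return hmac.new(CAPTURE_KEY, digest, hashlib.sha256).hexdigest()

def verify_provenance(media_bytes, claimed_sig):
    """Verify the media still matches its capture-time signature."""
    digest = hashlib.sha256(media_bytes).digest()
    expected = hmac.new(CAPTURE_KEY, digest, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, claimed_sig)

original = b"raw sensor frame"
sig = sign_at_capture(original)
```

An intact chain proves the content is unmodified since capture; a single changed byte breaks verification. The converse does not hold — missing provenance proves nothing, since most media today carries no manifest.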
The Detection Problem
Detection always lags generation. Every detection technique becomes a training signal for the next generation of generators. The practical reality in 2026:
- Expert analysis can still detect most deepfakes with time
- Automated detection tools catch roughly 80–90% of synthetic media but miss the most sophisticated examples
- Real-time detection during a live video call is unreliable
- The average person cannot distinguish high-quality deepfakes from real media
For Developers Building Detection
If you’re implementing deepfake detection:
- Don’t rely on a single detection method — ensemble approaches combining artifact analysis, frequency analysis, and provenance checking are more robust
- Assume adversarial evasion — attackers will test against your detection pipeline
- C2PA provenance is the strongest positive signal because it’s cryptographic, not heuristic — but absent provenance proves nothing, since most legitimate media carries no manifest
- False positives destroy trust — a detection system that flags real content as fake is worse than no detection
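The guidance above can be condensed into a decision function. This is a sketch with invented weights and thresholds, not a production policy — the point is the structure: provenance overrides heuristics, and the flag threshold is set high to protect against false positives:

```python
def ensemble_verdict(artifact_p, frequency_p, provenance_valid,
                     flag_threshold=0.9):
    """Combine heuristic detectors with a provenance check.
    artifact_p / frequency_p: each detector's estimated probability
    that the media is synthetic, in [0, 1]."""
    if provenance_valid:
        # An intact cryptographic provenance chain outranks heuristics.
        return False
    combined = 0.5 * artifact_p + 0.5 * frequency_p
    # Flag only on strong agreement: false positives destroy trust.
    return combined >= flag_threshold
```

Equal weights are a placeholder; in practice you would calibrate weights and the threshold against a labeled corpus, and revisit them regularly because the generators your detectors were tuned against will change.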
How It Gets Exploited
Financial Fraud
The Arup $25M case is the highest-profile example, but deepfake-enabled fraud is scaling:
- Voice cloning for CEO fraud calls (vishing with cloned executive voices)
- Deepfake video for KYC bypass (creating synthetic video of the account holder to pass identity verification)
- Real-time face swap during live video identity verification
Sextortion and NCII
AI-generated explicit images of real people — created from public social media photos — are used for:
- Sextortion targeting teenagers and adults
- Non-consensual intimate imagery (NCII) posted for harassment
- Revenge attacks using fabricated explicit content
Political Disinformation
Deepfake audio and video of political figures saying things they never said. In the MITRE ATT&CK framework, deepfake-based social engineering maps most directly to T1656 (Impersonation), and to T1566 (Phishing) and T1598 (Phishing for Information) when used for targeted attacks.
What You Can Do
For Individuals
- Establish a family passphrase — a code word shared only in person. If someone calls claiming to be a family member, ask for the passphrase.
- Verify financial requests through a separate channel — if the “CFO” asks for a wire transfer on a video call, call the CFO directly on their known phone number.
- Assume any unsolicited video call could be synthetic — especially if it involves urgent financial decisions.
- Minimize public video/audio — every public appearance is training data for a voice clone.
For Organizations
- Implement procedural controls for financial authorization — no wire transfer approved based solely on a video call, regardless of who appears to authorize it
- Dual authorization with out-of-band confirmation — two people must approve, and confirmation happens via a separate communication channel
- Train employees on deepfake awareness — the “this could be fake” instinct needs to become as automatic as “this email could be phishing”
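The first two controls above amount to a simple state machine: a transfer is executable only when two distinct approvers have signed off and an out-of-band confirmation is on record. A minimal sketch (class and field names are hypothetical, not a real workflow product):

```python
from dataclasses import dataclass, field

@dataclass
class WireRequest:
    """A transfer executes only after two DISTINCT approvers sign off
    AND an out-of-band confirmation has been recorded."""
    amount: float
    approvers: set = field(default_factory=set)
    out_of_band_confirmed: bool = False

    def approve(self, employee_id):
        self.approvers.add(employee_id)

    def confirm_out_of_band(self):
        # e.g. a call placed TO the executive's known number -- never to
        # a number (or video call) supplied by the requester.
        self.out_of_band_confirmed = True

    def may_execute(self):
        return len(self.approvers) >= 2 and self.out_of_band_confirmed

req = WireRequest(amount=25_000_000)
req.approve("alice")
req.approve("bob")
```

Because `approvers` is a set, the same person approving twice still counts once — a deepfaked CFO on a video call cannot satisfy the policy no matter how convincing the call is.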
For Developers
- Implement C2PA provenance in any media pipeline you build
- Use liveness detection in identity verification flows (not just “hold up your ID,” but challenge-response liveness)
- Don’t trust video as identity proof — video calls are no longer sufficient for high-value authorization
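The challenge-response idea from the list above can be sketched as follows. The specific challenges, deadline, and function names are illustrative — real liveness products combine many such signals — but the two defensive properties shown are the essential ones: the prompt is unpredictable, and the response is time-bounded:

```python
import secrets
import time

CHALLENGES = ["turn head left", "blink twice", "read digits: {code}"]

def issue_challenge():
    """Unpredictable prompt issued at verification time: a pre-rendered
    deepfake cannot have anticipated it."""
    template = secrets.choice(CHALLENGES)
    prompt = template.format(code=secrets.randbelow(10_000))
    return prompt, time.monotonic()

def accept_response(issued_at, responded_at, performed_correctly,
                    max_seconds=5.0):
    """Reject slow responses: puppeteering a real-time face-swap model
    adds latency that a tight deadline helps expose."""
    return performed_correctly and (responded_at - issued_at) <= max_seconds

challenge, t0 = issue_challenge()
```

Note that real-time face-swap tools can defeat naive challenges ("turn your head") because the swap tracks the attacker's actual head; randomized spoken content and latency bounds raise the bar further, but no single liveness check is conclusive on its own.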
Sources & Further Reading
- FBI IC3: Deepfake Advisory — federal warnings on deepfake-enabled fraud
- C2PA (Content Provenance and Authenticity) — the provenance standard for verifiable media
- MITRE ATLAS: Adversarial ML — adversarial machine learning threat framework
- Intel FakeCatcher — real-time deepfake detection using biological signals
- Sensity AI — deepfake detection platform and research
- EFF: Deepfakes and Free Speech — legal and rights framework for synthetic media