Amazon’s new, next-gen Alexa Plus just gained an easily overlooked trick on Fire TV devices: you can describe a scene from a movie and jump directly to that spot in Prime Video. It’s the type of intuitive, time-saving control that immediately feels obvious, and it’s precisely what Google TV with Gemini should be scrambling to replicate.
An Effortless Voice Command That’s Like Magic
Describe what’s happening on screen, quote a line, or mention a character, and Alexa Plus will find the timecode and take you right to that exact beat.

The capability, Amazon claims, covers thousands of movies at launch, with tens of thousands of scenes indexed, and TV show support is on the way. In earlier demos, Alexa also linked music to movies: it could recognize a song and jump to a film or scene tied to that tune. Now the last piece of that process is live for consumers.
The engineering lift here is real, too. Behind the scenes, this requires scene-level embeddings across subtitles, scripts, audio, and visual tokens, plus entity recognition (characters, places, props) and alignment to timecodes precise enough to land on a single line of dialogue. It’s multimodal retrieval for casual entertainment, not a lab demo, and it works with natural language rather than rigid commands.
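As a rough illustration of the retrieval step, here is a minimal sketch: the scene descriptions, timecodes, and bag-of-words “embeddings” are all hypothetical stand-ins for the learned multimodal encoders a real system would use.

```python
import math
from collections import Counter

# Toy stand-in for learned multimodal embeddings: bag-of-words vectors
# over scene descriptions. A production system would encode subtitles,
# audio, and visual tokens with neural models. All data is invented.
SCENES = [
    ("00:12:40", "the crew sketches the heist plan on a napkin"),
    ("00:47:05", "training montage in the rain before the final fight"),
    ("01:21:30", "silent dinner argument between the two leads"),
]

def embed(text: str) -> Counter:
    """Map text to a word-count vector (placeholder for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_timecode(query: str) -> str:
    """Return the timecode of the scene most similar to the query."""
    q = embed(query)
    best = max(SCENES, key=lambda scene: cosine(q, embed(scene[1])))
    return best[0]

print(find_timecode("jump to the training montage before the final fight"))
# → 00:47:05
```

The real difficulty lies in building and aligning the index, not in the lookup itself; once scenes map to embeddings and timecodes, retrieval is a nearest-neighbor search.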
Why Google TV and Gemini Must Do the Same
Google has the makings of something similarly strong. Gemini is already good at semantic understanding, and Google’s ecosystem spans YouTube, legacy Play Movies libraries, and an aggregation layer that pulls in basically every major streaming app on Google TV. An AI-fueled scene index built on Gemini could be just as compelling: “Jump to the silent dinner argument,” “Find the training montage before the final fight,” or “Show me the first time this actor appears.”
Consumers are spending more of their viewing time within streaming apps, and precision navigation reduces friction. Recent editions of Nielsen’s The Gauge have consistently shown streaming at about 38–40% of all TV usage in the US, which puts the scale of the opportunity in perspective. The Motion Picture Association’s most recent THEME report puts the global total of online video subscriptions above 1.3 billion, a reminder that even modest improvements to discovery and control compound at that scale.
Novelty aside, scene search serves real needs:

- Finding a quote without having to scrub through a show
- Replaying moments from before a spoiler occurred
- Skipping past an intense sequence so kids can keep watching without losing the thread of the story
- Getting back to an emotional, beautiful, or scary sequence for a closer second viewing
For a living room, voice is the perfect way to interact — no keyboard, no hunting through thumbnails, simply intent communicated in your own words.
The hurdles Google must overcome for scene search
There are obstacles. Aggregators such as Google TV must also work with streaming partners to surface deep links into specific timestamps. Not all services retain rich, time-coded metadata beyond subtitles, and licensor restrictions can complicate indexing. Latency matters in the living room, too: Gemini would need fast retrieval via on-device processing or an edge cache to feel instant on affordable hardware.
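To illustrate why caching matters for living-room latency, here is a toy sketch. The `resolve_scene` function, the 200 ms round trip, and the returned timecode are all invented for illustration; the point is only that a warm cache turns a slow cloud lookup into an instant local one.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def resolve_scene(query: str) -> str:
    """Hypothetical scene lookup; the sleep stands in for a cloud round trip."""
    time.sleep(0.2)        # simulated network + index latency
    return "00:47:05"      # pretend the remote index returned this timecode

start = time.perf_counter()
resolve_scene("training montage before the final fight")  # cold: hits "cloud"
cold_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
resolve_scene("training montage before the final fight")  # warm: cache hit
warm_ms = (time.perf_counter() - start) * 1000

print(f"cold: {cold_ms:.0f} ms, warm: {warm_ms:.3f} ms")
```

Preloading likely queries into such a cache, on device or at the edge, is one plausible way to make voice navigation feel instant even on inexpensive streaming hardware.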
Accuracy matters as much as privacy. Poor scene retrieval with accents, paraphrased quotes, or vague prompts will frustrate users right away. And although indexing transcripts and frames is technically achievable, partners will demand rigorous content protection as well as strict limits on data retention. These are solvable problems, but they take product, policy, and business alignment; this is not just a model drop-in.
What a Gemini-powered version could improve across apps
Google could jump ahead by bringing scene search to all apps, rather than just one service, and adding context. Imagine: “Show every scene with this character where they talk about the heist plan,” “Skip to the reveal without spoilers past episode 3,” or “Find that scene everyone mentions in reviews.” With YouTube’s enormous collection and that built-in chapter metadata, Google could also connect films, trailers, interviews, and analysis, offering viewers a richer route through a title.
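A spoiler-aware query like “skip to the reveal without spoilers past episode 3” could be sketched as a simple filter over an episode-tagged scene index. The index entries, field names, and matching logic below are hypothetical.

```python
# Hypothetical episode-tagged scene index; all entries are invented.
SCENE_INDEX = [
    {"episode": 2, "timecode": "00:31:10", "desc": "the heist plan is drawn up"},
    {"episode": 3, "timecode": "00:18:45", "desc": "the reveal at the dock"},
    {"episode": 5, "timecode": "00:40:02", "desc": "aftermath of the betrayal"},
]

def spoiler_safe(query_terms: set, max_episode: int) -> list:
    """Return matching scenes, excluding anything past max_episode."""
    return [
        scene for scene in SCENE_INDEX
        if scene["episode"] <= max_episode
        and query_terms & set(scene["desc"].split())
    ]

hits = spoiler_safe({"reveal"}, max_episode=3)
print([h["timecode"] for h in hits])  # the episode-5 scene is filtered out
```

The interesting product work is upstream of this filter: parsing the spoiler constraint out of a natural-language request and tagging scenes with reliable episode and plot metadata.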
Closer device coordination would also help. A Pixel phone or Nest speaker could hand a query off to a Chromecast, preloading results before you sit down so the TV picks up the moment you do. For accessibility, scene-level commands could be paired with descriptive audio and a small picture-in-picture preview to help even more users navigate.
Bottom line: why scene-level search will define TV control
With one simple move, Alexa Plus raised the bar for living-room AI by making media navigation as specific and conversational as asking a friend. It’s practical, delightful, and, crucially, repeatable. If Google is serious about making Gemini matter on the biggest screen in your home, this is the play to copy and expand. The platform that first puts scene-level search into every service will own the remote, and today Amazon leads by a nose.
