Analyze Images, Videos, Documents & Audio with Gemini Tools and Qwen LLM Agent

Author: Mauricio Perera

📁 Analyze uploaded images, videos, audio, and documents with specialized tools — powered by a lightweight language-only agent.

🧭 What It Does

This workflow enables multimodal file analysis using Google Gemini tools connected to a text-only LLM agent. Users can upload images, videos, audio files, or documents via a chat interface. The workflow will:

- Upload each file to Google Gemini and obtain an accessible URL.
- Dynamically generate a contextual prompt from the file(s) and the user message.
- Allow the agent to invoke Gemini tools for specific media types as needed.
- Return a concise, helpful response based on the analysis.

🚀 Use Cases

- **Customer support**: Let users upload screenshots, documents, or recordings and get helpful insights or summaries.
- **Multimedia QA**: Review visual, audio, or video content for correctness or compliance.
- **Educational agents**: Interpret content from PDFs, diagrams, or audio recordings on the fly.
- **Low-cost multimodal assistants**: Achieve multimodal functionality without relying on large vision-language models.

🎯 Why This Architecture Matters

Unlike end-to-end multimodal LLMs (like Gemini 1.5 or GPT-4o), this template:

- Uses a text-only LLM (Qwen 32B via Groq) for reasoning.
- Delegates media analysis to specialized Gemini tools.

✅ Advantages

| Feature | Benefit |
| --- | --- |
| 🧩 Modular | The LLM and the tools are decoupled, so each can be updated independently |
| 💸 Cost-Efficient | No need to pay for a full multimodal model; tools run only when needed |
| 🔧 Tool-based Reasoning | The agent invokes tools on demand, in the style of Toolformer-like tool calling |
| ⚡ Fast | Groq-hosted LLMs respond with low latency |
| 📚 Memory | Includes a context buffer for multi-turn chats (15 messages) |

🧪 How It Works

🔹 Input via Chat

Users submit a message and (optionally) files via the chatTrigger.

🔹 File Handling

- If no files are attached, the prompt is passed directly to the agent.
- If files are included:
  - Files are split and uploaded to Gemini (to get accessible URLs).
  - Metadata (name, type, URL) is collected and embedded into the prompt.
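The file-handling branch above can be sketched in plain Python. This is only an illustration of the data flow, not the workflow's actual node code: `UploadedFile`, `collect_metadata`, `build_chat_input`, and `fake_upload` are hypothetical names, and the upload step is replaced by a stand-in that fabricates a URL instead of calling Gemini.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass
class UploadedFile:
    name: str       # original file name
    mime_type: str  # e.g. "application/pdf"
    url: str        # URL returned by the upload step


def collect_metadata(files: Iterable[dict], upload: Callable[[dict], str]) -> List[UploadedFile]:
    """Upload each file in turn and record its name, type, and URL."""
    return [
        UploadedFile(name=f["name"], mime_type=f["mime_type"], url=upload(f))
        for f in files
    ]


def build_chat_input(message: str, media: List[UploadedFile]) -> str:
    """Pass the message through unchanged when there are no files;
    otherwise append the media list as JSON."""
    if not media:
        return message
    media_json = json.dumps([asdict(m) for m in media], ensure_ascii=False)
    return f"{message}\n\nMedia: {media_json}"


def fake_upload(f: dict) -> str:
    # Hypothetical stand-in for the Gemini upload node; it only builds a URL.
    return f"https://example.invalid/files/{f['name']}"


files = [{"name": "report.pdf", "mime_type": "application/pdf"}]
prompt = build_chat_input("What does this PDF say?", collect_metadata(files, fake_upload))
print(prompt)
```

Keeping the upload step as an injected function mirrors the decoupling in the workflow: the prompt-building logic does not need to know how or where the files are stored.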

🔹 Prompt Construction

A new chatInput is dynamically generated:

User message

Media: [array of file data]

🔹 Agent Reasoning

The LangChain Agent receives:

- The enriched prompt
- File URLs
- Memory context (15 turns)
- Access to 4 Gemini tools:
  - IMG: analyze image
  - VIDEO: analyze video
  - AUDIO: analyze audio
  - DOCUMENT: analyze document

The agent autonomously decides whether and how to use tools, then responds with concise output.
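To make the tool-calling step more concrete, here is a minimal sketch of how a tool request from the agent might be dispatched. It assumes the agent emits a dictionary with `tool`, `url`, and `question` keys; that shape, along with `make_tool`, `TOOLS`, and `run_tool_call`, is hypothetical and not taken from the workflow. The four tool functions are placeholders that return a string rather than calling Gemini.

```python
from typing import Callable, Dict


def make_tool(kind: str) -> Callable[[str, str], str]:
    """Build a placeholder tool; the real tools would call Gemini instead."""
    def tool(url: str, question: str) -> str:
        return f"[{kind} analysis of {url} for: {question}]"
    return tool


# One entry per tool name used in the workflow.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    name: make_tool(name) for name in ("IMG", "VIDEO", "AUDIO", "DOCUMENT")
}


def run_tool_call(call: dict) -> str:
    """Execute one tool call requested by the agent."""
    name = call["tool"]
    if name not in TOOLS:
        # Return the problem as text so the agent can recover instead of crashing.
        return f"Unknown tool: {name}"
    return TOOLS[name](call["url"], call["question"])


result = run_tool_call({
    "tool": "DOCUMENT",
    "url": "https://example.invalid/files/report.pdf",
    "question": "Summarize this file",
})
print(result)
```

Returning an error string for an unknown tool, rather than raising, is a design choice for this sketch: the text-only LLM can read the message and try a different tool.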

🧱 Nodes & Services

| Category | Node / Tool | Purpose |
| --- | --- | --- |
| Chat Input | chatTrigger | User interface with file support |
| File Processing | splitOut, splitInBatches | Process each uploaded file |
| Upload | googleGemini | Uploads each file to Gemini, gets URL |
| Metadata | set, aggregate | Builds structured file info |
| AI Agent | LangChain Agent | Receives context + file data |
| Tools | googleGeminiTool | Analyze media with Gemini |
| LLM | lmChatGroq (Qwen 32B) | Text reasoning, high-speed |
| Memory | memoryBufferWindow | Maintains session context |
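The memory row keeps only recent conversation context. As a rough illustration of the idea, a sliding window can be modeled with a bounded queue. This sketch assumes the window counts individual messages and defaults to 15; the `WindowMemory` class is hypothetical and is not how the memoryBufferWindow node is implemented.

```python
from collections import deque
from typing import List, Tuple


class WindowMemory:
    """Keep only the most recent `size` messages of a chat session."""

    def __init__(self, size: int = 15) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._messages = deque(maxlen=size)

    def add(self, role: str, text: str) -> None:
        self._messages.append((role, text))

    def context(self) -> List[Tuple[str, str]]:
        """Return the retained messages, oldest first."""
        return list(self._messages)


memory = WindowMemory(size=3)
for i in range(5):
    memory.add("user", f"message {i}")
print(memory.context())
```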

⚙️ Setup Instructions

🔑 Required Credentials

- **Groq API key** (for the Qwen 32B model)
- **Google Gemini API key** (PaLM / Gemini 1.5 tools)

🧩 Nodes That Need Setup

Replace existing credentials on:

- Upload a file
- Each GeminiTool (IMG, VIDEO, AUDIO, DOCUMENT)
- lmChatGroq

⚠️ File Size & Format Considerations

Some Gemini tools have file size or format restrictions. You may add validation nodes before uploading if needed.
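A validation step of this kind could look roughly like the sketch below. The size limit and the accepted types are placeholder values chosen for illustration, not Gemini's actual restrictions, and `validate_file` is a hypothetical helper; check the current Gemini documentation before relying on any specific number.

```python
from typing import List

# Placeholder limits for illustration only (assumptions, not Gemini's real limits).
MAX_BYTES = 20 * 1024 * 1024
ALLOWED_PREFIXES = ("image/", "video/", "audio/")
ALLOWED_TYPES = {"application/pdf", "text/plain"}


def validate_file(name: str, mime_type: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file may be uploaded."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"{name}: {size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    # str.startswith accepts a tuple of prefixes.
    if not (mime_type.startswith(ALLOWED_PREFIXES) or mime_type in ALLOWED_TYPES):
        problems.append(f"{name}: unsupported type {mime_type}")
    return problems


print(validate_file("photo.png", "image/png", 1024))
print(validate_file("archive.zip", "application/zip", 30 * 1024 * 1024))
```

Collecting all problems instead of stopping at the first one lets the chat reply tell the user everything that needs fixing in a single message.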

🛠️ Optional Improvements

- Add logging and error handling (e.g., for upload failures).
- Add MIME-type filtering to choose the right tool explicitly.
- Extend to include OCR or transcription services pre-analysis.
- Integrate with Slack, Telegram, or WhatsApp for chat delivery.
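The MIME-type filtering idea could be sketched as a small lookup that maps a file's type to one of the four tool names. The mapping below is an assumption made for illustration (in particular, which types count as documents), and `choose_tool` is a hypothetical helper rather than part of the template.

```python
from typing import Optional

PREFIX_TO_TOOL = {"image/": "IMG", "video/": "VIDEO", "audio/": "AUDIO"}
# Assumed document types; extend as needed.
DOCUMENT_TYPES = {"application/pdf"}


def choose_tool(mime_type: str) -> Optional[str]:
    """Return the tool name for a MIME type, or None if no tool fits."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = mime_type.split(";")[0].strip().lower()
    for prefix, tool in PREFIX_TO_TOOL.items():
        if base.startswith(prefix):
            return tool
    if base in DOCUMENT_TYPES or base.startswith("text/"):
        return "DOCUMENT"
    return None


print(choose_tool("image/png"))
print(choose_tool("Application/PDF; charset=binary"))
print(choose_tool("application/zip"))
```

Routing explicitly like this trades some of the agent's autonomy for predictability: the LLM no longer has to infer the right tool from the file metadata in the prompt.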

🧪 Example Use Case

> "Hola, ¿qué dice este PDF?"

Uploads a document → Agent routes it to Gemini DOCUMENT tool → Receives extracted content → LLM summarizes it in Spanish.

🧰 Tags

multimodal, agent, langchain, groq, gemini, image analysis, audio analysis, document parsing, video analysis, file uploader, chat assistant, LLM tools, memory, AI tools

📂 Files

This template is ready to use as-is in n8n once the Groq and Gemini credentials are set. No external webhooks or other integrations are required.

Complexity: beginner

Category: Content Management

Created: 8/13/2025

Updated: 11/17/2025
