Chat
Code
Vision

Gemma 3n E4B Instruct

Selective parameter activation delivers effective 2B/4B multimodal performance on low-resource devices, handling text, image, video, and audio input

About model

Gemma 3n E4B Instruct generates human-like text from input prompts and excels at conversational dialogue and creative writing tasks, making it well suited to developers and researchers who need advanced language generation capabilities.

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    google/gemma-3n-E4B-it

    curl -X POST "https://api.together.xyz/v1/chat/completions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "google/gemma-3n-E4B-it",
        "messages": [
          {
            "role": "user",
            "content": "What are some fun things to do in New York?"
          }
        ]
      }'
    
    from together import Together
    
    client = Together()
    
    response = client.chat.completions.create(
      model="google/gemma-3n-E4B-it",
      messages=[
        {
          "role": "user",
          "content": "What are some fun things to do in New York?"
        }
      ]
    )
    print(response.choices[0].message.content)
    
    import Together from 'together-ai';
    const together = new Together();
    
    const completion = await together.chat.completions.create({
      model: 'google/gemma-3n-E4B-it',
      messages: [
        {
          role: 'user',
          content: 'What are some fun things to do in New York?'
         }
      ],
    });
    
    console.log(completion.choices[0].message.content);
    
  • Model card

    Gemma 3n Description

    Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.

    They accept multimodal input, handling text, image, video, and audio, and generate text output, with open weights available for both pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

    Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain.
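The effective-size idea can be sketched in a toy way. The snippet below is an illustration of the concept only, not Gemma's actual mechanism: a layer stores more parameters than any single forward pass uses, so the compute seen at inference corresponds to the smaller, active subset.

```python
# Toy sketch of selective parameter activation (illustrative only, not
# Gemma's actual implementation): the layer stores 8 "parameter blocks"
# but a forward pass touches only the first `active` of them, so the
# effective parameter count is smaller than the total stored.
def forward(x: float, weights: list[float], active: int) -> float:
    """Compute an output using only the active subset of stored weights."""
    return sum(w * x for w in weights[:active])

stored = [0.1] * 8                    # 8 parameter blocks stored in total
y = forward(2.0, stored, active=4)    # only 4 participate in this pass
# effective size: 4 of 8 stored parameters are exercised
```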

    Inputs and Outputs

    Input:

    • Text string, such as a question, a prompt, or a document to be summarized
    • Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
    • Audio data encoded to 6.25 tokens per second from a single channel
    • Total input context of 32K tokens

    Output:

    • Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    • Total output length of up to 32K tokens, minus the tokens consumed by the request input
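The token accounting above can be sketched as a rough budget calculation. This assumes "32K" means 32,768 tokens and that text, image, and audio tokens simply add up; the hosted tokenizer's exact accounting may differ.

```python
import math

# Figures from the model card (assumed additive for this estimate):
IMAGE_TOKENS = 256          # per image, at 256x256, 512x512, or 768x768
AUDIO_TOKENS_PER_SEC = 6.25 # single-channel audio
CONTEXT_WINDOW = 32_768     # shared between input and output ("32K")

def input_tokens(text_tokens: int, n_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate total input tokens for a multimodal prompt."""
    audio = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SEC)
    return text_tokens + n_images * IMAGE_TOKENS + audio

def max_output_tokens(text_tokens: int, n_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Tokens left for generation once the prompt is counted."""
    return CONTEXT_WINDOW - input_tokens(text_tokens, n_images, audio_seconds)

# e.g. a 500-token prompt with two images and 30 s of audio:
# 500 + 2*256 + ceil(30*6.25) = 500 + 512 + 188 = 1200 input tokens,
# leaving 32768 - 1200 = 31568 tokens for the response.
```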

    Training Dataset

    These models were trained on a dataset that includes a wide variety of sources totalling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024.

    Key components:

    • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
    • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
    • Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation and address mathematical queries.
    • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
    • Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data.

    Data Preprocessing

    Key data cleaning and filtering methods applied to the training data:

    • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
    • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
    • Additional methods: Filtering based on content quality and safety in line with our policies.

    Implementation Information

    Hardware

    Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training generative models requires significant computational power. TPUs offer several advantages:

    • Performance: Specifically designed to handle the massive computations involved in training generative models
    • Memory: Large amounts of high-bandwidth memory for handling large models and batch sizes
    • Scalability: TPU Pods provide scalable solutions for handling growing complexity
    • Cost-effectiveness: A more cost-effective solution than CPU-based infrastructure

    Software

    Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.

    | Benchmark         | Metric         | n-shot   | E2B PT | E4B PT |
    |-------------------|----------------|----------|--------|--------|
    | HellaSwag         | Accuracy       | 10-shot  | 72.2   | 78.6   |
    | BoolQ             | Accuracy       | 0-shot   | 76.4   | 81.6   |
    | PIQA              | Accuracy       | 0-shot   | 78.9   | 81.0   |
    | SocialIQA         | Accuracy       | 0-shot   | 48.8   | 50.0   |
    | TriviaQA          | Accuracy       | 5-shot   | 60.8   | 70.2   |
    | Natural Questions | Accuracy       | 5-shot   | 15.5   | 20.9   |
    | ARC-c             | Accuracy       | 25-shot  | 51.7   | 61.6   |
    | ARC-e             | Accuracy       | 0-shot   | 75.8   | 81.6   |
    | WinoGrande        | Accuracy       | 5-shot   | 66.8   | 71.7   |
    | BIG-Bench Hard    | Accuracy       | few-shot | 44.3   | 52.9   |
    | DROP              | Token F1 score | 1-shot   | 53.9   | 60.8   |

    Intended Usage

    Content Creation and Communication

    • Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports
    • Image Data Extraction: Extract, interpret, and summarize visual data for text communications
    • Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data

    Research and Education

    • NLP Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice
    • Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics

    Limitations

    • Training Data: The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • Context and Task Complexity: Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • Language Ambiguity and Nuance: Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
    • Factual Accuracy: Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
    • Common Sense: Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
  • Model provider
    Google
  • Type
    Chat
    Code
    Vision
  • Main use cases
    Chat
    Small & Fast
    Vision
  • Deployment
    Serverless
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    4B
  • Context length
    32K
  • Input price

    $0.02 / 1M tokens

  • Output price

    $0.04 / 1M tokens

  • Input modalities
    Text
    Image
  • Output modalities
    Text
  • Released
    June 3, 2025
  • Quantization level
    FP8
  • Category
    Chat
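Given the listed serverless prices, a request's cost can be estimated roughly as follows. This is a sketch based on the per-token rates above; the platform's actual billing granularity may differ.

```python
# Listed serverless prices for this model (USD per 1M tokens):
INPUT_PER_M = 0.02
OUTPUT_PER_M = 0.04

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. 10,000 input + 2,000 output tokens:
# 0.01 * 0.02 + 0.002 * 0.04 = 0.0002 + 0.00008 = 0.00028 USD
```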