Chat
Code
Vision

Gemma 3n E4B Instruct

Selective parameter activation delivers effective 2B/4B multimodal performance on low-resource devices, handling text, image, video, and audio input

About model

Gemma 3n E4B Instruct generates human-like text from input prompts and excels at conversational dialogue and creative writing tasks, making it well suited to developers and researchers who need advanced language generation capabilities.

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    google/gemma-3n-E4B-it

    curl -X POST "https://api.together.xyz/v1/chat/completions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "google/gemma-3n-E4B-it",
        "messages": [
          {
            "role": "user",
            "content": "What are some fun things to do in New York?"
          }
        ]
      }'
    
    from together import Together
    
    client = Together()
    
    response = client.chat.completions.create(
      model="google/gemma-3n-E4B-it",
      messages=[
        {
          "role": "user",
          "content": "What are some fun things to do in New York?"
        }
      ]
    )
    print(response.choices[0].message.content)
    
    import Together from 'together-ai';
    const together = new Together();
    
    const completion = await together.chat.completions.create({
      model: 'google/gemma-3n-E4B-it',
      messages: [
        {
          role: 'user',
          content: 'What are some fun things to do in New York?'
         }
      ],
    });
    
    console.log(completion.choices[0].message.content);
    
  • Model card

    Gemma 3n Description

    Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.

    They accept multimodal input, handling text, image, video, and audio, and generate text output, with open weights available for both pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

    Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain.
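The effective-size idea can be sketched in a toy way. The snippet below is an illustration of the concept only, not Gemma's actual mechanism: a layer stores more parameters than any single forward pass uses, so the compute seen at inference corresponds to the smaller, active subset.

```python
# Toy sketch of selective parameter activation (illustrative only, not
# Gemma's actual implementation): the layer stores 8 "parameter blocks"
# but a forward pass touches only the first `active` of them, so the
# effective parameter count is smaller than the total stored.
def forward(x: float, weights: list[float], active: int) -> float:
    """Compute an output using only the active subset of stored weights."""
    return sum(w * x for w in weights[:active])

stored = [0.1] * 8                    # 8 parameter blocks stored in total
y = forward(2.0, stored, active=4)    # only 4 participate in this pass
# effective size: 4 of 8 stored parameters are exercised
```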

    Inputs and Outputs

    Input:

    • Text string, such as a question, a prompt, or a document to be summarized
    • Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
    • Audio data encoded to 6.25 tokens per second from a single channel
    • Total input context of 32K tokens

    Output:

    • Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    • Total output length of up to 32K tokens, minus the tokens consumed by the request input
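The token accounting above can be sketched as a rough budget calculation. This assumes "32K" means 32,768 tokens and that text, image, and audio tokens simply add up; the hosted tokenizer's exact accounting may differ.

```python
import math

# Figures from the model card (assumed additive for this estimate):
IMAGE_TOKENS = 256          # per image, at 256x256, 512x512, or 768x768
AUDIO_TOKENS_PER_SEC = 6.25 # single-channel audio
CONTEXT_WINDOW = 32_768     # shared between input and output ("32K")

def input_tokens(text_tokens: int, n_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate total input tokens for a multimodal prompt."""
    audio = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SEC)
    return text_tokens + n_images * IMAGE_TOKENS + audio

def max_output_tokens(text_tokens: int, n_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Tokens left for generation once the prompt is counted."""
    return CONTEXT_WINDOW - input_tokens(text_tokens, n_images, audio_seconds)

# e.g. a 500-token prompt with two images and 30 s of audio:
# 500 + 2*256 + ceil(30*6.25) = 500 + 512 + 188 = 1200 input tokens,
# leaving 32768 - 1200 = 31568 tokens for the response.
```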

    Training Dataset

    These models were trained on a dataset that includes a wide variety of sources totalling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024.

    Key components:

    • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
    • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
    • Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation and address mathematical queries.
    • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
    • Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data.

    Data Preprocessing

    Key data cleaning and filtering methods applied to the training data:

    • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
    • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
    • Additional methods: Filtering based on content quality and safety in line with our policies.

    Implementation Information

    Hardware

    Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training generative models requires significant computational power. TPUs offer several advantages:

    • Performance: Specifically designed to handle the massive computations involved in training generative models
    • Memory: Large amounts of high-bandwidth memory for handling large models and batch sizes
    • Scalability: TPU Pods provide scalable solutions for handling growing complexity
    • Cost-effectiveness: A more cost-effective solution than CPU-based infrastructure

    Software

    Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.

    | Benchmark         | Metric         | n-shot   | E2B PT | E4B PT |
    |-------------------|----------------|----------|--------|--------|
    | HellaSwag         | Accuracy       | 10-shot  | 72.2   | 78.6   |
    | BoolQ             | Accuracy       | 0-shot   | 76.4   | 81.6   |
    | PIQA              | Accuracy       | 0-shot   | 78.9   | 81.0   |
    | SocialIQA         | Accuracy       | 0-shot   | 48.8   | 50.0   |
    | TriviaQA          | Accuracy       | 5-shot   | 60.8   | 70.2   |
    | Natural Questions | Accuracy       | 5-shot   | 15.5   | 20.9   |
    | ARC-c             | Accuracy       | 25-shot  | 51.7   | 61.6   |
    | ARC-e             | Accuracy       | 0-shot   | 75.8   | 81.6   |
    | WinoGrande        | Accuracy       | 5-shot   | 66.8   | 71.7   |
    | BIG-Bench Hard    | Accuracy       | few-shot | 44.3   | 52.9   |
    | DROP              | Token F1 score | 1-shot   | 53.9   | 60.8   |

    Intended Usage

    Content Creation and Communication

    • Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports
    • Image Data Extraction: Extract, interpret, and summarize visual data for text communications
    • Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data

    Research and Education

    • NLP Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice
    • Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics

    Limitations

    • Training Data: The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • Context and Task Complexity: Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • Language Ambiguity and Nuance: Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
    • Factual Accuracy: Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
    • Common Sense: Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
  • Model provider
    Google
  • Type
    Chat
    Code
    Vision
  • Main use cases
    Chat
    Small & Fast
    Vision
  • Deployment
    Serverless
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    4B
  • Context length
    32K
  • Input price

    $0.02 / 1M tokens

  • Output price

    $0.04 / 1M tokens

  • Input modalities
    Text
    Image
  • Output modalities
    Text
  • Released
    June 3, 2025
  • Quantization level
    FP8
  • Category
    Chat
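Given the listed serverless prices, a request's cost can be estimated roughly as follows. This is a sketch based on the per-token rates above; the platform's actual billing granularity may differ.

```python
# Listed serverless prices for this model (USD per 1M tokens):
INPUT_PER_M = 0.02
OUTPUT_PER_M = 0.04

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. 10,000 input + 2,000 output tokens:
# 0.01 * 0.02 + 0.002 * 0.04 = 0.0002 + 0.00008 = 0.00028 USD
```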