Learn how to use ML Kit’s text recognition to extract text from images, enabling features like search, translation, automated form entry, and content understanding.
-
Text Recognition with ML Kit for Android: Getting Started [FREE]
-
How Androidify leverages Gemini, Firebase and ML Kit
Posted by Thomas Ezan – Developer Relations Engineer, Rebecca Franks – Developer Relations Engineer, and Avneet Singh – Product Manager
We’re bringing back Androidify later this year, this time powered by Google AI, so you can customize your very own Android bot and share your creativity with the world. Today, we’re releasing a new open source demo app for Androidify as a great example of how Google is using its Gemini AI models to enhance app experiences.
In this post, we’ll dive into how the Androidify app uses Gemini models and Imagen via the Firebase AI Logic SDK, and we’ll provide some insights learned along the way to help you incorporate Gemini and AI into your own projects. Read more about the Androidify demo app.
App flow
The overall app functions as follows, with various parts of it using Gemini and Firebase along the way:
Gemini and image validation
To get started with Androidify, take a photo or choose an image on your device. The app needs to make sure that the image you upload is suitable for creating an avatar.
Gemini 2.5 Flash via Firebase helps with this by verifying that the image contains a person, checking that the person is in focus, and assessing image safety, including whether the image contains abusive content.
val jsonSchema = Schema.obj(
    properties = mapOf(
        "success" to Schema.boolean(),
        "error" to Schema.string(),
    ),
    optionalProperties = listOf("error"),
)

val generativeModel = Firebase.ai(backend = GenerativeBackend.googleAI())
    .generativeModel(
        modelName = "gemini-2.5-flash-preview-04-17",
        generationConfig = generationConfig {
            responseMimeType = "application/json"
            responseSchema = jsonSchema
        },
        safetySettings = listOf(
            SafetySetting(HarmCategory.HARASSMENT, HarmBlockThreshold.LOW_AND_ABOVE),
            SafetySetting(HarmCategory.HATE_SPEECH, HarmBlockThreshold.LOW_AND_ABOVE),
            SafetySetting(HarmCategory.SEXUALLY_EXPLICIT, HarmBlockThreshold.LOW_AND_ABOVE),
            SafetySetting(HarmCategory.DANGEROUS_CONTENT, HarmBlockThreshold.LOW_AND_ABOVE),
            SafetySetting(HarmCategory.CIVIC_INTEGRITY, HarmBlockThreshold.LOW_AND_ABOVE),
        ),
    )

val response = generativeModel.generateContent(
    content {
        text("You are to analyze the provided image and determine if it is acceptable and appropriate based on specific criteria.... (more details see the full sample)")
        image(image)
    },
)

// response.text is nullable; the sample assumes a valid JSON body is returned
val jsonResponse = Json.parseToJsonElement(response.text!!)
val isSuccess = jsonResponse.jsonObject["success"]?.jsonPrimitive?.booleanOrNull == true
val error = jsonResponse.jsonObject["error"]?.jsonPrimitive?.content
In the snippet above, we’re leveraging the structured output capabilities of the model by defining the schema of the response: we pass a Schema object via the responseSchema param in the generationConfig.
We want to validate that the image has enough information to generate a nice Android avatar, so we ask the model to return a JSON object with success = true/false and an optional error message explaining why the image doesn’t have enough information.
Structured output is a powerful feature that enables a smoother integration of LLMs into your app by controlling the format of their output, similar to an API response.
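For example, a rejected image might produce a response along the lines of {"success": false, "error": "No person is clearly visible in the image."} (the values here are purely illustrative, not actual model output).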
Image captioning with Gemini Flash
Once it’s established that the image contains sufficient information to generate an Android avatar, it is captioned using Gemini 2.5 Flash with structured output.
val jsonSchema = Schema.obj(
    properties = mapOf(
        "success" to Schema.boolean(),
        "user_description" to Schema.string(),
    ),
    optionalProperties = listOf("user_description"),
)

val generativeModel = createGenerativeTextModel(jsonSchema)

val prompt = "You are to create a VERY detailed description of the main person in the given image. This description will be translated into a prompt for a generative image model..."

val response = generativeModel.generateContent(
    content {
        text(prompt)
        image(image)
    },
)

val jsonResponse = Json.parseToJsonElement(response.text!!)
val isSuccess = jsonResponse.jsonObject["success"]?.jsonPrimitive?.booleanOrNull == true
val userDescription = jsonResponse.jsonObject["user_description"]?.jsonPrimitive?.content
The other option in the app is to start with a text prompt. You can enter details about your accessories, hairstyle, and clothing, and let Imagen be a bit more creative.
Android generation via Imagen
We’ll use this detailed description of your image to enrich the prompt used for image generation. We add extra details about what we would like to generate and include the bot color selection, including the skin tone selected by the user, as part of this prompt.
val imagenPrompt = "A 3D rendered cartoonish Android mascot in a photorealistic style, the pose is relaxed and straightforward, facing directly forward [...] The bot looks as follows $userDescription [...]"
We then call the Imagen model to create the bot. Using this new prompt, we create a model and call generateImages:
// we supply our own fine-tuned model here but you can use "imagen-3.0-generate-002"
val generativeModel = Firebase.ai(backend = GenerativeBackend.googleAI()).imagenModel(
    "imagen-3.0-generate-002",
    safetySettings = ImagenSafetySettings(
        ImagenSafetyFilterLevel.BLOCK_LOW_AND_ABOVE,
        personFilterLevel = ImagenPersonFilterLevel.ALLOW_ALL,
    ),
)

val response = generativeModel.generateImages(imagenPrompt)

val image = response.images.first().asBitmap()
And that’s it! The Imagen model generates a bitmap that we can display on the user’s screen.
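As a minimal sketch of how that result could be shown in the UI (assuming a Compose screen; the composable name below is illustrative, not from the Androidify source):

import android.graphics.Bitmap
import androidx.compose.foundation.Image
import androidx.compose.runtime.Composable
import androidx.compose.ui.graphics.asImageBitmap

// Illustrative sketch: renders the bitmap returned by generateImages() in Compose.
@Composable
fun GeneratedBotImage(botBitmap: Bitmap) {
    Image(
        bitmap = botBitmap.asImageBitmap(),
        contentDescription = "Generated Android bot avatar",
    )
}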
Finetuning the Imagen model
The Imagen 3 model was finetuned using Low-Rank Adaptation (LoRA). LoRA is a fine-tuning technique designed to reduce the computational burden of training large models. Instead of updating the entire model, LoRA adds smaller, trainable “adapters” that make targeted changes to the model’s behavior. We ran a fine-tuning pipeline on the generally available Imagen 3 model with Android bot assets of different color combinations and different assets for enhanced cuteness and fun. We generated text captions for the training images, and the image-text pairs were used to finetune the model effectively.
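As a general description of LoRA (not specific to how Imagen was fine-tuned internally), the technique keeps a pretrained weight matrix $W$ frozen and learns only a low-rank update:

$$W' = W + \Delta W = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

Because only the small matrices $A$ and $B$ are trained while $W$ stays frozen, the adapters add very few parameters compared to the full model, which is what keeps the fine-tuning computationally cheap.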
The current sample app uses a standard Imagen model, so the results may look a bit different from the visuals in this post. However, the app using the fine-tuned model and a custom version of the Firebase AI Logic SDK was demoed at Google I/O. This app will be released later this year, and we are also planning to add support for fine-tuned models to the Firebase AI Logic SDK later in the year.
The original image… and the Androidifi-ed image
ML Kit
The app also uses the ML Kit Pose Detection SDK to detect a person in the camera view, which triggers the capture button and adds visual indicators.
To do this, we add the SDK to the app, and use PoseDetection.getClient(). Then, using the poseDetector, we look at the detectedLandmarks in the streaming image coming from the camera, and we set _uiState.detectedPose to true if a nose and both shoulders are visible:
private suspend fun runPoseDetection() {
    PoseDetection.getClient(
        PoseDetectorOptions.Builder()
            .setDetectorMode(PoseDetectorOptions.STREAM_MODE)
            .build(),
    ).use { poseDetector ->
        // Since image analysis is processed by ML Kit asynchronously in its own thread pool,
        // we can run this directly from the calling coroutine scope instead of pushing this
        // work to a background dispatcher.
        cameraImageAnalysisUseCase.analyze { imageProxy ->
            imageProxy.image?.let { image ->
                val poseDetected = poseDetector.detectPersonInFrame(image, imageProxy.imageInfo)
                _uiState.update { it.copy(detectedPose = poseDetected) }
            }
        }
    }
}

private suspend fun PoseDetector.detectPersonInFrame(
    image: Image,
    imageInfo: ImageInfo,
): Boolean {
    val results = process(InputImage.fromMediaImage(image, imageInfo.rotationDegrees)).await()
    val landmarkResults = results.allPoseLandmarks
    val detectedLandmarks = mutableListOf<Int>()
    for (landmark in landmarkResults) {
        if (landmark.inFrameLikelihood > 0.7) {
            detectedLandmarks.add(landmark.landmarkType)
        }
    }
    return detectedLandmarks.containsAll(
        listOf(PoseLandmark.NOSE, PoseLandmark.LEFT_SHOULDER, PoseLandmark.RIGHT_SHOULDER),
    )
}
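The detectedPose flag in the UI state is what drives the capture button. Below is a minimal Compose sketch of how that wiring could look (the composable and parameter names are illustrative assumptions, not the actual Androidify UI code):

import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable

// Illustrative sketch: the capture button is only enabled once a person is detected in frame.
@Composable
fun CaptureButton(
    detectedPose: Boolean, // driven by the uiState updated in runPoseDetection()
    onCapture: () -> Unit,
) {
    Button(
        onClick = onCapture,
        enabled = detectedPose, // requires nose and both shoulders to be visible
    ) {
        Text("Capture")
    }
}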
The camera shutter button is activated when a person (or a bot!) enters the frame.
Get started with AI on Android
The Androidify app makes extensive use of Gemini 2.5 Flash to validate the image and to generate the detailed description used to drive image generation. It also leverages the specifically fine-tuned Imagen 3 model to generate images of Android bots. Gemini and Imagen models are easily integrated into the app via the Firebase AI Logic SDK. In addition, the ML Kit Pose Detection SDK controls the capture button, enabling it only when a person is present in front of the camera.
To get started with AI on Android, go to the Gemini and Imagen documentation for Android.
Explore this announcement and all Google I/O 2025 updates on io.google starting May 22.
-
On-device GenAI APIs as part of ML Kit help you easily build with Gemini Nano
Posted by Caren Chang – Developer Relations Engineer, Chengji Yan – Software Engineer, Taj Darra – Product Manager
We are excited to announce a set of on-device GenAI APIs, as part of ML Kit, to help you integrate Gemini Nano in your Android apps.
To start, we are releasing 4 new APIs:
- Summarization: to summarize articles and conversations
- Proofreading: to polish short text
- Rewriting: to reword text in different styles
- Image Description: to provide a short description for images
Key benefits of GenAI APIs
GenAI APIs are high-level APIs that allow for easy integration, similar to existing ML Kit APIs. This means you can expect quality results out of the box without extra effort for prompt engineering or fine-tuning for specific use cases.
GenAI APIs run on-device and thus provide the following benefits:
- Input, inference, and output data is processed locally
- Functionality remains the same without a reliable internet connection
- No additional cost incurred for each API call
To prevent misuse, we also added safety protection in various layers, including base model training, safety-aware LoRA fine-tuning, input and output classifiers, and safety evaluations.
How GenAI APIs are built
There are 4 main components that make up each of the GenAI APIs:
- Gemini Nano is the base model, serving as the foundation shared by all APIs.
- Small API-specific LoRA adapter models are trained and deployed on top of the base model to further improve the quality for each API.
- Optimized inference parameters (e.g. prompt, temperature, topK, batch size) are tuned for each API to guide the model in returning the best results.
- An evaluation pipeline ensures quality across various datasets and attributes. The pipeline consists of LLM raters, statistical metrics, and human raters.
Together, these components make up the high-level GenAI APIs that simplify the effort needed to integrate Gemini Nano in your Android app.
Evaluating quality of GenAI APIs
For each API, we formulate a benchmark score based on the evaluation pipeline mentioned above. This score is based on attributes specific to a task. For example, when evaluating the summarization task, one of the attributes we look at is “grounding” (i.e., factual consistency of the generated summary with the source content).
To provide out-of-the-box quality for the GenAI APIs, we applied feature-specific fine-tuning on top of the Gemini Nano base model. This resulted in an increase in the benchmark score of each API, as shown below:
| Use case (in English) | Gemini Nano Base Model | ML Kit GenAI API |
| --- | --- | --- |
| Summarization | 77.2 | 92.1 |
| Proofreading | 84.3 | 90.2 |
| Rewriting | 79.5 | 84.1 |
| Image Description | 86.9 | 92.3 |

In addition, this is a quick reference of how the APIs perform on a Pixel 9 Pro:
| | Prefix Speed (input processing rate) | Decode Speed (output generation rate) |
| --- | --- | --- |
| Text-to-text | 510 tokens/second | 11 tokens/second |
| Image-to-text | 510 tokens/second + 0.8 seconds for image encoding | 11 tokens/second |
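To make these figures concrete with a rough, illustrative estimate: summarizing a hypothetical 1,000-token article into a 50-token summary would take about 1000 / 510 ≈ 2 seconds of input processing plus 50 / 11 ≈ 4.5 seconds of output generation, so the full on-device request would complete in roughly 6 to 7 seconds.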
Sample usage
This is an example of implementing the GenAI Summarization API to get a one-bullet summary of an article:
val articleToSummarize = "We are excited to announce a set of on-device generative AI APIs..."

// Define task with desired input and output format
val summarizerOptions = SummarizerOptions.builder(context)
    .setInputType(InputType.ARTICLE)
    .setOutputType(OutputType.ONE_BULLET)
    .setLanguage(Language.ENGLISH)
    .build()
val summarizer = Summarization.getClient(summarizerOptions)

suspend fun prepareAndStartSummarization(context: Context) {
    // Check feature availability. Status will be one of the following:
    // UNAVAILABLE, DOWNLOADABLE, DOWNLOADING, AVAILABLE
    val featureStatus = summarizer.checkFeatureStatus().await()

    if (featureStatus == FeatureStatus.DOWNLOADABLE) {
        // Download feature if necessary.
        // If downloadFeature is not called, the first inference request will
        // also trigger the feature to be downloaded if it's not already
        // downloaded.
        summarizer.downloadFeature(object : DownloadCallback {
            override fun onDownloadStarted(bytesToDownload: Long) { }

            override fun onDownloadFailed(e: GenAiException) { }

            override fun onDownloadProgress(totalBytesDownloaded: Long) { }

            override fun onDownloadCompleted() {
                startSummarizationRequest(articleToSummarize, summarizer)
            }
        })
    } else if (featureStatus == FeatureStatus.DOWNLOADING) {
        // Inference request will automatically run once feature is downloaded.
        // If Gemini Nano is already downloaded on the device, the
        // feature-specific LoRA adapter model will be downloaded very
        // quickly. However, if Gemini Nano is not already downloaded,
        // the download process may take longer.
        startSummarizationRequest(articleToSummarize, summarizer)
    } else if (featureStatus == FeatureStatus.AVAILABLE) {
        startSummarizationRequest(articleToSummarize, summarizer)
    }
}

fun startSummarizationRequest(text: String, summarizer: Summarizer) {
    // Create task request
    val summarizationRequest = SummarizationRequest.builder(text).build()

    // Start summarization request with streaming response
    summarizer.runInference(summarizationRequest) { newText ->
        // Show new text in UI
    }

    // You can also get a non-streaming response from the request
    // val summarizationResult = summarizer.runInference(summarizationRequest)
    // val summary = summarizationResult.get().summary
}

// Be sure to release the resource when no longer needed
// For example, on viewModel.onCleared() or activity.onDestroy()
summarizer.close()
For more examples of implementing the GenAI APIs, check out the official documentation and the samples on GitHub.
Use cases
Here is some guidance on how to best use the current GenAI APIs:
For Summarization, consider:
- Conversation messages or transcripts that involve 2 or more users
- Articles or documents of fewer than 4000 tokens (or about 3000 English words). Using the first few paragraphs for summarization is usually good enough to capture the most important information (see the sketch after this list).
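For longer articles, a simple pre-truncation step can keep the input within that budget. The helper below is a rough sketch (the function name and the words-per-token ratio are assumptions derived from the 4000-token ≈ 3000-word guidance above), not part of the ML Kit API:

// Rough sketch: trims an article to roughly fit the suggested input budget
// before passing it to the Summarization API. The ratio is an approximation
// derived from "4000 tokens ≈ 3000 English words".
fun truncateForSummarization(article: String, maxTokens: Int = 4000): String {
    val maxWords = (maxTokens * 0.75).toInt() // ≈ 3000 words for 4000 tokens
    val words = article.split(Regex("\\s+"))
    return if (words.size <= maxWords) {
        article
    } else {
        // The first few paragraphs usually carry the key information,
        // so keeping only the leading portion is generally good enough.
        words.take(maxWords).joinToString(" ")
    }
}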
For the Proofreading and Rewriting APIs, consider using them during the content creation process on short content below 256 tokens, to help with tasks such as:
- Refining messages in a particular tone, such as more formal or more casual
- Polishing personal notes for easier consumption later
For the Image Description API, consider it for:
- Generating titles of images
- Generating metadata for image search
- Utilizing descriptions of images in use cases where the images themselves cannot be displayed, such as within a list of chat messages
- Generating alternative text to help visually impaired users better understand content as a whole
GenAI API in production
Envision is an app that verbalizes the visual world to help people who are blind or have low vision lead more independent lives. A common use case in the app is for users to take a picture to have a document read out loud. Utilizing the GenAI Summarization API, Envision is now able to get a concise summary of a captured document. This significantly enhances the user experience by allowing them to quickly grasp the main points of documents and determine if a more detailed reading is desired, saving them time and effort.
Supported devices
GenAI APIs are available on Android devices using optimized MediaTek Dimensity, Qualcomm Snapdragon, and Google Tensor platforms through AICore. For a comprehensive list of devices that support GenAI APIs, refer to our official documentation.
Learn more
Start implementing GenAI APIs in your Android apps today with guidance from our official documentation and samples on GitHub: AI Catalog GenAI API Samples with Compose and the ML Kit GenAI APIs Quickstart.