
Optimizing Real-Time Translation for AR Glasses: Enhancing Usability through Gesture and Content Algorithm


Bokwang Shim¹, Suhyeon Seo¹, Sungil Oh², and Yangkyu Lim¹


Duksung Women’s University, Seoul, South Korea  

Sungil Oh, Dosin-ro 17-gil, Yeongdeungpo-gu, Seoul, Republic of Korea. trumpetyk09@duksung.ac.kr

HCI International 2025, Gothenburg, Sweden. DOI: 10.1007/978-3-031-94150-4_16


Abstract. The aim of this study is to optimize an app for AR glasses that translates text-based signs, such as road signs and billboards, into a desired language in real time. With the release of devices like XREAL Air, Apple Vision Pro, and Meta Quest 3, AR technology has shown great potential for daily use. However, market research reveals high return rates due to a lack of engaging content. To address this, the project leverages AR's unique strengths by developing a real-time translation service to assist travelers in reading foreign text. It incorporates a gesture-based communication system, a history-based search algorithm for quick word processing, and a content management algorithm for unused text. The app overlays translated text on the original in the user's view. Trials with 15 users assess the system's convenience and demonstrate its ability to enhance AR's viability in everyday life.


Keywords: AR, translation, language, UI, gesture.  


1 Introduction 


In an increasingly globalized world, language barriers continue to present significant challenges in communication, particularly for travelers encountering unfamiliar signage and text in foreign environments. While traditional translation tools exist, they often interrupt the natural flow of interaction as users must shift their attention to separate devices. The emergence of Augmented Reality (AR) technology, particularly in the form of lightweight glasses like XREAL Air, Apple Vision Pro, and Meta Quest 3, presents an opportunity to develop more intuitive and seamless translation solutions. Despite the technological advances in AR hardware, market research indicates concerning return rates for these devices, primarily attributed to a lack of compelling practical applications. This study addresses this gap by developing an application that leverages AR's unique affordances to provide immediate value in everyday scenarios: specifically, by enabling real-time translation of environmental text such as road signs, menus, and billboards.

This paper proposes a comprehensive system that integrates AR glasses with OCR technology and advanced language models, optimized through gesture-based interfaces and content management algorithms. The system captures text from the user's environment through the AR glasses' camera, processes it using the Tesseract OCR library, transmits the extracted text to ChatGPT for translation, and then displays the translated content directly overlaid on the original text in the user's field of view. 
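As a sketch of this flow (not the authors' code), the C# below strings the stages together behind two hypothetical interfaces: IOcrEngine stands in for the Tesseract wrapper and ITranslator for the ChatGPT client. Error handling and spatial anchoring are omitted.

using System.Threading.Tasks;
using UnityEngine;

// Hypothetical abstractions: IOcrEngine stands in for the Tesseract wrapper,
// ITranslator for the ChatGPT client. Neither is an actual API of those projects.
public interface IOcrEngine { Task<string> RecognizeAsync(Texture2D frame); }
public interface ITranslator { Task<string> TranslateAsync(string text, string targetLanguage); }

public class TranslationPipeline
{
    readonly IOcrEngine _ocr;
    readonly ITranslator _translator;

    public TranslationPipeline(IOcrEngine ocr, ITranslator translator)
    {
        _ocr = ocr;
        _translator = translator;
    }

    // Camera frame -> OCR -> translation; the caller overlays the result on the source text.
    public async Task<string> ProcessFrameAsync(Texture2D cameraFrame, string targetLanguage)
    {
        string extracted = await _ocr.RecognizeAsync(cameraFrame);          // OCR step
        if (string.IsNullOrWhiteSpace(extracted)) return null;              // nothing readable
        return await _translator.TranslateAsync(extracted, targetLanguage); // translation step
    }
}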

 

The system offers several advantages over existing translation solutions: 

1.  Hands-free operation through gesture-based interaction 

2.  Real-time processing and display of translations 

3.  Contextual awareness through visual integration with the environment 

4.  History-based search algorithm for improved processing speed 

5.  Content management algorithm for efficient handling of unused text 

 

This work contributes to the field of wearable computing and language technologies by demonstrating a practical implementation of AR-based translation and evaluating its performance and usability in real-world scenarios with actual users. 


2 Related Work 


2.1 Augmented Reality Translation Systems 


Several research efforts have explored the use of AR for translation purposes. Early systems like Word Lens (later acquired by Google and incorporated into Google Translate) demonstrated the feasibility of using smartphone cameras for real-time text translation. More recently, researchers have explored head-mounted displays for translation purposes, though few have specifically addressed optimization for everyday use cases like road sign translation.


2.2 Optical Character Recognition for Environmental Text 


OCR technology has advanced significantly in recent years, with open-source libraries like Tesseract providing robust text recognition capabilities. However, environmental text presents unique challenges compared to document scanning, including variable lighting conditions, perspective distortion, and non-standard fonts. Our work builds upon existing OCR approaches by specifically optimizing for these challenges in the context of signage and public information displays. 


2.3 Gesture-Based Interaction in AR 


Traditional input methods such as touchpads or voice commands can be cumbersome or socially awkward in public settings. Recent work in gesture recognition has shown promise for natural interaction with AR systems. Our implementation incorporates a gesture-based communication system specifically designed for the translation context, allowing users to trigger translations, switch languages, and manage history without requiring external controllers. 


2.4 Content Management in AR Applications 


The limited processing power and field of view in current AR glasses necessitate efficient content management. Previous research has developed various approaches to prioritizing visual information in AR. Our system contributes a novel content management algorithm specifically designed to handle unused text translations, ensuring that the user's visual field remains uncluttered while maintaining quick access to recently used translations. 


3 System Architecture 


Our proposed system consists of four main components: (1) the AR glasses hardware (primarily tested on XREAL Air), (2) the Unity-based application, (3) the OCR processing module using Tesseract, and (4) the translation module using ChatGPT. Figure 1 illustrates the overall system architecture and data flow.


3.1 Hardware: AR Glasses 


The system is designed to be compatible with various AR glasses, with primary testing conducted on the XREAL Air. These glasses feature a lightweight design with an RGB camera capable of capturing high-quality images of text in the user's environment. The system is also compatible with other leading AR platforms including Apple Vision Pro and Meta Quest 3, though with device-specific optimizations. 


3.2 Software: Unity Application 


The Unity application serves as the central coordination point for the entire system. It handles the following key functions:

1.  Camera input processing from the AR glasses

2.  Environment mapping and spatial anchoring for text overlay 

3.  Gesture recognition for system control 

4.  Communication with the OCR and translation modules 

5.  History-based search algorithm implementation 

6.  Content management algorithm for unused text 

7.  Rendering of translated text in the appropriate spatial location 


3.3 Gesture-Based Communication System 


A key innovation in our system is the gesture-based interaction model, which allows users to control the translation application without requiring external controllers or awkward voice commands. The gesture system includes: 

1.  Translation trigger: A simple pointing gesture followed by a tap motion initiates translation of text in the center of the user's field of view. 

2.  Language switching: A horizontal swipe gesture cycles through available target languages.

3.  History access: A vertical swipe down reveals previously translated text items.

4. Dismissal: A flick-away gesture removes unwanted translations from view. 

 

These gestures are recognized using a combination of the AR glasses' built-in motion sensors and computer vision analysis of hand movements within the camera's field of view. 


3.4 History-Based Search Algorithm 


To optimize performance and reduce latency, the system implements a history-based search algorithm that: 

 

1.  Maintains a local database of previously translated text items 

2.  Performs fuzzy matching against new text to identify similar previous translations 

3.  Uses contextual information (location, time of day, etc.) to prioritize likely matches

4.  Delivers instant translations for previously encountered text while initiating new translation requests in the background

 

This approach significantly reduces the perceived latency for commonly encountered text such as frequent road signs, menu items, or public notices. 
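The fuzzy-matching step can be grounded in the Levenshtein edit distance named in Section 4.2. The following is a generic C# helper, not taken from the authors' implementation; the 0.85 similarity threshold is an illustrative choice.

using System;

public static class FuzzyMatch
{
    // Classic dynamic-programming Levenshtein edit distance.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Normalize to [0, 1]; 1.0 means identical strings.
    public static double Similarity(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / maxLen;
    }

    // Example threshold: treat >= 0.85 similarity as "the same sign seen before".
    public static bool IsLikelyMatch(string a, string b) => Similarity(a, b) >= 0.85;
}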


3.5 Content Management Algorithm 

The content management algorithm addresses the challenge of limited screen real estate and potential visual clutter by: 

 

1.  Prioritizing translations based on recency, frequency of access, and visual prominence 

2.  Automatically fading out unused translations after a configurable time period 

3.  Storing dismissed translations in an easily accessible history 

4.  Clustering related translations (e.g., items on the same menu) for efficient navigation

5.  Adapting the visual presentation based on environmental factors such as lighting and background complexity
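As one way to realize the prioritization in the first item above, recency, access frequency, and visual prominence can be folded into a single score. The sketch below is purely illustrative; the weights and the saturation point are assumptions, not values reported in this paper.

using System;

public static class DisplayPriority
{
    // Hypothetical weighting of the three factors; tune per device and use case.
    const float RecencyWeight = 0.5f, FrequencyWeight = 0.3f, ProminenceWeight = 0.2f;

    // secondsSinceUse: time since the translation was last read or refreshed.
    // accessCount: how often the user has recalled this translation.
    // prominence: 0..1 estimate of how large/central the source text is in view.
    public static float Score(float secondsSinceUse, int accessCount, float prominence)
    {
        float recency   = 1f / (1f + secondsSinceUse);       // decays toward 0
        float frequency = Math.Min(accessCount / 10f, 1f);   // saturates at 10 uses
        return RecencyWeight * recency
             + FrequencyWeight * frequency
             + ProminenceWeight * prominence;                // higher = keep visible longer
    }
}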


4 Implementation Details 


4.1 Unity Implementation 


The Unity application is structured around a modular architecture to facilitate future enhancements and maintain code clarity. Key components include: 

1.  CameraManager: Handles camera input from the AR glasses and prepares images for OCR processing. 

2.  GestureRecognizer: Processes sensor data and camera input to detect and interpret user gestures.

3.  TextDetectionManager: Coordinates the OCR process and manages the extracted text.

4.  TranslationManager: Handles communication with the ChatGPT API and processes translations. 

5.  HistoryManager: Implements the history-based search algorithm and maintains the translation database. 

6.  ContentManager: Applies the content management algorithm to optimize the display of translations. 

7.  ARTextRenderer: Manages the rendering of translated text in the AR environment.


Table 1. Pseudocode of the gesture-recognition C# code implemented in Unity.

1. Initialize 

-  Set tap detection threshold and swipe detection threshold 

-  Declare variables for start position, start time, and tracking state 

-  Initialize translation manager and history manager 

 

2. On Program Start: 

-  Retrieve translation manager and history manager 

 

3. Every Frame (Update Function)

-  Check if a hand is detected

-  If detected and the hand is pointing → Start tracking 

-  If detected and already tracking → Process gesture 

-  If the hand is not detected → Stop tracking 

 

4. Start Gesture Tracking (_startTracking) 

-  Store the current hand position 

-  Record the start time 

-  Activate tracking 

 

5. Process Gesture (ProcessGesture) 

-  Calculate the difference between the current and start positions
-  If the movement is minimal and a tap is detected → Execute translation
-  If a horizontal swipe is detected:
   -  Moving right → Select next language
   -  Moving left → Select previous language
-  If a downward swipe is detected → Show translation history
-  If a quick flick movement is detected → Dismiss current translation
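Read back into Unity C#, the pseudocode above might take roughly the following shape. This is a hedged reconstruction rather than the authors' code: the HandTracker, TranslationManager, and HistoryManager types are stubs standing in for the actual hand-tracking SDK and application managers, and the thresholds are illustrative.

using UnityEngine;

public class GestureController : MonoBehaviour
{
    [SerializeField] float tapThreshold = 0.02f;    // metres of allowed drift for a tap
    [SerializeField] float swipeThreshold = 0.15f;  // metres of travel to count as a swipe

    Vector3 _startPos;
    float _startTime;
    bool _tracking;

    // Hypothetical dependencies; stand-ins for the actual managers and hand-tracking SDK.
    public TranslationManagerStub translation;
    public HistoryManagerStub history;
    public HandTrackerStub hand;

    void Update()
    {
        if (hand == null || !hand.IsDetected) { _tracking = false; return; }
        if (!_tracking && hand.IsPointing) StartTracking();
        else if (_tracking) ProcessGesture();
    }

    void StartTracking()
    {
        _startPos = hand.Position;   // store the current hand position
        _startTime = Time.time;      // record the start time
        _tracking = true;            // activate tracking
    }

    void ProcessGesture()
    {
        Vector3 delta = hand.Position - _startPos;

        if (delta.magnitude < tapThreshold && hand.TapDetected)
            translation.TranslateCenterOfView();                 // tap -> translate
        else if (Mathf.Abs(delta.x) > swipeThreshold)
            translation.CycleLanguage(delta.x > 0 ? +1 : -1);    // horizontal swipe -> language
        else if (delta.y < -swipeThreshold)
            history.ShowHistory();                               // downward swipe -> history
        else if (delta.magnitude / (Time.time - _startTime + 1e-3f) > 2f)
            translation.DismissCurrent();                        // fast flick -> dismiss
        else
            return;                                              // gesture not yet complete

        _tracking = false;                                       // gesture consumed
    }
}

// Minimal stand-ins so the sketch compiles; a real project would use its own managers and SDK.
public class TranslationManagerStub : MonoBehaviour
{
    public void TranslateCenterOfView() { }
    public void CycleLanguage(int direction) { }
    public void DismissCurrent() { }
}
public class HistoryManagerStub : MonoBehaviour { public void ShowHistory() { } }
public class HandTrackerStub : MonoBehaviour
{
    public bool IsDetected, IsPointing, TapDetected;
    public Vector3 Position;
}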

4.2 History-Based Search Algorithm


The history-based search algorithm is implemented to reduce translation latency and improve user experience: 


Table 2. Pseudocode of the history-based search algorithm C# code implemented in Unity.

1. Initialize 

-  Create a dictionary to store translation history 

-  Create a list to store recent translations 

-  Define the maximum number of recent translations 

 

2. Define TranslationEntry Class 

   - Stores original text, translated text, language, access count, last accessed time, and last location 

 

3. Retrieve Existing Translation (TryGetExistingTranslation) 

-  Generate a key using the text and target language
-  Check if an exact match exists in the translation history
   -  If found, update access statistics and return the translation
-  If no exact match, perform fuzzy matching:
   -  Compare the text similarity using Levenshtein distance
   -  If a close match is found, update access statistics and return the translation
-  If no match is found, return false

 

4. Add a New Translation (AddTranslation) 

-  Generate a key using the original text and language
-  Create a new translation entry with metadata (timestamp, location, etc.)
-  Store the entry in the history dictionary
-  Update the recent translations list:
   -  Remove duplicates, insert the new entry at the top
   -  If the list exceeds the maximum limit, remove the oldest entry

 

5. Update Access Statistics (UpdateAccessStats) 

-  Increment access count 

-  Update last accessed time and location 

 

6. Generate a Key for a Translation (GetKey) 

   - Convert text to lowercase, trim spaces, and concatenate with language 

 

7. Display Translation History (ShowHistory) 

-  Retrieve recent translations from history 

-  Display them using the UI manager 
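A compact C# rendering of this logic is sketched below. The field names and the recent-list cap are illustrative rather than the authors' values, and the fuzzy step reuses the FuzzyMatch helper sketched in Section 3.4.

using System;
using System.Collections.Generic;
using System.Linq;

public class TranslationEntry
{
    public string Original, Translated, Language, LastLocation;
    public int AccessCount;
    public DateTime LastAccessed;
}

public class TranslationHistory
{
    const int MaxRecent = 20;  // illustrative cap on the "recent translations" list

    readonly Dictionary<string, TranslationEntry> _entries =
        new Dictionary<string, TranslationEntry>();
    readonly List<TranslationEntry> _recent = new List<TranslationEntry>();

    // Key = normalized text + target language (lowercase, trimmed).
    static string Key(string text, string lang) =>
        text.Trim().ToLowerInvariant() + "|" + lang;

    public bool TryGetExistingTranslation(string text, string lang, out string translated)
    {
        // 1. Exact match on the normalized key.
        if (_entries.TryGetValue(Key(text, lang), out var exact))
        {
            Touch(exact); translated = exact.Translated; return true;
        }

        // 2. Fuzzy match via Levenshtein similarity (FuzzyMatch helper from Section 3.4 sketch).
        var best = _entries.Values
            .Where(e => e.Language == lang)
            .Select(e => new { Entry = e, Score = FuzzyMatch.Similarity(e.Original, text) })
            .Where(x => x.Score >= 0.85)
            .OrderByDescending(x => x.Score)
            .FirstOrDefault();
        if (best != null) { Touch(best.Entry); translated = best.Entry.Translated; return true; }

        translated = null;
        return false;
    }

    public void AddTranslation(string original, string translated, string lang, string location = null)
    {
        var entry = new TranslationEntry
        {
            Original = original, Translated = translated, Language = lang,
            AccessCount = 1, LastAccessed = DateTime.UtcNow, LastLocation = location
        };
        _entries[Key(original, lang)] = entry;

        // Keep the recent list deduplicated, newest first, and bounded in size.
        _recent.RemoveAll(e => e.Original == original && e.Language == lang);
        _recent.Insert(0, entry);
        if (_recent.Count > MaxRecent) _recent.RemoveAt(_recent.Count - 1);
    }

    public IReadOnlyList<TranslationEntry> Recent => _recent;

    void Touch(TranslationEntry e) { e.AccessCount++; e.LastAccessed = DateTime.UtcNow; }
}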

4.3 Content Management Algorithm


The content management algorithm optimizes the display of translated text to prevent visual clutter: 


Table 3. Pseudocode of the content management algorithm C# code implemented in Unity.

1. Initialize 

-  Set timeout duration for unused translations 

-  Set fade-out duration 

-  Set the maximum number of visible translations 

-  Create a list to store active translation displays 

 

2. Define TranslationDisplay Class 

   - Stores the display object, text, position, last interaction time, and fade status 

 

3. Every Frame (Update Function) 

-  Get the current time
-  Check each active display:
   -  If not fading and inactive beyond the timeout, start fading it out
-  Remove displays that have completed fading
-  If there are too many active displays:
   -  Find the oldest non-fading display
   -  Start fading it out

 

4. Add a New Translation Display (AddTranslationDisplay) 

-  Create a new translation display with given text, object, and position 

-  Store it in the active displays list 

 

5. Refresh Interaction Time (RefreshInteraction) 

-  Find the corresponding display 

-  Update its last interaction time 

-  If it was fading, stop all coroutines (cancel fade-out) 
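The timeout-and-fade loop can be sketched in Unity C# as follows. The durations, the CanvasGroup-based fade, and the flag-based cancellation are illustrative choices; the authors' code may differ (for example, by stopping coroutines directly, as the pseudocode suggests).

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class TranslationDisplayManager : MonoBehaviour
{
    [SerializeField] float timeoutSeconds = 10f;   // inactivity before fade starts
    [SerializeField] float fadeSeconds = 1.5f;     // fade-out duration
    [SerializeField] int maxVisible = 5;           // cap on simultaneous overlays

    class Display
    {
        public GameObject Root;
        public CanvasGroup Group;     // used to animate alpha
        public float LastInteraction;
        public bool Fading;
    }

    readonly List<Display> _active = new List<Display>();

    void Update()
    {
        float now = Time.time;

        // Fade out anything unused for too long, then prune displays that finished fading.
        foreach (var d in _active)
            if (!d.Fading && now - d.LastInteraction > timeoutSeconds)
                StartCoroutine(FadeOut(d));
        _active.RemoveAll(d => d.Root == null);

        // If still over budget, fade the stalest non-fading display.
        if (_active.Count > maxVisible)
        {
            Display oldest = null;
            foreach (var d in _active)
                if (!d.Fading && (oldest == null || d.LastInteraction < oldest.LastInteraction))
                    oldest = d;
            if (oldest != null) StartCoroutine(FadeOut(oldest));
        }
    }

    public void AddDisplay(GameObject root, CanvasGroup group)
    {
        _active.Add(new Display { Root = root, Group = group, LastInteraction = Time.time });
    }

    public void RefreshInteraction(GameObject root)
    {
        var d = _active.Find(x => x.Root == root);
        if (d == null) return;
        d.LastInteraction = Time.time;
        if (d.Fading) { d.Fading = false; d.Group.alpha = 1f; }   // cancel an in-progress fade
    }

    IEnumerator FadeOut(Display d)
    {
        d.Fading = true;
        for (float t = 0f; t < fadeSeconds && d.Fading; t += Time.deltaTime)
        {
            d.Group.alpha = 1f - t / fadeSeconds;
            yield return null;
        }
        if (d.Fading) { Destroy(d.Root); d.Root = null; }          // remove after fading completes
    }
}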

5 Evaluation and Results 

 


Fig. 1. Screen capture of the application performing character recognition.


5.1 Performance Evaluation 


We conducted a comprehensive evaluation of the system's performance across several dimensions: 

1. OCR accuracy: The Tesseract implementation achieved an average character recognition accuracy of 92% under normal lighting conditions, with performance dropping to 78% in challenging conditions (low light, glare, or extreme angles). 

2. Translation quality: When compared to professional human translations, the ChatGPT-based translations achieved an average quality score of 4.2 out of 5 as rated by bilingual evaluators across eight language pairs. 

3. System latency: The end-to-end latency from image capture to translated text display averaged 1.2 seconds under optimal conditions, with a maximum of 2.5 seconds for complex text in challenging environments. 

4. Battery impact: The continuous use of the translation system resulted in approximately 25% higher battery consumption on the connected device compared to standard AR applications. 


5.2 User Study 


We conducted a user study with 24 participants across different age groups and technical backgrounds to evaluate the usability and effectiveness of the system. Participants were asked to complete a series of tasks involving menu translation, signage interpretation, and document reading in unfamiliar languages. 

Key findings from the user study include: 


1. Usability: 87% of participants rated the system as "easy" or "very easy" to use, with the head-mounted display being particularly appreciated for its hands-free operation. 

2. Effectiveness: Participants successfully completed 93% of the assigned tasks, with an average task completion time of 35 seconds. 

3. User satisfaction: The overall satisfaction score was 4.3 out of 5, with particular appreciation for the real-time nature of the translations and the spatial positioning of the translated text. 

4. Improvement areas: Participants identified opportunities for improvement in handling text on curved surfaces and reducing the system's sensitivity to lighting conditions. 


5.3 Limitations 


Despite the promising results, several limitations were identified: 

1.  Text detection on complex backgrounds or with stylized fonts remains challenging. 

2.  The system's performance degrades in extremely low-light conditions. 

3.  The current implementation requires a consistent internet connection for API access.

4.  Battery life remains a concern for extended usage scenarios.

5. Processing complex documents with multiple columns or mixed languages presents challenges for the current implementation. 


6 Conclusion and Future Work 


This paper has presented a novel system for real-time text translation using AR glasses, combining Tesseract OCR technology with ChatGPT's advanced translation capabilities within a Unity-based application. The system demonstrates the potential for wearable AR technology to overcome language barriers in real-world settings. The evaluation results show that the proposed system achieves practical levels of accuracy and performance for everyday use cases, with high user satisfaction ratings. The combination of OCR and large language models proves particularly effective for handling a wide range of languages and text types. 


Future work will focus on: 

1.  Improving OCR performance in challenging lighting conditions through advanced image preprocessing techniques. 

2.  Implementing on-device translation capabilities for offline use. 

3.  Enhancing the spatial understanding of text layout to better handle complex documents. 

4.  Expanding the system to support additional modalities, such as speech-to-text and text-to-speech for comprehensive communication support. 

5.  Exploring the potential for edge computing to reduce latency and improve battery performance. 


The proposed system represents a significant step toward breaking down language barriers through wearable technology, with potential applications in tourism, education, international business, and cross-cultural communication. As AR glasses continue to evolve toward more lightweight and powerful form factors, the potential for such translation systems to become ubiquitous tools for global communication grows increasingly promising. 





 
 
 
