MagicExplorer: An Automated Browser Agent System with Cangjie Magic Framework
As interns at Huawei, we embarked on an ambitious journey to create MagicExplorer, an intelligent automated browser agent system that could navigate and interact with web applications autonomously. This project became a testament to iterative design, architectural evolution, and the powerful capabilities of Huawei’s Cangjie Magic framework.
The Evolution of Architecture: From Complexity to Elegance
Phase 1: The FFI Foundation
Our initial approach leveraged the Foreign Function Interface (FFI) to execute Python code from within our Cangjie environment. This allowed us to harness the robust Python Playwright package through carefully crafted wrapper functions. Playwright provided the essential DOM manipulation capabilities - clicking elements, scraping content, and basic navigation.
The first iteration was remarkably simple: we would prompt our agent to identify which button to click, manually feed this information back to Playwright, and observe the results. While this proved the concept, the limitations were immediately apparent. The system lacked flexibility for dynamic actions like scrolling or adaptive navigation, creating a rigid interaction model that couldn’t handle real-world web complexity.
Phase 2: Tool-Based Single Agent
Recognizing these constraints, we evolved to a tool-based architecture where our Cangjie agent had direct access to click and navigation tools. This version represented a significant leap forward - we could prompt the system with a goal once and watch it autonomously navigate toward completion.
However, new challenges emerged. The agent struggled with memory retention, often repeating previous actions in an endless loop, expecting different results. Goal verification became problematic, with the system unable to accurately determine task completion. Additionally, the FFI layer introduced significant performance bottlenecks, making the entire system frustratingly slow.
Phase 3: Multi-Agent Architecture
Faced with these limitations, we returned to first principles. Our analysis revealed that overwhelming a single agent with multiple responsibilities was counterproductive. We designed a two-agent system:
- The Planner: Responsible for breaking down goals into actionable steps
- The Executor: Focused on implementing the planner’s instructions
Simultaneously, we made a crucial technological shift. We rewrote our Playwright functions in JavaScript and integrated them via the Model Context Protocol (MCP). This is where Cangjie Magic truly began to shine - the framework interacted seamlessly with our JavaScript tools, dramatically improving performance and flexibility compared with the FFI layer.
Phase 4: The Vision Experiment
Despite improvements, we encountered a fundamental problem: the planner lacked contextual awareness of the current webpage state. To address this, our team split into three groups, each exploring different architectural approaches:
Vision-Only Approach: This system continuously captured browser screenshots, overlaying highlighted boxes around clickable elements with indexed labels. While innovative, dense websites created overlapping boxes and visual clutter that obscured important information, making this approach impractical as a primary solution.
Enhanced Multi-Agent System: We refined the planner-executor communication protocols and attempted to provide contextual feedback. However, the fundamental context gap persisted, leading to execution failures and insufficient attention to goal achievement details.
Single Agent with Memory: This approach would ultimately prove transformative.
The Breakthrough: Single Agent with Enhanced Memory
Our final architecture represented a return to simplicity, but with sophisticated memory management. The single agent now maintained comprehensive memory of previous actions, enabling it to:
- Understand previously attempted steps
- Identify completed objectives
- Make accurate decisions about goal achievement
- Avoid repetitive action loops
The memory manager keeps an ordered list of objects whose classes extend Action:
import std.collection.ArrayList

public class MemoryManager <: ToString {
    // Every action taken so far, in order.
    var stepsCompleted: ArrayList<Action>
    // Cached text for the steps that toString() has already formatted.
    var manualMemory: String
    // Index of the first step not yet appended to manualMemory.
    var end: Int64

    public MemoryManager() {
        stepsCompleted = ArrayList<Action>([])
        manualMemory = ""
        end = 0
    }

    public func addClick(resultingUrl: String, itemClicked: String, inputText: String, result: Bool) {
        stepsCompleted.add(Click(resultingUrl, itemClicked, inputText, result))
    }

    public func addURLNavigator(resultingUrl: String) {
        stepsCompleted.add(UrlNavigation(resultingUrl))
    }

    public func addBack(resultingUrl: String) {
        stepsCompleted.add(Back(resultingUrl))
    }

    public func addFileUpload(nameOfTheFileUploaded: String) {
        stepsCompleted.add(FileUpload(nameOfTheFileUploaded))
    }

    public func addExtraAgent(subGoal: String, executionResult: String) {
        stepsCompleted.add(ExtraAgent(subGoal, executionResult))
    }

    public func addScreenshotterCall(response: String) {
        stepsCompleted.add(ScreenshotterCall(response))
    }

    public func addCaptchaInteruption() {
        stepsCompleted.add(CaptchaInteruption())
    }

    // Formats only the steps added since the previous call and appends them
    // to the cached text, so earlier steps are never re-rendered.
    public func toString(): String {
        for (ind in Range(end, Int64(stepsCompleted.size - 1), 1, true, true, true)) {
            let action = stepsCompleted[ind]
            manualMemory += "Operation number ${ind}: ${action}\n"
        }
        end = stepsCompleted.size
        return "List of previous completed steps: \n ${manualMemory}"
    }

    // Returns a description of the most recent step.
    public func back(): String {
        if (stepsCompleted.isEmpty()) {
            return "No previous steps"
        } else {
            return stepsCompleted[stepsCompleted.size - 1].toString()
        }
    }
}
The Action base class, which every action type extends:
open class Action <: ToString {
    public open func toString(): String {
        return "blank"
    }
}
An example subclass of Action, the Click class:
class Click <: Action {
    public Click(let resultingUrl: String, let itemClicked: String, let inputText: String, let result: Bool) {}

    // Description used when the click (and optional text entry) succeeded.
    public func success(): String {
        var formatted: String = "Clicked: ${itemClicked}"
        if (!(inputText.isEmpty())) { formatted += " and filled with text `${inputText}`" }
        formatted += ". Resulting URL ${resultingUrl}."
        return formatted
    }

    // Description used when the action failed.
    public func fail(): String {
        let formatted: String = "Failed To click and fill: ${itemClicked}, item is not a fillable field, attempted to fill with ${inputText}"
        return formatted
    }

    public func toString(): String {
        if (result) {
            return this.success()
        } else {
            return this.fail()
        }
    }
}
Cangjie Magic: The Game Changer
This is where Cangjie Magic demonstrated its exceptional capabilities:
Structured Response Control: Cangjie Magic enabled us to query agents and enforce JSON-formatted responses, allowing us to encode goal completion status as boolean values while capturing the agent’s decision-making rationale.
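To make the idea of a structured verdict concrete, here is a minimal sketch of validating such a response before acting on it. It is written in plain JavaScript rather than with the Cangjie Magic API, and the field names `goalCompleted` and `reasoning` are hypothetical, chosen only for illustration.

```javascript
// Hypothetical verdict shape: { "goalCompleted": boolean, "reasoning": string }.
// Returns { ok: true, ... } for a well-formed verdict, { ok: false, error } otherwise.
function parseGoalVerdict(rawResponse) {
  let parsed;
  try {
    parsed = JSON.parse(rawResponse);
  } catch (err) {
    return { ok: false, error: `Response is not valid JSON: ${err.message}` };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, error: "Response must be a JSON object" };
  }
  if (typeof parsed.goalCompleted !== "boolean") {
    return { ok: false, error: "goalCompleted must be a boolean" };
  }
  if (typeof parsed.reasoning !== "string") {
    return { ok: false, error: "reasoning must be a string" };
  }
  return { ok: true, goalCompleted: parsed.goalCompleted, reasoning: parsed.reasoning };
}
```

Rejecting anything that is not a strict boolean keeps a loosely worded answer such as "yes, mostly" from being treated as task completion.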
Tool Management: The framework provided granular control over which MCP server tools were available to each agent through Cangjie wrapper tools. This architectural pattern proved crucial for our memory system implementation.
Memory Integration: By creating Cangjie Magic wrapper tools that mirrored our MCP server tools, we could update our memory class after every interaction, keeping the recorded history in step with what had actually happened in the browser.
Terminal Actions: Perhaps most importantly, Cangjie Magic’s ability to convert tools into terminal actions was revolutionary. This feature ensured that after each execution, the agent would exit the current request, enabling our incremental, memory-driven architecture to function correctly.
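The interplay between wrapper tools, memory updates, and terminal actions can be sketched roughly as follows. This is a plain JavaScript stand-in, not Cangjie Magic code: `StepLog`, `wrapTool`, `runUntilDone`, and the `decide` callback (which stands in for one agent request) are all hypothetical names.

```javascript
// Minimal stand-in for the memory class: an ordered log of step descriptions.
class StepLog {
  constructor() { this.steps = []; }
  add(description) { this.steps.push(description); }
  toString() {
    return this.steps.map((s, i) => `Operation number ${i}: ${s}`).join("\n");
  }
}

// Wrap a tool so every call is recorded, whether it succeeds or fails.
function wrapTool(name, tool, log) {
  return (args) => {
    let outcome;
    try {
      outcome = { ok: true, value: tool(args) };
    } catch (err) {
      outcome = { ok: false, value: err.message };
    }
    const status = outcome.ok ? "ok" : "failed";
    log.add(`${name}(${JSON.stringify(args)}) -> ${status}: ${outcome.value}`);
    return outcome;
  };
}

// One tool call per request: after each call the request ends, and the next
// request starts from the updated log. `decide` stands in for the agent.
function runUntilDone(decide, tools, log, maxRequests = 10) {
  for (let i = 0; i < maxRequests; i++) {
    const decision = decide(log.toString());
    if (decision.done) return true;
    tools[decision.tool](decision.args); // treated as terminal: loop back for a new request
  }
  return false; // gave up after maxRequests
}
```

Because the log is rebuilt into the prompt on every request, the agent never has to remember earlier steps on its own; the `maxRequests` cap is an added safeguard in this sketch against the endless loops described in Phase 2.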
JavaScript Integration: Powering Real-Time Browser Interaction
While the Cangjie Magic framework formed the architectural backbone of MagicExplorer, our JavaScript interaction layer provided the essential capabilities that made autonomous browser control possible.
What the JavaScript Does
Our JavaScript code acts as the hands and eyes of the AI agent in the browser. It allows us to:
- Inject JS into web pages for scraping DOM elements
- Click buttons, fill out forms, select from dropdowns
- Highlight elements (for the vision agent)
- Handle tabs and solve simple CAPTCHAs
We built this system using Playwright, which allowed us to programmatically control and inject JavaScript into websites in a way that mimics user interaction.
Architecture: MCP Server and Tool Abstraction
The browser interaction logic is abstracted through an MCP (Model Context Protocol) server. The idea is simple but powerful: we expose JavaScript functions as “tools” with descriptive metadata. The AI agent doesn’t need to know how they work internally - it just receives a list of available tools, their descriptions, and how to call them.
The AI sends a JSON request to the MCP server, specifying which tool to call and with what arguments. The server then triggers the correct JS function and returns the result, again as JSON. This modular abstraction allowed the backend and AI agent to evolve independently.
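As a rough illustration of this request/response cycle, the sketch below dispatches a JSON request to a named tool. It deliberately uses a simplified envelope (`{ "tool", "arguments" }`) and does not use an MCP SDK; the actual protocol exchanges JSON-RPC 2.0 messages such as `tools/list` and `tools/call`. The tool name, description, and stubbed handler are assumptions.

```javascript
// Hypothetical registry: tool name -> { description, handler }.
const tools = new Map([
  ["click", {
    description: "Click the element with the given index",
    // Stub: a real handler would drive the browser through Playwright.
    handler: async ({ index }) => `clicked element ${index}`,
  }],
]);

// What the agent is shown: names and descriptions, not implementations.
function listTools() {
  return [...tools].map(([name, { description }]) => ({ name, description }));
}

// Handle one request of the simplified form { "tool": string, "arguments": object }.
// Handler errors are not caught here; this sketch only covers routing.
async function handleRequest(rawRequest) {
  let request;
  try {
    request = JSON.parse(rawRequest);
  } catch {
    return JSON.stringify({ ok: false, error: "Malformed JSON request" });
  }
  const entry = tools.get(request?.tool);
  if (!entry) {
    return JSON.stringify({ ok: false, error: `Unknown tool: ${request?.tool}` });
  }
  const result = await entry.handler(request.arguments ?? {});
  return JSON.stringify({ ok: true, result });
}
```

Adding a capability then amounts to adding one registry entry; the agent discovers it through `listTools()` without any change on its side.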
Development Journey: From Monolith to Modular
Initially, everything - scraping logic, Playwright control, and tool definitions - was lumped into a single file. As the project grew, this quickly became unmanageable.
We evolved the structure into modular components, each encapsulating a specific capability.
Each folder (e.g., clickingTools, scrapingTools, visionTools) contains logically grouped functionality. At the center is browserManager.js, which holds instances of all tool classes and binds tool names to function references for the MCP server.
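A minimal sketch of how a central manager might hold tool-class instances and bind names to methods is shown below. The class and method names are hypothetical and the tool bodies are stubs; this is not the contents of the actual browserManager.js.

```javascript
// Hypothetical tool classes; real ones would wrap Playwright calls.
class ClickingTools {
  constructor(state) { this.state = state; }
  click({ index }) {
    this.state.lastAction = `click:${index}`;
    return this.state.lastAction;
  }
}

class ScrapingTools {
  constructor(state) { this.state = state; }
  scrape() {
    return `scraped after ${this.state.lastAction ?? "nothing"}`;
  }
}

class BrowserManager {
  constructor() {
    const state = {}; // shared browser state (page handle, open tabs, ...) would live here
    this.clicking = new ClickingTools(state);
    this.scraping = new ScrapingTools(state);
    // bind() keeps `this` pointing at the owning instance when the
    // function is later looked up and called by name.
    this.toolTable = new Map([
      ["click", this.clicking.click.bind(this.clicking)],
      ["scrape", this.scraping.scrape.bind(this.scraping)],
    ]);
  }

  call(name, args = {}) {
    const fn = this.toolTable.get(name);
    if (!fn) throw new Error(`Unknown tool: ${name}`);
    return fn(args);
  }
}
```

Without `bind`, a method pulled out of its instance and stored in the table would lose its `this`, which is an easy mistake to make when splitting a monolithic file into classes.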
This architecture made the system:
- Easier to maintain
- Simpler to debug
- Straightforward to extend
Scraping: The Balancing Act
Scraping turned out to be the trickiest part. At first, we tried scraping everything on a page - but this overwhelmed the agent with irrelevant elements. So we narrowed the scope to just buttons, search bars, and links, giving the agent just enough to reason over.
Later, we refined our approach to filter visible and semantically relevant elements, using CSS selectors and heuristics. We avoided generic <div> elements unless they had meaningful roles. This reduced noise and improved performance dramatically.
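The filtering step can be illustrated with a small sketch that works on plain element descriptors (assumed to have been collected inside the page beforehand). The descriptor fields and the specific tag and role lists are assumptions for illustration, not the selectors used in the project.

```javascript
// Tags that are interactive by nature.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);
// ARIA roles that make an otherwise generic element (such as a <div>) worth keeping.
const MEANINGFUL_ROLES = new Set([
  "button", "link", "searchbox", "textbox", "checkbox", "menuitem", "tab",
]);

// Descriptor shape (hypothetical): { tag, role?, visible, text }.
function isRelevant(el) {
  if (!el.visible) return false;                 // hidden elements are noise
  if (INTERACTIVE_TAGS.has(el.tag)) return true; // native controls always pass
  return el.role !== undefined && MEANINGFUL_ROLES.has(el.role);
}

// Keep relevant elements and give each a compact index the agent can refer to.
function filterElements(elements) {
  return elements.filter(isRelevant).map((el, index) => ({
    index,
    tag: el.tag,
    role: el.role ?? null,
    text: el.text.trim(),
  }));
}
```

Indexing after filtering keeps the numbers the agent sees small and contiguous, at the cost of indices changing whenever the page changes.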
Handling the Real Web: Tabs, CAPTCHAs, and Failures
Web automation isn’t just about clicks - real websites introduce complications:
- Tabs: Some buttons open links in new tabs. We had to handle tab switching explicitly using Playwright’s APIs.
- CAPTCHAs: We integrated a CAPTCHA handler that detects image-based CAPTCHAs and sends them to our vision agent for decoding.
- Iframes: These remain a limitation. Our JS injection architecture cannot reach into cross-origin iframes, so websites like Google Docs don’t work.
Whenever an action fails - whether due to a missing element or an unexpected structure - we catch the error and return it to the agent. This allows it to replan or retry intelligently.
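One way to express this catch-and-report pattern is a small wrapper that turns any thrown error into a structured result. The sketch below is an assumption about how this could look; `runAction` and the result fields are hypothetical names.

```javascript
// Run one browser action and always resolve with a structured result,
// so a failure reaches the agent as data instead of crashing the server.
async function runAction(name, action) {
  try {
    const result = await action();
    return { tool: name, ok: true, result };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { tool: name, ok: false, error: message };
  }
}
```

For example, `await runAction("click", () => page.click("#missing"))` with a Playwright `page` would resolve to an `ok: false` result carrying the timeout message, which the agent can read on its next step.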
Simplicity Over Realism
Unlike some automation platforms, we didn’t aim to mimic human behavior (like mouse movements or slow typing). Instead, we focused on functional control, simulating clicks and inputs directly using Playwright.
What We’re Proud Of
One of the most satisfying outcomes was making scraping actually useful. Initially, we feared the agent would always be flooded with junk. But by iterating on our selectors and filtering logic, we achieved highly accurate, semantically meaningful extractions.
Another highlight was the CAPTCHA solver integration and our ability to handle real-world browsing tasks like account registration, email verification, and even file uploads - all programmatically.
Reflections
If we could go back, we’d start with this modular architecture from day one. But hindsight is 20/20. We learned a lot from evolving the system organically, identifying pain points, and improving based on real use cases.
Right now, our biggest areas for future improvement include:
- Better support for scrolling and lazy-loaded content
- Handling cross-origin iframes
Final Thoughts
Our JavaScript layer abstracts complex browser interactions into simple, AI-callable tools. The result is a powerful agent that can interpret a goal, plan steps, and take real actions inside the browser - all thanks to a robust and well-designed JavaScript interaction layer.
Advanced Features and Hybrid Intelligence
Our final system incorporated sophisticated capabilities:
- File Upload Support: Seamless handling of file uploads across different websites
- Tab Management: Dynamic tab switching and spawning of new browser instances
- Subtask Delegation: Agents could create specialized instances for specific subgoals (particularly valuable for handling confirmation codes during authentication flows)
Vision Agent Integration
We repurposed our earlier vision system as an intelligent fallback mechanism. When semantic information from scraped elements proves insufficient, the executor can consult the vision agent for guidance. This hybrid approach provides several advantages:
Dense Information Processing: The vision system can process more comprehensive webpage information without token limitations
Element Discovery: Ability to interact with elements that were filtered out during optimization
Adaptability: Enhanced performance on poorly designed websites without sacrificing efficiency on standard sites
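The fallback idea can be sketched as below. The rule for when scraped information counts as "insufficient" (zero or several text matches) is purely an assumption made for this example, as are the `pickTarget` name and the `askVision` callback standing in for the vision agent; in the described system the executor itself decides when to consult the vision agent.

```javascript
// Try to resolve a target from scraped elements first; defer to the
// vision agent only when the scrape gives no single clear answer.
function pickTarget(query, scrapedElements, askVision) {
  const needle = query.toLowerCase();
  const matches = scrapedElements.filter((el) =>
    el.text.toLowerCase().includes(needle)
  );
  if (matches.length === 1) {
    return { source: "scrape", target: matches[0] };
  }
  // Zero matches (possibly filtered out) or an ambiguous result.
  return { source: "vision", target: askVision(query) };
}
```

Keeping the cheap scraped lookup on the common path means the screenshot-based route is only paid for on pages where it is actually needed.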
Results and Impact
MagicExplorer successfully demonstrates autonomous web navigation across a diverse range of tasks. The system can handle complex workflows including:
- Multi-step authentication processes
- Form submissions and file uploads
- Cross-tab navigation and coordination
- Dynamic content interaction
- Goal verification and completion confirmation
Lessons Learned
This project reinforced several key principles:
- Simplicity Over Complexity: Our most successful architecture was ultimately the simplest, enhanced with sophisticated state management
- Framework Integration: Cangjie Magic’s capabilities were essential to our success, providing the precise control needed for complex agent orchestration
- Iterative Design: Each architectural iteration taught us valuable lessons that informed our final solution
- Hybrid Intelligence: Combining different AI approaches (semantic and visual) created a more robust and adaptable system
Future Directions
While MagicExplorer successfully completes a wide variety of tasks, we acknowledge that certain web pages and interaction patterns remain challenging. Our extensible architecture provides a solid foundation for continuous improvement through:
- Enhanced tool integration
- Improved error handling and recovery
- Advanced visual processing capabilities
- Broader website compatibility testing
Conclusion
This project has shown that building successful AI agents demands more than just powerful models - it also requires robust frameworks that enable precise control and seamless integration.
Our evolution from an initial FFI-based prototype to a hybrid architecture highlights the value of iterative development, architectural adaptability, and a strong tooling foundation. As we continue to enhance MagicExplorer’s capabilities, we’re excited by the opportunities that lie ahead.