Technology
Overview
The Tango binary translator consists of several components:
Dynamic translator - The dynamic translator is the core component of Tango. It wraps a 32-bit ARM process and runs it as a 64-bit process by translating all of its instructions into AArch64 code.
Pre-translator - The pre-translator performs off-line translation of executable files to generate persistent caches of translated code. This enables improved application start-up times and reduced memory usage.
OS integration - Finally, OS integration allows Tango to provide a seamless user experience. Tango has OS integrations for both Android and GNU/Linux.
Dynamic Translator
The dynamic translator has two main roles:
ISA translation, which involves translating AArch32 (ARMv7 32-bit) instructions into AArch64 (ARMv8 64-bit) instructions.
System emulation, which involves emulating the 32-bit Linux ABI to support the translated program.
ISA Translation
Tango supports the full 32-bit ARM and Thumb instruction sets up to ARMv8. This includes full support for floating-point (VFP) and SIMD (NEON) instructions.
Tango translates sequences of AArch32 instructions into AArch64 code fragments. These fragments are stored in a code cache which is shared among all threads in the process. The translator performs many optimizations on fragments, for example:
Branch linking: Direct branch instructions are translated into direct branches to the target fragment, so execution stays inside the code cache.
Indirect branch optimization: Indirect branches are compiled down to a fast inline hash table lookup.
Return prediction: Function return branches are predicted using a return address stack.
Register allocation: Floating-point registers are dynamically allocated, and these allocations are preserved across fragments to reduce spilling.
Dead code elimination: Redundant instructions are aggressively eliminated if their results are not used.
Speculative address generation: Address calculations in load/store instructions are speculatively assumed to not overflow, which improves performance.
Constant inlining: PC-relative constant loads are inlined into fragments to avoid complex address calculations.
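Two of the optimizations above, the indirect branch lookup and return prediction, can be illustrated with a small model. This is a minimal sketch and not Tango's implementation: the class names `CodeCache` and `ReturnAddressStack`, the string fragment identifiers, and the stack capacity of 16 are all assumptions made for illustration, and a Python dictionary stands in for the inline hash table.

```python
from typing import Dict, List


class CodeCache:
    """Hypothetical model: maps source (AArch32) addresses to translated fragments."""

    def __init__(self) -> None:
        self._fragments: Dict[int, str] = {}
        self.translations = 0  # counts how often the slow path was taken

    def lookup(self, source_pc: int) -> str:
        """Resolve an indirect branch target with a hash table lookup."""
        fragment = self._fragments.get(source_pc)
        if fragment is None:
            # Miss: fall back to the translator, then cache the result.
            fragment = f"fragment@{source_pc:#x}"
            self._fragments[source_pc] = fragment
            self.translations += 1
        return fragment


class ReturnAddressStack:
    """Hypothetical model: predicts function return targets."""

    def __init__(self, capacity: int = 16) -> None:
        self._entries: List[int] = []
        self._capacity = capacity

    def push(self, return_pc: int) -> None:
        """Record the return address when a call is executed."""
        if len(self._entries) == self._capacity:
            self._entries.pop(0)  # drop the oldest entry when full
        self._entries.append(return_pc)

    def predict(self, actual_pc: int) -> bool:
        """Pop the predicted address; True if it matches the actual return target."""
        predicted = self._entries.pop() if self._entries else None
        return predicted == actual_pc
```

In this model a correct prediction lets a return skip the hash lookup entirely, while a misprediction would fall back to `CodeCache.lookup`.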
System Emulation
The system emulation simulates a 32-bit Linux environment for the translated process and hides the fact that it is running in a 64-bit process. This has several aspects:
Address space management: The 32-bit process lives in the lower 4GB of the translator's 64-bit address space. All memory management operations performed by the translated process (mmap, mprotect, etc.) apply only to this region. The translator also keeps track of which memory regions have execute permission and refuses to translate code from non-executable memory.
Initial executable loading: On startup, the translator will load the ELF binary passed to it and set up an initial stack for the translated process. This also includes loading the 32-bit dynamic linker if necessary.
System call translation: All 32-bit system calls issued by the translated process must be emulated using 64-bit system calls. In some cases additional support from the kernel is required, for which Tango uses a special kernel module. Certain system calls also require special handling in the translator, such as those related to signal handling, address space management and thread/process creation.
Signal handling: The translator can receive three types of signals from the OS, each of which is handled differently:
Internal signals: These signals are used internally by the translator and are hidden from the translated process.
Synchronous signals: These signals are caused by a specific instruction (typically load/store) in the translated code. The signal context is mapped back to the corresponding source 32-bit instruction and the translated signal handler is invoked.
Asynchronous signals: These signals are caused by an external event (for example Ctrl-C). The translator will resume execution of the interrupted fragment until a safe point is reached (typically an exit from the current fragment), after which it will invoke the translated signal handler.
/proc emulation: The translator intercepts accesses to /proc by the translated application to make the contents of this filesystem appear the same as if the application was running natively.
Ptrace emulation: 32-bit debuggers and tools such as strace use this API to inspect the state of another process. Tango fully emulates this functionality so that these tools can observe the 32-bit state of a translated program.
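The address space rules described above (mappings confined to the lower 4GB, and no translation from non-executable memory) can be modelled in a few lines. This is a simplified sketch under stated assumptions: the `GuestAddressSpace` class and its methods are hypothetical names, regions are kept in a plain list, and a later mapping simply overrides an earlier one, which stands in for both mmap and mprotect. The `PROT_*` values match those in Linux's `<sys/mman.h>`.

```python
from typing import List, Tuple

GUEST_LIMIT = 1 << 32  # the 32-bit process lives in the lower 4 GB

# Protection bits, same values as Linux <sys/mman.h>.
PROT_READ, PROT_WRITE, PROT_EXEC = 1, 2, 4


class GuestAddressSpace:
    """Hypothetical tracker for the translated process's memory regions."""

    def __init__(self) -> None:
        self._regions: List[Tuple[int, int, int]] = []  # (start, end, prot)

    def map(self, start: int, length: int, prot: int) -> None:
        """Record a mapping, rejecting anything outside the guest region."""
        end = start + length
        if start < 0 or length <= 0 or end > GUEST_LIMIT:
            raise ValueError("mapping falls outside the 32-bit guest region")
        self._regions.append((start, end, prot))

    def can_translate(self, pc: int) -> bool:
        """Allow translation only from memory mapped with execute permission."""
        # Search newest first so later mappings override earlier ones.
        for start, end, prot in reversed(self._regions):
            if start <= pc < end:
                return bool(prot & PROT_EXEC)
        return False  # unmapped addresses are never translated
```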
Pre-Translator
The Tango pre-translator takes a 32-bit ARM executable or shared library and produces a persistent cache file for it. These cache files are then loaded by the dynamic translator when the relevant executable or shared library is used in a translated 32-bit process.
Using pre-translated cache files significantly improves the start-up time of new processes since little to no dynamic translation is required. Additionally, the pre-translator can usually produce higher-quality code than the dynamic translator since it has a global view of all code fragments in the binary file.
Pre-translation works in four stages:
Code discovery: The ELF data structures are scanned to find function entry points. The pre-translator then recursively scans the code at those entry points by following branch instructions to discover all the fragments in the input file.
Control flow analysis: A control flow graph is built from all the discovered fragments, which is used to perform cross-fragment liveness analysis and provide hints for cross-fragment register allocation.
Code translation: The actual fragments are then translated into AArch64 instructions, using the same set of optimizations as the dynamic translator.
Cache file generation: Finally, all the translated fragments are collected, processed into a compact format and written to the disk.
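The code discovery stage is essentially a graph traversal. The sketch below shows one way such a scan could work, using an explicit worklist rather than recursion. It is illustrative only: `discover_fragments` is a hypothetical name, and the `branch_targets` dictionary stands in for real instruction decoding, which is assumed to report the addresses that each fragment's branches can reach.

```python
from typing import Dict, Iterable, List, Set


def discover_fragments(
    entry_points: Iterable[int],
    branch_targets: Dict[int, List[int]],
) -> Set[int]:
    """Return every fragment address reachable from the given entry points.

    branch_targets is a stand-in for decoding: it maps a fragment's start
    address to the addresses its branch instructions can reach.
    """
    discovered: Set[int] = set()
    worklist = list(entry_points)
    while worklist:
        address = worklist.pop()
        if address in discovered:
            continue  # already scanned, which also handles loops
        discovered.add(address)
        worklist.extend(branch_targets.get(address, []))
    return discovered
```

The resulting set of fragments, together with the edges in `branch_targets`, is the kind of input a control flow analysis stage would consume.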
OS Integration
Android
System image integration: Tango is directly integrated into the Android system image so that all 32-bit ARM applications are automatically translated.
System image pre-translation: All 32-bit system libraries on the system partitions (/system, /vendor, etc.) are pre-translated and have their pre-translated caches included as part of the system image.
On-device pre-translation: Tango support is integrated into the Android package manager so that pre-translation of 32-bit native libraries is automatically done when an application is installed.
GKI compatibility: Tango's kernel module is compatible with Google's Generic Kernel Image.
GNU/Linux
Automatic invocation: Using the Linux binfmt_misc subsystem, Tango will automatically be launched whenever a 32-bit ARM program is executed. This makes Tango completely seamless to users: executing 32-bit ARM binaries just works!
Package manager integration: Tango can be integrated with the OS package manager so that the pre-translator is automatically invoked when an AArch32 package is installed or updated.
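Automatic invocation depends on recognizing 32-bit ARM executables from their ELF header. binfmt_misc does this in the kernel with a magic/mask byte pattern; the Python sketch below mirrors the same check for illustration. The function name `is_aarch32_elf` is hypothetical, and restricting the check to little-endian files is an assumption of this example. The field offsets and constants (`ELFCLASS32`, `ELFDATA2LSB`, `EM_ARM`) come from the ELF specification.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELFCLASS32 = 1   # e_ident[EI_CLASS]: 32-bit objects
ELFDATA2LSB = 1  # e_ident[EI_DATA]: little-endian
EM_ARM = 40      # e_machine value for 32-bit ARM


def is_aarch32_elf(header: bytes) -> bool:
    """Return True if the header bytes describe a little-endian 32-bit ARM ELF file."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return False
    if header[4] != ELFCLASS32 or header[5] != ELFDATA2LSB:
        return False
    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes)
    # and e_type (2 bytes).
    (machine,) = struct.unpack_from("<H", header, 18)
    return machine == EM_ARM
```

To try it on a real file, read the first 20 bytes in binary mode and pass them to `is_aarch32_elf`.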
Further Reading
The Tango binary translation technology is based on the PhD research of Amanieu d'Antras, from the University of Manchester. The following publications describe Tango's predecessor, called MAMBO-X64, in more detail:
Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, and Mikel Luján. 2017. Low overhead dynamic binary translation on ARM. SIGPLAN Not. 52, 6 (June 2017), 333-346.
Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, and Mikel Luján. 2016. Optimizing Indirect Branches in Dynamic Binary Translators. ACM Trans. Archit. Code Optim. 13, 1, Article 7 (April 2016), 25 pages.
Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, John Goodacre, and Mikel Luján. 2017. HyperMAMBO-X64: Using Virtualization to Support High-Performance Transparent Binary Translation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 228-241.