The Tango binary translator consists of several components:
- Dynamic translator - The dynamic translator is the core component of Tango. It wraps a 32-bit ARM process and runs it as a 64-bit process by translating all of its instructions into AArch64 code.
- Pre-translator - The pre-translator performs off-line translation of executable files to generate persistent caches of translated code. This enables improved application start-up times and reduced memory usage.
- OS integration - Finally, OS integration allows Tango to provide a seamless user experience. This includes kernel modifications to support binary translation, automatic translation of 32-bit processes and pre-translator integration with the OS package manager.
The dynamic translator has two main roles:
- ISA translation, which involves translating AArch32 (ARMv7 32-bit) instructions into AArch64 (ARMv8 64-bit) instructions.
- System emulation, which involves emulating the 32-bit Linux ABI to support the translated program.
Tango supports the full 32-bit ARM and Thumb instruction sets up to ARMv8. This includes full support for floating-point (VFP) and SIMD (NEON) instructions.
Tango translates sequences of AArch32 instructions into AArch64 code fragments. These fragments are stored in a code cache which is shared among all threads in the process. The translator performs many optimizations on fragments, for example:
- Branch linking: direct branch instructions are translated
- Indirect branch optimization: Indirect branches are compiled down to a fast inline hash table lookup.
- Return prediction: Function return branches are predicted using a return address stack.
- Register allocation: Floating-point registers are dynamically allocated, and these allocations are preserved across fragments to reduce spilling.
- Dead code elimination: Redundant instructions are aggressively eliminated if their results are not used.
- Speculative address generation: Address calculations in load/store instructions are speculatively assumed to not overflow, which improves performance.
- Constant inlining: PC-relative constant loads are inlined into fragments to avoid complex address calculations.
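
As an illustration of the indirect branch optimization listed above, the following is a minimal sketch of a hash table that maps a 32-bit source branch target to an already-translated fragment. It is a Python model, not Tango's actual data structure: the table size, the hash function, the linear probing scheme, and the names `IndirectBranchTable`, `insert`, and `lookup` are all assumptions made for this example. In a real translator the lookup would be emitted inline as a few AArch64 instructions.

```python
TABLE_SIZE = 8  # assumed size; a power of two so the hash reduces to a mask


class IndirectBranchTable:
    """Toy open-addressing table: 32-bit source PC -> translated fragment."""

    def __init__(self):
        self.keys = [None] * TABLE_SIZE
        self.values = [None] * TABLE_SIZE

    @staticmethod
    def _slot(source_pc):
        # Skip bit 0 when hashing; the full address is still compared as the key.
        return (source_pc >> 1) & (TABLE_SIZE - 1)

    def insert(self, source_pc, fragment):
        slot = self._slot(source_pc)
        for _ in range(TABLE_SIZE):
            if self.keys[slot] is None or self.keys[slot] == source_pc:
                self.keys[slot] = source_pc
                self.values[slot] = fragment
                return
            slot = (slot + 1) & (TABLE_SIZE - 1)  # linear probing on collision
        raise RuntimeError("table full")

    def lookup(self, source_pc):
        slot = self._slot(source_pc)
        for _ in range(TABLE_SIZE):
            if self.keys[slot] is None:
                return None  # miss: the target has not been translated yet
            if self.keys[slot] == source_pc:
                return self.values[slot]
            slot = (slot + 1) & (TABLE_SIZE - 1)
        return None


table = IndirectBranchTable()
table.insert(0x8000, "fragment_a")
table.insert(0x8010, "fragment_b")  # hashes to the same slot as 0x8000
```

In this model, a hit returns the fragment to jump to, while a miss (`None`) stands for falling back to the translator so that the target can be translated and inserted.
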
The system emulation simulates a 32-bit Linux environment for the translated process and hides the fact that it is running in a 64-bit process. This has several aspects:
- Address space management: The 32-bit process lives in the lower 4GB of the translator's 64-bit address space. All memory management operations performed by the translated process (mmap, mprotect, etc.) apply only to this region. The translator also keeps track of which memory regions have execute permission and refuses to translate code from non-executable memory.
- Initial executable loading: On startup, the translator will load the ELF binary passed to it and set up an initial stack for the translated process. This also includes loading the 32-bit dynamic linker if necessary.
- System call translation: All 32-bit system calls issued by the translated process must be emulated using 64-bit system calls. For the majority of system calls, this is done through the kernel's built-in 32-bit compatibility layer. However, certain system calls require special handling in the translator, such as those related to signal handling, address space management and thread/process creation.
- Signal handling: The translator can receive three types of signals from the OS, each of which is handled differently:
  - Internal signals: These signals are used internally by the translator and are hidden from the translated process.
  - Synchronous signals: These signals are caused by a specific instruction (typically a load or store) in the translated code. The signal context is mapped back to the corresponding source 32-bit instruction and the translated signal handler is invoked.
  - Asynchronous signals: These signals are caused by an external event (for example Ctrl-C). The translator will resume execution of the interrupted fragment until a safe point is reached (typically an exit from the current fragment), after which it will invoke the translated signal handler.
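
The three signal paths described above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not Tango's implementation: the `SignalRouter` class, the `fragment_map` dictionary (translated PC to source PC), and the signal names used in the example are hypothetical, and real signal contexts carry far more state than a single program counter.

```python
from collections import deque

INTERNAL, SYNCHRONOUS, ASYNCHRONOUS = "internal", "synchronous", "asynchronous"


class SignalRouter:
    """Toy model of how a translator might route the three kinds of signal."""

    def __init__(self, fragment_map):
        # Hypothetical mapping: translated (64-bit) PC -> source (32-bit) PC.
        self.fragment_map = fragment_map
        self.pending = deque()  # asynchronous signals waiting for a safe point
        self.delivered = []     # (signal, source PC) pairs passed to the guest handler

    def on_signal(self, kind, name, translated_pc=None):
        if kind == INTERNAL:
            return  # consumed by the translator, never shown to the guest
        if kind == SYNCHRONOUS:
            # Map the faulting translated PC back to the source instruction.
            self.delivered.append((name, self.fragment_map[translated_pc]))
        else:
            # Defer delivery until the current fragment reaches a safe point.
            self.pending.append(name)

    def on_safe_point(self, source_pc):
        # Called at a fragment exit: deliver everything that was deferred.
        while self.pending:
            self.delivered.append((self.pending.popleft(), source_pc))


router = SignalRouter({0x70000010: 0x8004})
router.on_signal(ASYNCHRONOUS, "SIGINT")                          # queued
router.on_signal(SYNCHRONOUS, "SIGSEGV", translated_pc=0x70000010)  # delivered now
router.on_safe_point(0x8010)                                      # SIGINT delivered here
```

Deferring asynchronous signals to a fragment exit means the guest handler always sees a state that corresponds to a well-defined source instruction boundary.
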
The Tango pre-translator takes a 32-bit ARM executable or shared library and produces a persistent cache file for it. These cache files are then loaded by the dynamic translator when the relevant executable or shared library is used in a translated 32-bit process.
Using pre-translated cache files significantly improves the start-up time of new processes since little to no dynamic translation is required. Additionally, the pre-translator can usually produce higher-quality code than the dynamic translator since it has a global view of all code fragments in the binary file.
Pre-translation works in four stages:
- Code discovery: The ELF data structures are scanned to find function entry points. The pre-translator then recursively scans the code at those entry points by following branch instructions to discover all the fragments in the input file.
- Control flow analysis: A control flow graph is built from all the discovered fragments, which is used to perform cross-fragment liveness analysis and provide hints for cross-fragment register allocation.
- Code translation: The actual fragments are then translated into AArch64 instructions, using the same set of optimizations as the dynamic translator.
- Cache file generation: Finally, all the translated fragments are collected, processed into a compact format and written to the disk.
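
The code discovery stage above amounts to a reachability traversal over branch targets. A minimal sketch follows, assuming a toy `successors` mapping that stands in for actually decoding instructions; the function name `discover_fragments` and the addresses are made up for illustration. It uses an explicit worklist rather than recursion, which gives the same result without deep call stacks.

```python
def discover_fragments(entry_points, successors):
    """Return every fragment start address reachable from the entry points.

    `successors` maps a fragment start address to the statically known
    branch targets that leave that fragment (a stand-in for real decoding).
    """
    discovered = set()
    worklist = list(entry_points)
    while worklist:
        addr = worklist.pop()
        if addr in discovered:
            continue  # already scanned; this also terminates loops in the code
        discovered.add(addr)
        worklist.extend(successors.get(addr, ()))
    return discovered


# Toy input: two entry points taken from the ELF symbol table (assumed).
successors = {
    0x1000: [0x1010, 0x1020],  # conditional branch: taken and fall-through
    0x1010: [0x1020],
    0x1020: [0x1000],          # loop back to the function entry
    0x2000: [0x2008],
    0x3000: [],                # never referenced, so never discovered
}
fragments = discover_fragments([0x1000, 0x2000], successors)
```

The same `successors` edges are what a control flow graph for the second stage would be built from.
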
Tango integrates with the operating system in several ways:
- Kernel support: The Linux kernel is modified in several ways to better support binary translation. Modifications include:
  - 32-bit syscall interface: The 32-bit compatibility layer in the kernel is exposed to 64-bit processes, allowing the translator to execute a system call as if it came from a 32-bit process.
  - 32-bit sub-address space: 32-bit memory management system calls executed from a 64-bit process (using the previously mentioned functionality) will only affect the lower 4GB of the process address space.
  - Translator seccomp support: Support for 32-bit seccomp filters while still allowing the translator to issue 64-bit system calls for its internal use.
- Automatic invocation: Using the Linux binfmt_misc subsystem, Tango will automatically be launched whenever a 32-bit ARM program is executed. This makes Tango completely seamless to users: executing 32-bit ARM binaries just works!
- Package manager integration: Tango can be integrated with the OS package manager so that the pre-translator is automatically invoked when a package is installed or updated.
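
To make the automatic invocation mechanism more concrete, the sketch below models how a binfmt_misc magic/mask rule can recognise a little-endian 32-bit ARM ELF file, and builds a registration line in the `:name:type:offset:magic:mask:interpreter:flags` format that binfmt_misc accepts. The rule name `tango-arm` and the interpreter path `/usr/bin/tango` are hypothetical; the source does not state how Tango is actually registered.

```python
# First 20 bytes of a little-endian 32-bit ARM ELF header:
# e_ident (16 bytes), e_type = ET_EXEC (2), e_machine = EM_ARM (0x28).
ARM_ELF_MAGIC = bytes.fromhex("7f454c4601010100" "0000000000000000" "02002800")
# The mask ignores the OS ABI byte and lets e_type be ET_EXEC (2) or ET_DYN (3).
ARM_ELF_MASK = bytes.fromhex("ffffffffffffff00" "ffffffffffffffff" "feffffff")


def matches(header):
    """Model of the binfmt_misc check: masked header bytes must equal the magic."""
    if len(header) < len(ARM_ELF_MAGIC):
        return False
    return all(
        ((h ^ m) & k) == 0
        for h, m, k in zip(header, ARM_ELF_MAGIC, ARM_ELF_MASK)
    )


def registration_line(name, interpreter):
    """Build the text that would be written to the binfmt_misc register file."""
    def esc(data):
        return "".join(f"\\x{b:02x}" for b in data)

    return f":{name}:M::{esc(ARM_ELF_MAGIC)}:{esc(ARM_ELF_MASK)}:{interpreter}:"


# Hypothetical rule name and interpreter path, for illustration only.
line = registration_line("tango-arm", "/usr/bin/tango")
```

On a real system such a line would be written to `/proc/sys/fs/binfmt_misc/register` by a privileged process; the sketch only constructs the string and does not touch the kernel.
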
The Tango binary translation technology is based on the PhD research of Amanieu d'Antras, from the University of Manchester. The following publications describe Tango's predecessor, called MAMBO-X64, in more detail:
- Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, and Mikel Luján. 2017. Low overhead dynamic binary translation on ARM. SIGPLAN Not. 52, 6 (June 2017), 333-346.
- Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, and Mikel Luján. 2016. Optimizing Indirect Branches in Dynamic Binary Translators. ACM Trans. Archit. Code Optim. 13, 1, Article 7 (April 2016), 25 pages.
- Amanieu d'Antras, Cosmin Gorgovan, Jim Garside, John Goodacre, and Mikel Luján. 2017. HyperMAMBO-X64: Using Virtualization to Support High-Performance Transparent Binary Translation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 228-241.