x86-64 Instruction Encoding
We will check out the encoding of the instructions we authored in x86-64 assembly.
So far, we:
- Compiled a few lines of assembly, using GAS (GNU Assembler) in Linux.
- Switched to Intel syntax from AT&T with GAS, and talked about the differences.
- Figured out the encoding of the x86 instructions we authored in the 32 bit executable.
[ Check out all posts in “low-level” series here. ]
Here is the objdump
of the instructions in the 64 bit object file:
objdump -D --disassembler-options=intel _prog_g64.o
The relevant part of the output:
_prog_g64.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_start>:
0: 48 c7 c0 3c 00 00 00 mov rax,0x3c
7: 48 c7 c7 00 00 00 00 mov rdi,0x0
e: 0f 05 syscall
...
So these look completely different. But we can already see the same patterns emerge for the first 2 instructions.
probably related
to register code
%rax=0, %rdi=7
_/
48 c7 c0 3c 00 00 00 mov rax,0x3c
48 c7 c7 00 00 00 00 mov rdi,0x0
| ___________
| \
? second operand (imm32)
Focusing on the first instruction:
48
: 1st byte is the REX prefix. This value hasREX.W
flag set, which ensures the instruction uses 64 bit operands.C7
: 2nd byte seems to be the primary opcode of the MOV variant that we need.C0
: 3rd byte, in this particular instruction, is called theModR/M
byte. It exists in many instructions, and it has roles like, storing register code for an operand, defining addressing mode etc.
If you check out the layout of ModR/M, it is yet another byte that makes more sense when written in octal. Let’s rewrite that byte based on the information here:
_ opcode or register ?
/
/
ModR/M : 11 000 000
\
\_ register addressing
Regarding the question of opcode or register
, let’s checkout the instruction table of MOV (c7
) again. The table has a column called o
which is 0
for our entry. This is the “opcode extension” value. So those 3 bits are part of the opcode.
As you will remember from the previous post, we expect to see the register codes represented somewhere, each using 3 bits. In fact, it is slightly different in x86-64, because the number of general purpose registers are doubled. So we need 1 more bit to represent each register encoded in the instruction. It is the REX prefix that stores these extra bits.
For this instruction, the register is stored as a combination of least significant 3 bits of ModR/M (called ModR/M.rm
) and least significant bit of REX (called REX.B
). The register code for rax
is 0
.
_ C7 opcode extension
/ _ register code: %rax
/ /
ModR/M : 11 000 000
\
\_ register addressing
And in the case of second instruction, ModR/M byte (C7
) looks like this:
_ C7 opcode extension
/ _ register code: %rdi
/ /
ModR/M : 11 000 111
\
\_ register addressing
The last instruction is this:
0f 05 syscall
There is not much to decode in this one. This is a 2-byte opcode listed here as syscall
.
My understanding is, what these (REX, MomR/M, SIB) byte slots represents is instruction-dependent. And there are is also a newer VEX encoding in x86-64 architecture. But this is all documented in the other resources, and I am not too familiar with VEX yet. So I won’t try to detail instruction encoding any further.