Tuesday, August 22, 2017

GCC low-level IR and basic code generation

This is part three of a series “Writing a GCC back end”.

Compilers are usually described as having three parts – a front end that handles the source language, a middle that optimizes the code using a target-independent representation, and a back end doing the target-specific code generation. GCC is no different in this regard – the front end generates a target-independent representation (GIMPLE) that is used when optimizing the code, and the result is passed to the back end that converts it to its own representation (RTL) and generates the code for the program.

The back end IR

The back end’s internal representation for the code consists of a linked list of objects called insns. An insn corresponds roughly to an assembly instruction, but there are insns representing labels, dispatch tables for switch statements, etc.. Each insnsn is constructed as a tree of expressions, and is usually written in an LISP-like syntax. For example,
(reg:m n)
is an expression representing a register access, and
(plus:m x y)
represents adding the expressions x and y. An insn adding two registers may combine these as
(set (reg:m r0) (plus:m (reg:m r1) (reg:m r2)))
The m in the expressions denotes a machine mode that defines the size and representation of the data object or operation in the expression. There are lots of machine modes, but the most common are
  • QI – “Quarter-Integer” mode represents a single byte treated as an integer.
  • HI – “Half-Integer” mode represents a two-byte integer.
  • SI – “Single Integer” mode represents a four-byte integer.
  • DI – “Double Integer” mode represents an eight-byte integer.
  • CC – “Condition Code” mode represents the value of a condition code (used to represent the result of a comparison operation).
GCC supports architectures where a byte is not 8 bits, but this blog series will assume 8-bit bytes (mostly as I often find it clearer to talk about a 32-bit value in examples, instead of a more abstract SImode value).

Overview of the back end operation

The back end runs a number of passes over the IR, and GCC will output the resulting RTL for each pass when -fdump-rtl-all is passed to the compiler.

The back end starts by converting GIMPLE to RTL, and a small GIMPLE function such as
foo (int a)
{
  int _2;

  <bb 2> [100.00%]:
  _2 = a_1(D) * 42;
  return _2;
}
is expanded to
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:SI 73 [ a ])
        (reg:SI 10 a0 [ a ])) "foo.c":2 -1
     (nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SI 75)
        (const_int 42 [0x2a])) "foo.c":3 -1
     (nil))
(insn 7 6 8 2 (set (reg:SI 74)
        (mult:SI (reg/v:SI 73 [ a ])
            (reg:SI 75))) "foo.c":3 -1
     (nil))
(insn 8 7 12 2 (set (reg:SI 72 [ <retval> ])
        (reg:SI 74)) "foo.c":3 -1
     (nil))
(insn 12 8 13 2 (set (reg/i:SI 10 a0)
        (reg:SI 72 [ <retval> ])) "foo.c":4 -1
     (nil))
(insn 13 12 0 2 (use (reg/i:SI 10 a0)) "foo.c":4 -1
     (nil))
The generated RTL corresponds mostly to real instructions even at this early stage in the back end, but the generated code it is inefficient and registers are still virtual.

The next step is to run optimization passes on the RTL. This is the same kind of optimization passes that have already been run on GIMPLE (constant folding, dead code elimination, simple loop optimizations, etc.), but they can do a better job with knowledge of the target architecture. For example, loop optimizations may transform loops to better take advantage of loop instructions and addressing modes, and dead code elimination may see that the operations working on the upper part of
foo (long long int a, long long int b)
{
  long long int _1;
  long long int _4;

  <bb 2> [100.00%]:
  _1 = a_2(D) + b_3(D);
  _4 = _1 & 255;
  return _4;
}
on a 32-bit architecture are not needed (as the returned value is always 0) and can thus be eliminated.

After this, instructions are combined and spit to better instruction sequences by peep-hole optimizations, registers are allocated, the instructions are scheduled, and the resulting RTL dump contains all the information about which instructions are selected and registers allocated
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 17 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 17 4 2 NOTE_INSN_PROLOGUE_END)
(note 2 17 3 NOTE_INSN_DELETED)
(note 3 2 7 NOTE_INSN_FUNCTION_BEG)
(note 7 3 15 NOTE_INSN_DELETED)
(insn 15 7 12 (set (reg:SI 15 a5 [75])
        (const_int 42 [0x2a])) "foo.c":4 132 {*movsi_internal}
     (expr_list:REG_EQUIV (const_int 42 [0x2a])
        (nil)))
(insn 12 15 13 (set (reg/i:SI 10 a0)
        (mult:SI (reg:SI 10 a0 [ a ])
            (reg:SI 15 a5 [75]))) "foo.c":4 15 {mulsi3}
     (expr_list:REG_DEAD (reg:SI 15 a5 [75])
        (nil)))
(insn 13 12 21 (use (reg/i:SI 10 a0)) "foo.c":4 -1
     (nil))
(note 21 13 19 NOTE_INSN_EPILOGUE_BEG)
(jump_insn 19 21 20 (simple_return) "foo.c":4 211 {simple_return}
     (nil)
 -> simple_return)
(barrier 20 19 16)
(note 16 20 0 NOTE_INSN_DELETED)

Instruction patterns

The target architecture’s instructions are described in machine.md using instruction patterns. A simple instruction pattern looks like
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (mult:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "mul\t%0,%1,%2")
which defines an insn with a name mulsi3 that generates a mul instruction for a 32-bit multiplication.

The first operand in the instruction pattern is a name that is used in debug dumps and when writing C++ code that generates RTL. For example,
emit_insn (gen_mulsi3 (dst, src1, src2));
generates a mulsi3 insn. The back end does not in general need to generate RTL, but the rest of GCC does, and the name tells GCC that it can use the pattern to accomplish a certain task. It is therefore important that the named patterns implement the functionality that GCC expects, but names starting with * are ignored and thus safe to use for non-standard instruction patterns. The back end should only implement the named patterns that make sense for the architecture – GCC will do its best to emit code for missing patterns using other strategies. For example, a 32-bit multiplication will be generated as a call to __mulsi3 in libgcc if the target does not have a mulsi3 instruction pattern.

The next part of the instruction pattern is the RTL template that describes the semantics of the instruction that is generated by the insn
[(set (match_operand:SI 0 "register_operand" "=r")
      (mult:SI (match_operand:SI 1 "register_operand" "r")
               (match_operand:SI 2 "register_operand" "r")))]
This says that the instruction multiplies two registers and places the result in a register. This example is taken from RISC-V that has a multiplication instruction without any side-effects, but some architectures (such as X86_64) sets condition flags as part of the operation, and that needs to be expressed in the RTL template, such as
[(parallel [(set (match_operand:SI 0 "register_operand" "=r")
                 (mult:SI (match_operand:SI 1 "register_operand" "r")
                          (match_operand:SI 2 "register_operand" "r")))
            (clobber (reg:CC FLAGS_REG))])]
where the parallel expresses that the two operations are done in as a unit.

The instruction’s operands are specified with expressions of the form
(match_operand:SI 1 "register_operand" "r")
consisting of four parts:
  • match_operand: followed by the machine mode of the operand.
  • The operand number used as an identifier when referring the operand.
  • A predicate telling what kind of operands are valid ("register_operand" means that it must be a register).
  • A constraint string describing the details of the operand ("r" means it must be a general register). These are the same constraints as are used in inline assembly (but the instruction patterns support additional constraints that are not allowed in inline assembly).
The predicate and constraint string contain similar information, but they are used in different ways:
  • The predicate is used when generating the RTL. As an example, when generating RTL for a GIMPLE function
    foo (int a)
    {
      int _2;
    
      <bb 2> [100.00%]:
      _2 = a_1(D) * 42;
      return _2;
    }
    
    then the GIMPLE to RTL converter will generate the multiplication as a mulsi3 insn. The predicates will be checked for the result _2 and the operands a_1(D) and 42, and it is determined that 42 is not valid as only registers are allowed. GCC will, therefore, insert a movsi insn that moves the constant into a register.
  • The constraints are used when doing the register allocation and final instruction selection. Many architectures have constraints on the operands, such as m68k that has 16 registers (a0a7 and d0d7), but only d0d7 are allowed in a muls instruction. This is expressed by a constraint telling the register allocator that it must use register d0d7.

The string after the RTL template contains C++ code that disables the instruction pattern if it evaluates to false (an empty string is treated as true). This is used when having different versions of the architecture, such as one for small CPUs that do not have multiplication instructions, and one for more advanced cores that can do multiplication. This is typically handled by letting machine.opt generate a global variable TARGET_MUL that can be set by an option such as -mmul and -march, and this global variable is placed in the condition string
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (mult:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  "TARGET_MUL"
  "mul\t%0,%1,%2")
so that the instruction pattern it is disabled (and the compiler will thus generate a call to libgcc) when multiplication is not available.

The resulting instruction is emitted using the output template
"mul\t%0,%1,%2"
where %0, %1, ..., are substituted with the corresponding operands.

More about define_insn

Many architectures need more complex instruction handling than the RISC-V mul instruction described above, but define_insn is flexible enough to handle essentially all cases that occur for real CPUs.

Let’s say our target can multiply a register and an immediate integer and that this requires the first operand to be an even-numbered register, while multiplying two registers requires that the first operand is an odd-numbered register (this is not as strange as it may seem – some CPU designs use such tricks to save one bit in the instruction encoding). This is easily handled by defining a new predicate in predicates.md
(define_predicate "reg_or_int_operand"
  (ior (match_code "const_int")
       (match_operand 0 "register_operand")))
accepting a register or an integer, and new constraints in constraints.md that require an odd or an even register
(define_register_constraint "W" "ODD_REGS"
  "An odd-numbered register.")

(define_register_constraint "D" "EVEN_REGS"
  "An even-numbered register.")
where ODD_REGS and EVEN_REGS are register classes (see part four in this series). We can now use this in the instruction pattern
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (mult:SI (match_operand:SI 1 "register_operand" "%W,D")
                 (match_operand:SI 2 "reg_or_int_operand" "r,i")))]
  ""
  "mul\t%0,%1,%2")
The constraint strings now list two alternatives – one for the register/register case and one for the register/integer case. And there is a % character added to tell the back end that the operation is commutative (so that the code generation may switch the order of the operands if the integer is the first operand, or help the register allocation for cases such as
_1 = _2 * 42;
_3 = _2 * _4;
to let it change the order of arguments on the second line to avoid inserting an extra move to make _2 an even register on the first line and an odd register on the second line).

Sometimes the different alternatives need to generate different instructions, such as the instruction multiplying two registers being called mulr and the multiplication with an integer being called muli. This can be handled by starting the output template string with an @ character and listing the different alternatives in the same order as in the constraint strings
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (mult:SI (match_operand:SI 1 "register_operand" "%W,D")
                 (match_operand:SI 2 "reg_or_int_operand" "r,i")))]
  ""
  "@
   mulr\t%0,%1,%2
   muli\t%0,%1,%2")
Finally, it is possible to write general C++ code that is run when outputting the instruction, so the previous could have been written as
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (mult:SI (match_operand:SI 1 "register_operand" "%W,D")
                  (match_operand:SI 2 "reg_or_int_operand" "r,i")))]
  ""
  {
    return which_alternative == 0 ? "mulr\t%0,%1,%2" : "muli\t%0,%1,%2";
  })
This is usually not that useful for define_insn but may reduce the number of instruction patterns when the instruction names depend on the configuration. For example, mulsi3 in the RISC-V back end must generate mulw in 64-bit mode and mul in 32-bit mode, which is implemented as
(define_insn "mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (mult:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  { return TARGET_64BIT ? "mulw\t%0,%1,%2" : "mul\t%0,%1,%2"; })
where TARGET_64BIT is a global variable defined in riscv.opt.

Further reading

This blog post has only scratched the surface of the RTL and machine description functionality, but everything is documented in “GNU Compiler Collection Internals”:

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.