Continued progress porting Emacs to Rust

The project to port all of the C code in Emacs to Rust has been going quite well of late, and a progress report is in order.

tl;dr: We're one third finished! (By one metric)

Porting Lisp Functions

As you doubtless know, Emacs has an embedded Lisp environment that provides a large number of interesting Lisp functions that the user can call. Many of these are implemented in C for speed, and we've been rewriting them in Rust.

The first thing to look at is the C implementation for the atan function. It takes an optional second argument, which makes it interesting. The complicated mathematical bits, on the other hand, are handled by the standard library. This allows us to focus on the porting process without getting distracted by the math.

The Lisp values we are given as arguments are tagged pointers; in this case they are pointers to doubles. The code has to check the tag and follow the pointer to retrieve the real values. Note that this code invokes a C macro (called DEFUN) that reduces some of the boilerplate. The macro declares a static variable called Satan that holds the metadata the Lisp compiler will need in order to successfully call this function, such as the docstring and the pointer to the Fatan function, which is what the C implementation is named:

DEFUN ("atan", Fatan, Satan, 1, 2, 0,
       doc: /* Return the inverse tangent of the arguments.
If only one argument Y is given, return the inverse tangent of Y.
If two arguments Y and X are given, return the inverse tangent of Y
divided by X, i.e. the angle in radians between the vector (X, Y)
and the x-axis.  */)
  (Lisp_Object y, Lisp_Object x)
{
  double d = extract_float (y);

  if (NILP (x))
    d = atan (d);
  else
    {
      double d2 = extract_float (x);
      d = atan2 (d, d2);
    }
  return make_float (d);
}
        

extract_float checks the tag (signalling an "invalid argument" error if it's not the tag for a double), and returns the actual value. NILP checks to see if the tag indicates that this is a null value, indicating that the user didn't supply a second argument at all.
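To make the tag check concrete, here is a toy model of the same logic in Rust. The three-bit tag layout, the `Value` and `Heap` types, and the use of a `Vec` index in place of a real pointer are all assumptions made for illustration; the actual Emacs encoding differs in detail, and the real code signals a Lisp error rather than returning a `Result`.

```rust
/// Toy tagged word: the low 3 bits hold a type tag, the remaining bits an
/// index into a side table that stands in for the heap. (Hypothetical layout.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Value(usize);

const TAG_BITS: usize = 3;
const TAG_MASK: usize = (1 << TAG_BITS) - 1;
const TAG_NIL: usize = 0;
const TAG_FLOAT: usize = 1;

/// The nil value: tag only, no payload.
pub const NIL: Value = Value(TAG_NIL);

/// Stand-in for the Lisp heap; only doubles are stored in this sketch.
pub struct Heap {
    floats: Vec<f64>,
}

impl Heap {
    pub fn new() -> Self {
        Heap { floats: Vec::new() }
    }

    /// Counterpart of make_float: store the double, return a tagged handle.
    pub fn make_float(&mut self, d: f64) -> Value {
        self.floats.push(d);
        Value(((self.floats.len() - 1) << TAG_BITS) | TAG_FLOAT)
    }

    /// Counterpart of extract_float: check the tag, then follow the "pointer".
    pub fn extract_float(&self, v: Value) -> Result<f64, String> {
        if (v.0 & TAG_MASK) != TAG_FLOAT {
            return Err(format!("invalid argument: {:?}", v));
        }
        Ok(self.floats[v.0 >> TAG_BITS])
    }
}

/// Counterpart of NILP: true when the tag says "nil".
pub fn nilp(v: Value) -> bool {
    (v.0 & TAG_MASK) == TAG_NIL
}

/// Same control flow as the C Fatan above, written against the toy types.
pub fn fatan(heap: &mut Heap, y: Value, x: Value) -> Result<Value, String> {
    let d = heap.extract_float(y)?;
    let d = if nilp(x) {
        d.atan()
    } else {
        d.atan2(heap.extract_float(x)?)
    };
    Ok(heap.make_float(d))
}
```

Every caller has to remember to do the tag check before touching the payload, which is exactly the boilerplate the C code repeats in each function.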

Next take a look at the current Rust implementation. It must also take an optional argument, and it also invokes a (Rust) macro to reduce the boilerplate of declaring the static data for the function. However, it also takes care of all of the type conversions and checks that we need to do in order to handle the arguments and return value:

/// Return the inverse tangent of the arguments.
/// If only one argument Y is given, return the inverse tangent of Y.
/// If two arguments Y and X are given, return the inverse tangent of Y
/// divided by X, i.e. the angle in radians between the vector (X, Y)
/// and the x-axis
#[lisp_fn(min = "1")]
pub fn atan(y: EmacsDouble, x: Option<EmacsDouble>) -> EmacsDouble {
    match x {
        None => y.atan(),
        Some(x) => y.atan2(x)
    }
}
        

You can see that we don't have to check whether our arguments are of the correct type; the code generated by the lisp_fn macro does this for us. We also asked for the second argument to be an Option<EmacsDouble>; this is the Rust type for a value which is either a valid double or isn't specified at all. We use a match expression to handle both cases.

This code is so much better that it's hard to believe just how simple the implementation of the macro is. It just calls .into() on the arguments and the return value; the compiler does the rest when it dispatches this method call to the correct implementation.
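As a rough sketch of that idea, the following shows how a handful of conversion impls let a one-line wrapper call .into() on everything. The `LispObject` enum, the `fatan` wrapper name, and the use of a panic in place of a Lisp error are assumptions for illustration; this is not the code the real macro generates.

```rust
type EmacsDouble = f64;

/// Hypothetical stand-in for a Lisp value.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LispObject {
    Nil,
    Float(EmacsDouble),
}

/// Required argument: must be a float.
impl From<LispObject> for EmacsDouble {
    fn from(o: LispObject) -> Self {
        match o {
            LispObject::Float(d) => d,
            // Real code would signal a Lisp error here; the sketch panics.
            other => panic!("wrong type argument: {:?}", other),
        }
    }
}

/// Optional argument: nil means "not supplied".
impl From<LispObject> for Option<EmacsDouble> {
    fn from(o: LispObject) -> Self {
        match o {
            LispObject::Nil => None,
            other => Some(EmacsDouble::from(other)),
        }
    }
}

/// Return value: wrap the double back up as a Lisp value.
impl From<EmacsDouble> for LispObject {
    fn from(d: EmacsDouble) -> Self {
        LispObject::Float(d)
    }
}

/// The function as the author of the port writes it.
pub fn atan(y: EmacsDouble, x: Option<EmacsDouble>) -> EmacsDouble {
    match x {
        None => y.atan(),
        Some(x) => y.atan2(x),
    }
}

/// Roughly the shape of wrapper a macro could emit: the parameter and
/// return types of `atan` tell the compiler which conversion to pick.
pub fn fatan(y: LispObject, x: LispObject) -> LispObject {
    atan(y.into(), x.into()).into()
}
```

Adding support for a new argument type then means writing one more `From` impl, after which every ported function can use it.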

So far we've ported 394 individual Lisp functions from C to Rust, of which 207 were ported in this last year. This is about a third of the total, as you can see in this graph. We've actually completely ported several whole C files now. In no particular order: Sean Perry finished off src/floatfns.c, brotzeit cleared out src/marker.c, I emptied src/cmds.c, and Harry Fei obliterated src/decompress.c.

Automation via Bindgen

Because part of Remacs is written in Rust, while the bulk of it is still in C, both the Rust and the C code must be able to call functions written in the other language. Rust makes this fairly painless; it has excellent support for C FFI.

However, manually translating function and structure declarations into Rust can be quite painful. Worse, any tiny mistake will come back to haunt you later. Crashes and weird bugs that don't make sense are a very real problem. We had several intermittent bugs that were introduced when a complicated struct was incorrectly translated into Rust, so that parts of the code were stepping on each other.
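To illustrate how one wrong field type can cause this kind of bug, here is a small sketch. Both structs are hypothetical and are not taken from Emacs; the point is only that a single mistranslated field shifts every field after it, so Rust and C end up reading and writing different bytes.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr::addr_of;

/// Stand-in for the layout the C compiler actually uses.
#[repr(C)]
pub struct Correct {
    pub flags: u32,
    pub count: u32,
    pub next: u64,
}

/// A hand translation with one mistake: `count` was widened to u64.
#[repr(C)]
pub struct Mistranslated {
    pub flags: u32,
    pub count: u64,
    pub next: u64,
}

/// Byte offset of `next` in the correct layout.
pub fn next_offset_correct() -> usize {
    let slot = MaybeUninit::<Correct>::uninit();
    let base = slot.as_ptr();
    // addr_of! computes the field address without reading the memory.
    unsafe { addr_of!((*base).next) as usize - base as usize }
}

/// Byte offset of `next` in the mistranslated layout.
pub fn next_offset_mistranslated() -> usize {
    let slot = MaybeUninit::<Mistranslated>::uninit();
    let base = slot.as_ptr();
    unsafe { addr_of!((*base).next) as usize - base as usize }
}

/// True when the two translations disagree about size or field position.
pub fn layouts_disagree() -> bool {
    size_of::<Correct>() != size_of::<Mistranslated>()
        || next_offset_correct() != next_offset_mistranslated()
}
```

Nothing in the compiler flags the second struct as wrong, because both definitions are perfectly valid Rust; the mismatch only shows up at run time.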

We've fixed this problem by using Bindgen to generate these bindings for us.

Aside from saving us a lot of time, Bindgen also gives us relatively nice ways to handle C enums, unions, bitfields, and variable-length structures. Emacs frequently uses these, so this is a great help.

First allow me to show you a fairly important C structure called Lisp_Symbol. This struct holds all of the information that Emacs knows about a Lisp symbol. It's got a number of bit fields as well as an internal union. Note that I've elided the comments from this declaration:

struct Lisp_Symbol
{
  bool_bf gcmarkbit : 1;
  ENUM_BF (symbol_redirect) redirect : 3;
  ENUM_BF (symbol_trapped_write) trapped_write : 2;
  unsigned interned : 2;
  bool_bf declared_special : 1;
  bool_bf pinned : 1;
  Lisp_Object name;
  union {
    Lisp_Object value;
    struct Lisp_Symbol *alias;
    struct Lisp_Buffer_Local_Value *blv;
    union Lisp_Fwd *fwd;
  } val;
  Lisp_Object function;
  Lisp_Object plist;
  struct Lisp_Symbol *next;
};
        

ENUM_BF and bool_bf are C preprocessor hacks that allow the code to be compiled even when the compiler doesn't support enums or bools as bitfield types. Bindgen generates the following Rust struct:

#[repr(C)]
pub struct Lisp_Symbol {
    pub _bitfield_1: __BindgenBitfieldUnit<[u8; 2usize], u8>,
    pub name: Lisp_Object,
    pub val: Lisp_Symbol__bindgen_ty_1,
    pub function: Lisp_Object,
    pub plist: Lisp_Object,
    pub next: *mut Lisp_Symbol,
}
#[repr(C)]
pub union Lisp_Symbol__bindgen_ty_1 {
    pub value: Lisp_Object,
    pub alias: *mut Lisp_Symbol,
    pub blv: *mut Lisp_Buffer_Local_Value,
    pub fwd: *mut Lisp_Fwd,
    _bindgen_union_align: u64,
}
        

As you can see, the bitfields become rather opaque; they're no longer listed in the struct. (You can, however, still see that they occupy two bytes in the struct.) Instead, Bindgen creates getter and setter methods and adds them to the impl Lisp_Symbol. I'll just show an excerpt here:

impl Lisp_Symbol {
    #[inline]
    pub fn gcmarkbit(&self) -> bool_bf {
        unsafe { ::std::mem::transmute(self._bitfield_1.get(0usize, 1u8) as u8) }
    }
    #[inline]
    pub fn set_gcmarkbit(&mut self, val: bool_bf) {
        unsafe {
            let val: u8 = ::std::mem::transmute(val);
            self._bitfield_1.set(0usize, 1u8, val as u64)
        }
    }
    // ---8<---
}
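
For readers curious about what such accessors do underneath, here is a simplified model. `BitUnit` is a hypothetical stand-in for Bindgen's storage type, not its actual implementation, and the bit offsets assume the fields are packed in declaration order starting from the least significant bit, which is typical on little-endian targets but is not guaranteed by the C standard.

```rust
/// Toy bitfield storage: 16 bits packed into two bytes.
/// Bit 0 is the least significant bit of byte 0.
#[derive(Default, Debug, Clone, Copy, PartialEq)]
pub struct BitUnit {
    storage: [u8; 2],
}

impl BitUnit {
    /// Read `width` bits starting at bit `offset`.
    pub fn get(&self, offset: usize, width: u8) -> u64 {
        let mut val = 0u64;
        for i in 0..width as usize {
            let bit = offset + i;
            if (self.storage[bit / 8] & (1u8 << (bit % 8))) != 0 {
                val |= 1u64 << i;
            }
        }
        val
    }

    /// Write the low `width` bits of `val` starting at bit `offset`.
    pub fn set(&mut self, offset: usize, width: u8, val: u64) {
        for i in 0..width as usize {
            let bit = offset + i;
            let mask = 1u8 << (bit % 8);
            if ((val >> i) & 1) == 1 {
                self.storage[bit / 8] |= mask;
            } else {
                self.storage[bit / 8] &= !mask;
            }
        }
    }
}

/// Typed accessors for three of the fields in the struct above.
/// Assumed offsets: gcmarkbit at bit 0, redirect at bits 1..=3, pinned at bit 9.
#[derive(Default)]
pub struct SymbolFlags {
    unit: BitUnit,
}

impl SymbolFlags {
    pub fn gcmarkbit(&self) -> bool {
        self.unit.get(0, 1) != 0
    }
    pub fn set_gcmarkbit(&mut self, v: bool) {
        self.unit.set(0, 1, v as u64)
    }
    pub fn redirect(&self) -> u8 {
        self.unit.get(1, 3) as u8
    }
    pub fn set_redirect(&mut self, v: u8) {
        self.unit.set(1, 3, v as u64)
    }
    pub fn pinned(&self) -> bool {
        self.unit.get(9, 1) != 0
    }
    pub fn set_pinned(&mut self, v: bool) {
        self.unit.set(9, 1, v as u64)
    }
}
```

Setting one field leaves its neighbours untouched, which is the property that is easy to get wrong when masking and shifting by hand.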
        

The union is also a little more verbose than before, as it cannot be put anonymously into the rest of the struct; Rust requires that it have a proper name, and so Bindgen has generated one. It's not great, but it'll suffice.
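Using a union like this from Rust looks roughly as follows. The types here are cut-down, hypothetical versions with only two members, and the sketch assumes (as the field names suggest) that the redirect field records which member of the union is currently valid.

```rust
/// Named union, as Rust requires. `value` stands in for Lisp_Object.
#[repr(C)]
#[derive(Clone, Copy)]
pub union SymbolVal {
    pub value: u64,
    pub alias: *const Symbol,
}

/// Which member of the union is live (two of the cases only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Redirect {
    PlainVal,
    VarAlias,
}

pub struct Symbol {
    pub redirect: Redirect,
    pub val: SymbolVal,
}

/// Follow alias links until a symbol with a plain value is reached.
pub fn find_value(mut sym: &Symbol) -> u64 {
    loop {
        match sym.redirect {
            // Reading a union field is unsafe: the compiler cannot know which
            // member was written last, so we rely on `redirect` to tell us.
            Redirect::PlainVal => return unsafe { sym.val.value },
            // The caller must guarantee the alias pointer is valid.
            Redirect::VarAlias => sym = unsafe { &*sym.val.alias },
        }
    }
}
```

Writing a member (for example `SymbolVal { value: 42 }`) is safe; only reads need the unsafe block.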

Porting Lisp Variables

You may also be aware that the C code must quickly and frequently access the current value of a large number of Lisp variables. To make this possible, the C code stores these values in global variables. Yes, lots of global variables. In fact, these aren't just file-scope statics visible to only one translation unit; these are true globals that are accessible across the whole program. We've started porting these to Rust now as well.

  DEFVAR_LISP ("post-self-insert-hook", Vpost_self_insert_hook,
              doc: /* Hook run at the end of `self-insert-command'.
This is run after inserting the character.  */);
  Vpost_self_insert_hook = Qnil;
        

Like DEFUN, DEFVAR_LISP takes both a Lisp name and the C name. The C name becomes the name of the global variable, while the Lisp name is what gets used in Lisp source code. Setting the default value of this variable happens in a separate statement, which is fine.

    /// Hook run at the end of `self-insert-command'.
    /// This is run after inserting the character.
    defvar_lisp!(Vpost_self_insert_hook, "post-self-insert-hook", Qnil);
        

The Rust version must still take both names (this could be simplified if we wrote this macro using a procedural macro), but it also takes a default value. As before, the docstring becomes a comment which all other Rust tooling will recognize.

You might be interested in how this is implemented as well:

#define DEFVAR_LISP(lname, vname, doc)		\
  do {						\
    static struct Lisp_Objfwd o_fwd;		\
    defvar_lisp (&o_fwd, lname, &globals.f_ ## vname);		\
  } while (false)
        

The C macro is not very complicated, but there are two somewhat subtle points. It creates an (uninitialized) static variable called o_fwd, of type Lisp_Objfwd. This holds the variable's value, which is a Lisp_Object. It then calls the defvar_lisp function to initialize the fields of this struct, and also to register the variable in the Lisp runtime's global environment, making it accessible to Lisp code.

The first subtle point is that every invocation of this macro uses the same variable name, o_fwd. If you invoked this macro more than once inside the same scope, the declarations would clash, since each one declares a static variable with that same name. Instead the macro body is wrapped inside a do while false loop so that each one has a separate little scope to live in.
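The same scoping concern exists for Rust macros, because items such as statics declared inside a macro_rules! body are not renamed per expansion. A minimal sketch, using a hypothetical counter macro rather than anything from Remacs:

```rust
/// Each expansion wraps its static in its own block (the inner braces),
/// so two invocations in one function get two independent statics.
/// Without the inner braces, the second expansion would fail to compile
/// with a duplicate definition of COUNTER.
macro_rules! bump_own_counter {
    () => {{
        static COUNTER: ::std::sync::atomic::AtomicUsize =
            ::std::sync::atomic::AtomicUsize::new(0);
        COUNTER.fetch_add(1, ::std::sync::atomic::Ordering::SeqCst) + 1
    }};
}

/// Returns the values of two separate counters after one bump each.
pub fn two_counters() -> (usize, usize) {
    let a = bump_own_counter!(); // bumps the first static
    let b = bump_own_counter!(); // bumps a second, unrelated static
    (a, b)
}
```

If both invocations shared one static, the second value would always be one higher than the first; because each has its own, the two values stay equal.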

The other subtlety is that the Lisp_Objfwd struct actually only has a pointer to the value; we still have to allocate some storage for that value somewhere. We take the address of a field on something called globals here; that's the real storage location. This globals object is just a big global struct that holds all the global variables; one day when Emacs is really multi-threaded, there can be one of these per thread and a lot of the rest of the code will just work.
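The forwarding arrangement can be modelled in safe Rust. Everything in this sketch is a hypothetical miniature: `Globals` has a single field, values are plain integers, and `Runtime` is a stand-in for the Lisp environment. It only aims to show that the forwarding record and the native field refer to the same storage.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Miniature `globals` struct: one field per Lisp variable.
#[derive(Default)]
pub struct Globals {
    pub post_self_insert_hook: Cell<i64>,
}

/// Forwarding record: no storage of its own, only a reference to the field.
pub struct ObjFwd<'g> {
    objvar: &'g Cell<i64>,
}

/// Stand-in for the Lisp runtime's table of named variables.
pub struct Runtime<'g> {
    vars: HashMap<&'static str, ObjFwd<'g>>,
}

impl<'g> Runtime<'g> {
    pub fn new() -> Self {
        Runtime { vars: HashMap::new() }
    }

    /// Register a Lisp name that forwards to a field of `Globals`.
    pub fn defvar(&mut self, lisp_name: &'static str, field: &'g Cell<i64>) {
        self.vars.insert(lisp_name, ObjFwd { objvar: field });
    }

    /// What Lisp code does: set the variable by name. Returns false if unknown.
    pub fn set(&self, name: &str, v: i64) -> bool {
        match self.vars.get(name) {
            Some(fwd) => {
                fwd.objvar.set(v);
                true
            }
            None => false,
        }
    }

    /// What Lisp code does: read the variable by name.
    pub fn get(&self, name: &str) -> Option<i64> {
        self.vars.get(name).map(|fwd| fwd.objvar.get())
    }
}
```

Native code reads `globals.post_self_insert_hook` directly with no lookup, while Lisp code goes through the name table; both see the same value because there is only one storage location.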

#[macro_export]
macro_rules! defvar_lisp {
    ($field_name:ident, $lisp_name:expr, $value:expr) => {{
        #[allow(unused_unsafe)]
        unsafe {
            #[allow(const_err)]
            static mut o_fwd: ::hacks::Hack<::data::Lisp_Objfwd> =
                unsafe { ::hacks::Hack::uninitialized() };
            ::remacs_sys::defvar_lisp(
                o_fwd.get_mut(),
                concat!($lisp_name, "\0").as_ptr() as *const i8,
                &mut ::remacs_sys::globals.$field_name,
            );
            ::remacs_sys::globals.$field_name = $value;
        }
    }};
}
        

The Rust version of this macro is rather longer. Primarily this is because it takes a lot more typing to get a proper uninitialized value in a Rust program. Some would argue that all of this typing is a bad thing, but this is very much an unsafe operation. We're basically promising very precisely that we know this value is uninitialized, and that it will be completely and correctly initialized by the end of this unsafe block.
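The same promise can be expressed with the standard library's MaybeUninit type. This sketch is not how the macro above is written (it uses its own Hack wrapper); the `Fwd` struct and `init_fwd` function are hypothetical stand-ins for a struct that gets filled in by an external initializer.

```rust
use std::mem::MaybeUninit;

/// Hypothetical struct that an initializer fills in field by field.
#[derive(Debug, PartialEq)]
pub struct Fwd {
    pub kind: u32,
    pub target: usize,
}

/// Plays the role of the C initializer: writes the whole struct through a
/// raw pointer without reading the old contents.
///
/// Safety: `out` must be valid for writes and properly aligned.
pub unsafe fn init_fwd(out: *mut Fwd, target: usize) {
    unsafe { out.write(Fwd { kind: 1, target }) }
}

/// Reserve uninitialized storage, let the initializer fill it, and only
/// then treat it as a real `Fwd`.
pub fn make_fwd(target: usize) -> Fwd {
    let mut slot = MaybeUninit::<Fwd>::uninit();
    unsafe {
        init_fwd(slot.as_mut_ptr(), target);
        // Sound only because init_fwd wrote every field just above.
        slot.assume_init()
    }
}
```

The unsafe block marks exactly the span in which the value may still be uninitialized, which is the promise described above.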

We then call the same defvar_lisp function with the same values, so that the Lisp_Objfwd struct gets initialized and registered in exactly the same way as in the C code. We do have to take care to ensure that the Lisp name of the variable is a null-terminated string though.
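Here is a small, self-contained sketch of that null-termination trick. The `c_name!` macro and the two helper functions are hypothetical names introduced for the example; `concat!`, `CStr`, and `c_char` are from the standard library.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Append the terminator at compile time, so no allocation is needed.
macro_rules! c_name {
    ($name:expr) => {
        concat!($name, "\0")
    };
}

/// A pointer that C code can treat as a `const char *`.
pub fn lisp_name_ptr() -> *const c_char {
    c_name!("post-self-insert-hook").as_ptr() as *const c_char
}

/// Check the literal the way C would see it: bytes up to the first NUL.
pub fn round_trip() -> String {
    let bytes = c_name!("post-self-insert-hook").as_bytes();
    let c = CStr::from_bytes_with_nul(bytes).expect("exactly one trailing NUL");
    c.to_str().expect("name is valid UTF-8").to_owned()
}
```

One detail worth knowing: casting to `*const i8` assumes that C's char is signed on the target. That holds on x86, but `c_char` is `u8` on some platforms such as Linux on aarch64, so casting to `*const c_char` is the more portable spelling.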

Want to help?

If you know some Rust and some C, or know one but want to learn the other, then you could do a lot worse than to lend a hand. If anything you've read here was interesting to you (and you read this far so I have to assume that was the case), then we would love to have your help. If you'd like to talk to anyone involved in the project, you can join us in our chat room; many of us hang out there.