The project to port all of the C code in Emacs to Rust has been going quite well of late, and a progress report is in order.
tl;dr: We're one-third finished! (By one metric.)
As you doubtless know, Emacs has an embedded Lisp environment that provides a large number of interesting Lisp functions that the user can call. Many of these are implemented in C for speed, and we've been rewriting them in Rust.
The first thing to look at is the C implementation for the atan function. It takes an optional second argument, which makes it interesting. The complicated mathematical bits, on the other hand, are handled by the standard library. This allows us to focus on the porting process without getting distracted by the math.
The Lisp values we are given as arguments are tagged pointers; in this case they are pointers to doubles. The code has to check the tag and follow the pointer to retrieve the real values. Note that this code invokes a C macro (called DEFUN) that reduces some of the boilerplate. The macro declares a static variable called Satan that holds the metadata the Lisp compiler will need in order to successfully call this function, such as the docstring and the pointer to the Fatan function, which is what the C implementation is named:
    DEFUN ("atan", Fatan, Satan, 1, 2, 0,
           doc: /* Return the inverse tangent of the arguments.
    If only one argument Y is given, return the inverse tangent of Y.
    If two arguments Y and X are given, return the inverse tangent of Y
    divided by X, i.e. the angle in radians between the vector (X, Y)
    and the x-axis.  */)
      (Lisp_Object y, Lisp_Object x)
    {
      double d = extract_float (y);

      if (NILP (x))
        d = atan (d);
      else
        {
          double d2 = extract_float (x);
          d = atan2 (d, d2);
        }
      return make_float (d);
    }
extract_float checks the tag (signalling an "invalid argument" error if it's not the tag for a double), and returns the actual value. NILP checks to see if the tag indicates that this is a null value, indicating that the user didn't supply a second argument at all.
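To make the tag checks concrete, here is a minimal sketch in Rust of what the C code is doing. It is only a toy model: the enum LispObject, the error type WrongTypeArgument, and the functions nilp, extract_float and atan below are illustrative stand-ins written for this post, not the real Emacs or Remacs definitions. In particular, the real Lisp_Object packs its tag into the bits of a machine word rather than using an enum, and this extract_float follows the simplified description above.

```rust
// Toy model of a tagged Lisp value (an assumption for illustration only;
// the real Lisp_Object is a machine word with tag bits).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LispObject {
    Nil,
    Int(i64),
    Float(f64),
}

// Hypothetical error type standing in for a signalled Lisp error.
#[derive(Debug, PartialEq)]
pub struct WrongTypeArgument;

// Counterpart of NILP: true when the value carries the "nil" tag.
pub fn nilp(obj: LispObject) -> bool {
    matches!(obj, LispObject::Nil)
}

// Counterpart of extract_float as described above: check the tag, then
// hand back the double, or report an error for any other tag.
pub fn extract_float(obj: LispObject) -> Result<f64, WrongTypeArgument> {
    match obj {
        LispObject::Float(d) => Ok(d),
        _ => Err(WrongTypeArgument),
    }
}

// Same control flow as the C Fatan: one argument means atan,
// two arguments mean atan2.
pub fn atan(y: LispObject, x: LispObject) -> Result<LispObject, WrongTypeArgument> {
    let d = extract_float(y)?;
    let result = if nilp(x) {
        d.atan()
    } else {
        d.atan2(extract_float(x)?)
    };
    Ok(LispObject::Float(result))
}
```

Calling atan with LispObject::Nil as the second argument takes the one-argument path, while passing a value with the wrong tag returns the error instead of a result.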
Next, take a look at the current Rust implementation. It must also take an optional argument, and it also invokes a (Rust) macro to reduce the boilerplate of declaring the static data for the function. However, the macro also takes care of all of the type conversions and checks that we need to do in order to handle the arguments and return value:
    /// Return the inverse tangent of the arguments.
    /// If only one argument Y is given, return the inverse tangent of Y.
    /// If two arguments Y and X are given, return the inverse tangent of Y
    /// divided by X, i.e. the angle in radians between the vector (X, Y)
    /// and the x-axis
    #[lisp_fn(min = "1")]
    pub fn atan(y: EmacsDouble, x: Option<EmacsDouble>) -> EmacsDouble {
        match x {
            None => y.atan(),
            Some(x) => y.atan2(x)
        }
    }
You can see that we don't have to check to see if our arguments are of the correct type; the code generated by the lisp_fn macro does this for us. We also asked for the second argument to be an Option<EmacsDouble>; this is the Rust type for a value which is either a valid double or isn't specified at all. We use a match statement to handle both cases.
This code is so much better that it's hard to believe just how simple the implementation of the macro is. It just calls .into() on the arguments and the return value; the compiler does the rest when it dispatches this method call to the correct implementation.
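As a rough illustration of how .into() can carry all of the conversions, here is a hand-written sketch of the kind of wrapper such a macro could generate. Everything in it is an assumption made for the example: the toy LispObject enum, the From implementations, and the wrapper name fatan are not the actual Remacs definitions, and a real conversion would signal a Lisp error rather than panic.

```rust
// Toy stand-in for the real tagged value type (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LispObject {
    Nil,
    Float(f64),
}

pub type EmacsDouble = f64;

// Required argument: the tag must say "float".
impl From<LispObject> for EmacsDouble {
    fn from(obj: LispObject) -> Self {
        match obj {
            LispObject::Float(d) => d,
            // Assumption: a real implementation signals a Lisp error here.
            LispObject::Nil => panic!("wrong type argument: expected a float"),
        }
    }
}

// Optional argument: nil becomes None, anything else must be a float.
impl From<LispObject> for Option<EmacsDouble> {
    fn from(obj: LispObject) -> Self {
        match obj {
            LispObject::Nil => None,
            other => Some(other.into()),
        }
    }
}

// Return value: wrap the double back up as a Lisp value.
impl From<EmacsDouble> for LispObject {
    fn from(d: EmacsDouble) -> Self {
        LispObject::Float(d)
    }
}

// The function as the author of a primitive would write it.
pub fn atan(y: EmacsDouble, x: Option<EmacsDouble>) -> EmacsDouble {
    match x {
        None => y.atan(),
        Some(x) => y.atan2(x),
    }
}

// Hypothetical generated wrapper: the parameter and return types of `atan`
// tell the compiler which conversion each `.into()` call must use.
pub fn fatan(y: LispObject, x: LispObject) -> LispObject {
    atan(y.into(), x.into()).into()
}
```

The wrapper itself contains no type-specific logic; changing a parameter of atan from EmacsDouble to Option<EmacsDouble> is enough to select a different conversion.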
So far we've ported 394 individual Lisp functions from C to Rust, of which 207 were ported in this last year. This is about a third of the total, as you can see by this graph. We've actually completely ported several whole C files now. In no particular order: Sean Perry finished off src/floatfns.c, brotzeit cleared out src/marker.c, I emptied src/cmds.c, and Harry Fei obliterated src/decompress.c.
Because part of Remacs is written in Rust, while the bulk of it is still in C, both the Rust and the C code must be able to call functions written in the other language. Rust makes this fairly painless; it has excellent support for C FFI.
However, manually translating function and structure declarations into Rust can be quite painful. Worse, any tiny mistake will come back to haunt you later. Crashes and weird bugs that don't make sense are a very real problem. We had several intermittent bugs that were introduced when a complicated struct was incorrectly translated into Rust, so that parts of the code were stepping on each other.
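To see how small such a mistake can be, consider the following sketch. The C struct in the comment and both Rust structs are hypothetical examples invented for this illustration; they are not taken from Emacs. The sizes quoted in the comments assume a typical platform where u32 is aligned to 4 bytes.

```rust
use std::mem::size_of;

// Faithful translation of a hypothetical C declaration:
//   struct widget { uint8_t kind; uint32_t len; uint16_t flags; };
#[repr(C)]
pub struct Widget {
    pub kind: u8,
    pub len: u32,
    pub flags: u16,
}

// The same struct translated by hand with a single wrong field type.
#[repr(C)]
pub struct WidgetMistranslated {
    pub kind: u8,
    pub len: u16, // should have been u32
    pub flags: u16,
}

// With 4-byte alignment for u32 the correct layout is 12 bytes, while the
// mistranslated one is only 6, so `flags` ends up at a different offset and
// the two sides of the FFI boundary read and write different memory.
pub fn sizes() -> (usize, usize) {
    (size_of::<Widget>(), size_of::<WidgetMistranslated>())
}
```

Both versions compile without complaint, which is exactly why this class of bug is so unpleasant to track down.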
We've fixed this problem by using Bindgen to generate these bindings for us.
Aside from saving us a lot of time, Bindgen also gives us relatively nice ways to handle C enums, unions, bitfields, and variable-length structures. Emacs frequently uses these, so this is a great help.
First allow me to show you a fairly important C structure called Lisp_Symbol. This struct holds all of the information that Emacs knows about a Lisp symbol. It's got a number of bit fields as well as an internal union. Note that I've elided the comments from this declaration:
    struct Lisp_Symbol
    {
      bool_bf gcmarkbit : 1;
      ENUM_BF (symbol_redirect) redirect : 3;
      ENUM_BF (symbol_trapped_write) trapped_write : 2;
      unsigned interned : 2;
      bool_bf declared_special : 1;
      bool_bf pinned : 1;
      Lisp_Object name;
      union {
        Lisp_Object value;
        struct Lisp_Symbol *alias;
        struct Lisp_Buffer_Local_Value *blv;
        union Lisp_Fwd *fwd;
      } val;
      Lisp_Object function;
      Lisp_Object plist;
      struct Lisp_Symbol *next;
    };
ENUM_BF and bool_bf are C preprocessor hacks that allow the code to be compiled even when the compiler doesn't support enums or bools as bitfield types. Bindgen generates the following Rust struct:
    #[repr(C)]
    pub struct Lisp_Symbol {
        pub _bitfield_1: __BindgenBitfieldUnit<[u8; 2usize], u8>,
        pub name: Lisp_Object,
        pub val: Lisp_Symbol__bindgen_ty_1,
        pub function: Lisp_Object,
        pub plist: Lisp_Object,
        pub next: *mut Lisp_Symbol,
    }

    #[repr(C)]
    pub union Lisp_Symbol__bindgen_ty_1 {
        pub value: Lisp_Object,
        pub alias: *mut Lisp_Symbol,
        pub blv: *mut Lisp_Buffer_Local_Value,
        pub fwd: *mut Lisp_Fwd,
        _bindgen_union_align: u64,
    }
As you can see, the bitfields become rather opaque; they're no longer listed in the struct. (You can, however, still see that they occupy two bytes in the struct.) Instead, Bindgen creates getter and setter methods and adds them to the impl Lisp_Symbol. I'll just show an excerpt here:
    impl Lisp_Symbol {
        #[inline]
        pub fn gcmarkbit(&self) -> bool_bf {
            unsafe { ::std::mem::transmute(self._bitfield_1.get(0usize, 1u8) as u8) }
        }
        #[inline]
        pub fn set_gcmarkbit(&mut self, val: bool_bf) {
            unsafe {
                let val: u8 = ::std::mem::transmute(val);
                self._bitfield_1.set(0usize, 1u8, val as u64)
            }
        }
        // ---8<---
    }
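If you are curious what a get(offset, width) and set(offset, width, value) pair has to do underneath, the following is a simplified sketch using shifts and masks. It is not Bindgen's actual __BindgenBitfieldUnit; the BitfieldUnit and SymbolFlags types are made up for this example, and the bit offsets assume the fields are packed from the least significant bit upward, which in reality depends on the platform ABI.

```rust
// Simplified stand-in for the generated bitfield storage: 16 bits in a u16.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct BitfieldUnit {
    storage: u16,
}

impl BitfieldUnit {
    // Read `width` bits starting at bit `offset` (bit 0 is least significant).
    pub fn get(&self, offset: u32, width: u32) -> u16 {
        let mask = ((1u32 << width) - 1) as u16;
        (self.storage >> offset) & mask
    }

    // Overwrite `width` bits starting at `offset`, leaving other bits alone.
    // Bits of `val` that do not fit in the field are discarded.
    pub fn set(&mut self, offset: u32, width: u32, val: u16) {
        let mask = (((1u32 << width) - 1) as u16) << offset;
        self.storage = (self.storage & !mask) | ((val << offset) & mask);
    }
}

// Accessors in the style of the generated ones, for the first two fields
// of Lisp_Symbol (gcmarkbit: 1 bit, redirect: 3 bits).
#[derive(Debug, Default)]
pub struct SymbolFlags {
    bits: BitfieldUnit,
}

impl SymbolFlags {
    pub fn gcmarkbit(&self) -> bool {
        self.bits.get(0, 1) != 0
    }
    pub fn set_gcmarkbit(&mut self, val: bool) {
        self.bits.set(0, 1, val as u16)
    }
    pub fn redirect(&self) -> u16 {
        self.bits.get(1, 3)
    }
    pub fn set_redirect(&mut self, val: u16) {
        self.bits.set(1, 3, val)
    }
}
```

Setting one field never disturbs its neighbours, because the setter clears only the bits covered by that field's mask before writing the new value.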
The union is also a little more verbose than before, as it cannot be put anonymously into the rest of the struct; Rust requires that it have a proper name, and so Bindgen has generated one. It's not great, but it'll suffice.
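Reading from a union like this in Rust requires an unsafe block, because the compiler cannot know which member is currently valid. The sketch below shows the general pattern of letting a separate tag field decide which member to read. The Symbol, SymbolVal and Redirect types are simplified stand-ins invented for the example (with a u64 playing the role of Lisp_Object), not the generated bindings.

```rust
// Simplified union: either a plain value or a pointer to the aliased symbol.
#[repr(C)]
#[derive(Clone, Copy)]
pub union SymbolVal {
    pub value: u64,
    pub alias: *const Symbol,
}

// Hypothetical tag saying which union member is valid.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Redirect {
    PlainVal,
    VarAlias,
}

pub struct Symbol {
    pub redirect: Redirect,
    pub val: SymbolVal,
}

// Follow aliases until a symbol with a plain value is found.
pub fn find_value(sym: &Symbol) -> u64 {
    let mut current = sym;
    loop {
        match current.redirect {
            // SAFETY: the tag says `value` is the member that was written.
            Redirect::PlainVal => return unsafe { current.val.value },
            // SAFETY: the tag says `alias` was written; the caller must
            // guarantee that it points at a live Symbol.
            Redirect::VarAlias => current = unsafe { &*current.val.alias },
        }
    }
}
```

The unsafe blocks are small and each one is justified by the tag that was just checked, which is about as tidy as this kind of code gets.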
You may also be aware that the C code must quickly and frequently access the current value of a large number of Lisp variables. To make this possible, the C code stores these values in global variables. Yes, lots of global variables. In fact, these aren't just file-scoped statics visible to only one translation unit; these are true globals that are accessible across the whole program. We've started porting these to Rust now as well.
    DEFVAR_LISP ("post-self-insert-hook", Vpost_self_insert_hook,
                 doc: /* Hook run at the end of `self-insert-command'.
    This is run after inserting the character.  */);
    Vpost_self_insert_hook = Qnil;
Like DEFUN, DEFVAR_LISP takes both a Lisp name and the C name. The C name becomes the name of the global variable, while the Lisp name is what gets used in Lisp source code. Setting the default value of this variable happens in a separate statement, which is fine.
    /// Hook run at the end of `self-insert-command'.
    /// This is run after inserting the character.
    defvar_lisp!(Vpost_self_insert_hook, "post-self-insert-hook", Qnil);
The Rust version must still take both names (this could be simplified if we wrote this macro using a procedural macro), but it also takes a default value. As before, the docstring becomes a comment which all other Rust tooling will recognize.
You might be interested in how this is implemented as well:
    #define DEFVAR_LISP(lname, vname, doc)                  \
      do {                                                  \
        static struct Lisp_Objfwd o_fwd;                    \
        defvar_lisp (&o_fwd, lname, &globals.f_ ## vname);  \
      } while (false)
The C macro is not very complicated, but there are two somewhat subtle points. First, it creates an (uninitialized) static variable called o_fwd, of type Lisp_Objfwd. This holds the variable's value, which is a Lisp_Object. It then calls the defvar_lisp function to initialize the fields of this struct, and also to register the variable in the Lisp runtime's global environment, making it accessible to Lisp code.
The first subtle point is that every invocation of this macro uses the same variable name, o_fwd. If you invoked this macro more than once inside the same scope, the declarations would all collide with each other. Instead the macro body is wrapped inside a do-while-false loop so that each one has a separate little scope to live in.
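The same scoping trick carries over to Rust macros: wrapping the body in a block gives every expansion its own static, even though they all use the same name. Here is a small sketch of that behaviour. The macro bump_own_counter and the function demo are hypothetical names made up for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Each expansion declares a static called SLOT inside its own block, so two
// expansions never share storage even though the name is identical.
macro_rules! bump_own_counter {
    () => {{
        static SLOT: AtomicUsize = AtomicUsize::new(0);
        // Increment this expansion's counter and return the new value.
        SLOT.fetch_add(1, Ordering::SeqCst) + 1
    }};
}

// On its first call this returns (3, 1): the expansion inside the loop has
// been bumped three times, the second expansion only once. If the two
// expansions shared one static, the second number would be 4.
pub fn demo() -> (usize, usize) {
    let mut first = 0;
    for _ in 0..3 {
        first = bump_own_counter!();
    }
    let second = bump_own_counter!();
    (first, second)
}
```

Without the extra pair of braces in the macro body, two expansions in the same function would both try to define SLOT in one scope, and the compiler would reject the duplicate definition.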
The other subtlety is that the Lisp_Objfwd struct actually only has a pointer to the value; we still have to allocate some storage for that value somewhere. We take the address of a field on something called globals here; that's the real storage location. This globals object is just a big global struct that holds all the global variables; one day when Emacs is really multi-threaded, there can be one of these per thread and a lot of the rest of the code will just work.
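The relationship between the forwarding struct and the storage it points at can be sketched in a few lines of Rust. This is a toy model under several assumptions: Globals here has only two made-up fields, an i64 stands in for Lisp_Object, and the Objfwd type with its get and set methods is a simplification written for this example rather than the real Lisp_Objfwd.

```rust
// Toy version of the big struct that owns the storage for every variable.
#[derive(Debug, Default)]
pub struct Globals {
    pub f_vpost_self_insert_hook: i64,
    pub f_other: i64, // hypothetical second variable
}

// Toy forwarding object: it owns nothing, it only points at a field.
pub struct Objfwd {
    objvar: *mut i64,
}

impl Objfwd {
    pub fn new(slot: &mut i64) -> Self {
        Objfwd { objvar: slot as *mut i64 }
    }

    // SAFETY (for both methods): the pointed-to field must still be alive
    // and must not be accessed through a conflicting reference meanwhile.
    pub unsafe fn set(&self, value: i64) {
        unsafe { *self.objvar = value }
    }

    pub unsafe fn get(&self) -> i64 {
        unsafe { *self.objvar }
    }
}

// Writing through the forwarding object changes the field in `globals`;
// this returns (42, 42, 0).
pub fn demo() -> (i64, i64, i64) {
    let mut globals = Globals::default();
    let fwd = Objfwd::new(&mut globals.f_vpost_self_insert_hook);
    unsafe { fwd.set(42) };
    let via_fwd = unsafe { fwd.get() };
    (via_fwd, globals.f_vpost_self_insert_hook, globals.f_other)
}
```

Because the forwarding object holds only an address, pointing it at a field of a different Globals instance is all it would take to give another thread its own copy of the variable.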
    #[macro_export]
    macro_rules! defvar_lisp {
        ($field_name:ident, $lisp_name:expr, $value:expr) => {{
            #[allow(unused_unsafe)]
            unsafe {
                #[allow(const_err)]
                static mut o_fwd: ::hacks::Hack<::data::Lisp_Objfwd> =
                    unsafe { ::hacks::Hack::uninitialized() };
                ::remacs_sys::defvar_lisp(
                    o_fwd.get_mut(),
                    concat!($lisp_name, "\0").as_ptr() as *const i8,
                    &mut ::remacs_sys::globals.$field_name,
                );
                ::remacs_sys::globals.$field_name = $value;
            }
        }};
    }
The Rust version of this macro is rather longer. Primarily this is because it takes a lot more typing to get a proper uninitialized value in a Rust program. Some would argue that all of this typing is a bad thing, but this is very much an unsafe operation. We're basically promising very precisely that we know this value is uninitialized, and that it will be completely and correctly initialized by the end of this unsafe block.
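For readers who want to try the "uninitialized now, initialized by someone else later" pattern themselves, the standard library offers std::mem::MaybeUninit for it. The sketch below uses that type instead of the project's own Hack helper, so it illustrates the general technique and not the Remacs code. The Objfwd struct and the init_objfwd function are hypothetical stand-ins for the real struct and for the C function that fills it in.

```rust
use std::mem::MaybeUninit;

// Hypothetical struct that some other function is responsible for filling in.
#[derive(Debug, PartialEq)]
pub struct Objfwd {
    pub kind: u32,
    pub objvar: usize,
}

// Stand-in for an initializer that writes every field through a raw pointer.
// SAFETY: `slot` must be valid for writes and properly aligned.
pub unsafe fn init_objfwd(slot: *mut Objfwd, objvar: usize) {
    unsafe { slot.write(Objfwd { kind: 1, objvar }) }
}

pub fn make_objfwd(objvar: usize) -> Objfwd {
    // Reserve storage without pretending that it holds a valid value yet.
    let mut slot = MaybeUninit::<Objfwd>::uninit();
    // SAFETY: `slot` is a live local, so the pointer is valid and aligned.
    unsafe { init_objfwd(slot.as_mut_ptr(), objvar) };
    // SAFETY: init_objfwd has just written every field of the struct.
    unsafe { slot.assume_init() }
}
```

The promise described above is visible in the two safety comments: one says the storage exists, the other says it has been completely filled in before anyone reads it.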
We then call the same defvar_lisp function with the same values, so that the Lisp_Objfwd struct gets initialized and registered in exactly the same way as in the C code. We do have to take care to ensure that the Lisp name of the variable is a null-terminated string though.
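The null termination is easy to experiment with in isolation. In the sketch below, the c_name macro and the lisp_name_len function are hypothetical helpers written for this example; the sketch also uses std::os::raw::c_char for the pointer type rather than i8, since c_char is unsigned on some platforms.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Append a NUL byte to a string literal at compile time and hand back a
// pointer that C code can treat as a `const char *`.
macro_rules! c_name {
    ($name:literal) => {
        concat!($name, "\0").as_ptr() as *const c_char
    };
}

// Read the string back the way C would, stopping at the first NUL byte,
// and report how many bytes came before it.
pub fn lisp_name_len() -> usize {
    let ptr = c_name!("post-self-insert-hook");
    // SAFETY: `ptr` points into a 'static string literal that ends in NUL.
    unsafe { CStr::from_ptr(ptr) }.to_bytes().len()
}
```

Rust string literals carry their length separately and have no terminator of their own, so without the concat! step the C side would keep reading past the end of the name.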
If you know some Rust and some C, or know one but want to learn the other, then you could do a lot worse than to lend a hand. If anything you've read here was interesting to you (and you read this far so I have to assume that was the case), then we would love to have your help. If you'd like to talk to anyone involved in the project, you can join us in our chat room; many of us hang out there.