Introduction

Historically, the LLInt and Baseline JIT haven’t been the source of many publicly disclosed security bugs in JavaScriptCore, but there are a few reasons why it felt necessary to dedicate an entire post solely to these two tiers. The main goal of this post and the blog series is to help researchers navigate the code base and to speed up the analysis process when triaging bugs/crashes. Understanding how the LLInt and Baseline JIT work, and the various components that aid in the functioning of these two tiers, helps one find their bearings within the code base and narrows the search space by making it possible to skip over components that don’t impact the bug/crash. The second reason to review these two tiers is to gain an appreciation for how code generation and execution flow are achieved for unoptimised bytecode. The design principles used by the LLInt and Baseline JIT are shared across the higher tiers, and understanding these principles makes for a gentle learning curve when exploring the DFG and FTL.

Part I of this blog series traced the journey from source code to bytecode and concluded by briefly discussing how bytecodes are passed to the LLInt and how the LLInt initiates execution. This post dives into the details of how bytecode is executed in the LLInt and how the Baseline JIT is invoked to optimise the bytecode. These are the first two tiers in JavaScriptCore, and the stages of the execution pipeline that will be explored are shown in the slide1 reproduced below:

pipeline-stages

This blog post begins by exploring how the LLInt is constructed using a custom assembly called offlineasm and how one can debug and trace bytecode execution in this custom assembly. It also covers the workings of the Baseline JIT and demonstrates how the LLInt determines when the bytecode being executed is hot and should be compiled and executed by the Baseline JIT. The LLInt and Baseline JIT are considered profiling tiers, and this post concludes with a quick introduction to the various profiling sources that the two tiers use. Part III dives into the internals of the Data Flow Graph (DFG) and how bytecode is optimised and generated by this JIT compiler.

Existing Work

In addition to the resources mentioned in Part I, there are a couple of resources that discuss several aspects of JavaScript code optimisation techniques, some of which will be covered in this post and the posts that follow.

The WebKit blog: Speculation in JavaScriptCore is a magnum opus by Filip Pizlo that goes into great detail about how speculative compilation is performed in JavaScriptCore and is a fantastic complementary resource to this blog series. The key areas of the WebKit blog that are relevant to the discussion here are the sections on Control and Profiling.

Another useful resource that will come in handy as you debug and trace the LLInt and Baseline JIT is the WebKit blog JavaScriptCore CSI: A Crash Site Investigation Story. This is also a good resource to get you started with debugging WebKit/JavaScriptCore crashes.

LLInt

This section begins by introducing the LLInt and the custom assembly that is used to construct the LLInt. The LLInt was first introduced in WebKit back in 2012 with the following revision. The revision comment below states the intent behind introducing the LLInt.

Implemented an interpreter that uses the JIT’s calling convention. This interpreter is called LLInt, or the Low Level Interpreter. JSC will now start by executing code in LLInt and will only tier up to the old JIT after the code is proven hot.

LLInt is written in a modified form of our macro assembly. This new macro assembly is compiled by an offline assembler (see offlineasm), which implements many modern conveniences such as a Turing-complete CPS-based macro language and direct access to relevant C++ type information (basically offsets of fields and sizes of structs/classes).

Code executing in LLInt appears to the rest of the JSC world “as if” it were executing in the old JIT. Hence, things like exception handling and cross-execution-engine calls just work and require pretty much no additional overhead.

Essentially, the LLInt loops over bytecodes, executing each bytecode according to its semantics before moving on to the next bytecode instruction. In addition to bytecode execution, it also gathers profiling information about the bytecodes being executed and maintains counters that measure how often code was executed. Both these parameters (i.e. profiling data and execution counts) are crucial in aiding code optimisation and tiering up to the various JIT tiers via a technique called OSR (On-Stack Replacement). The screenshot2 below describes the four JIT tiers and how profiling data and OSR propagate through the engine.

profiling-and-osr
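The execute-and-count behaviour described above can be sketched with a toy interpreter. Everything here — the opcode names, the ToyCodeBlock class and the tier-up threshold — is invented for illustration; JSC’s real heuristics live in CodeBlock and are driven by tunable options.

```ruby
# Illustrative threshold; JSC's real thresholds are configurable options.
TIER_UP_THRESHOLD = 3

class ToyCodeBlock
  attr_reader :execution_count

  def initialize(bytecode)
    @bytecode = bytecode
    @execution_count = 0
  end

  # Interpret the bytecode once, returning [result, hot?]. The loop
  # dispatches on each opcode, executes it, and advances to the next
  # instruction -- the same dispatch shape the LLInt implements in
  # offlineasm -- while bumping an execution counter as it goes.
  def interpret
    stack = []
    pc = 0
    while pc < @bytecode.length
      case @bytecode[pc]
      when :push then stack.push(@bytecode[pc + 1]); pc += 2
      when :add  then b = stack.pop; a = stack.pop; stack.push(a + b); pc += 1
      when :end  then break
      end
    end
    @execution_count += 1
    [stack.last, @execution_count >= TIER_UP_THRESHOLD]
  end
end
```

Running the same block a third time flips the hot flag, which in the real engine is the point where the LLInt would arrange a tier-up into the Baseline JIT.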

The source code to the LLInt is located at JavaScriptCore/llint, and the starting point for this post’s investigation will be LLIntEntrypoint.h, which is also where the LLInt was first encountered in Part I.

Recap

Let’s pick up from where Part I left off, in Interpreter::executeProgram.

CodeBlock* tempCodeBlock;
Exception* error = program->prepareForExecution<ProgramExecutable>(vm, nullptr, scope, CodeForCall, tempCodeBlock);

The program object holds a reference to the CodeBlock that now contains the linked bytecode. The call to prepareForExecution, through a series of calls, ends up calling setProgramEntrypoint(CodeBlock* codeBlock). This function, as the name suggests, is responsible for setting up the entry point into the LLInt to begin executing bytecode. The call stack at this point should look similar to the one below:

libJavaScriptCore.so.1!JSC::LLInt::setProgramEntrypoint(JSC::CodeBlock * codeBlock) (/home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LLIntEntrypoint.cpp:112)
libJavaScriptCore.so.1!JSC::LLInt::setEntrypoint(JSC::CodeBlock * codeBlock) (/home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LLIntEntrypoint.cpp:161)
libJavaScriptCore.so.1!JSC::setupLLInt(JSC::CodeBlock * codeBlock) (/home/amar/workspace/WebKit/Source/JavaScriptCore/runtime/ScriptExecutable.cpp:395)
libJavaScriptCore.so.1!JSC::ScriptExecutable::prepareForExecutionImpl(JSC::ScriptExecutable * const this, JSC::VM & vm, JSC::JSFunction * function, JSC::JSScope * scope, JSC::CodeSpecializationKind kind, JSC::CodeBlock *& resultCodeBlock) (/home/amar/workspace/WebKit/Source/JavaScriptCore/runtime/ScriptExecutable.cpp:432)
libJavaScriptCore.so.1!JSC::ScriptExecutable::prepareForExecution<JSC::ProgramExecutable>(JSC::ScriptExecutable * const this, JSC::VM & vm, JSC::JSFunction * function, JSC::JSScope * scope, JSC::CodeSpecializationKind kind, JSC::CodeBlock *& resultCodeBlock) (/home/amar/workspace/WebKit/Source/JavaScriptCore/bytecode/CodeBlock.h:1086)
libJavaScriptCore.so.1!JSC::Interpreter::executeProgram(JSC::Interpreter * const this, const JSC::SourceCode & source, JSC::JSObject * thisObj) (/home/amar/workspace/WebKit/Source/JavaScriptCore/interpreter/Interpreter.cpp:816)
...

Within the function setProgramEntrypoint is a call to getCodeRef, which attempts to get a reference pointer to the executable address of the opcode llint_program_prologue. This reference pointer is where the interpreter (LLInt) begins execution for the CodeBlock.

ALWAYS_INLINE MacroAssemblerCodeRef<tag> getCodeRef(OpcodeID opcodeID)
{
    return MacroAssemblerCodeRef<tag>::createSelfManagedCodeRef(getCodePtr<tag>(opcodeID));
}

Once a reference pointer to llint_program_prologue has been retrieved, a NativeJITCode object is created which stores this code pointer and then initialises the codeBlock with a reference to the NativeJITCode object.

std::call_once(onceKey, [&] {
        jitCode = new NativeJITCode(getCodeRef<JSEntryPtrTag>(llint_program_prologue), JITType::InterpreterThunk, Intrinsic::NoIntrinsic, JITCode::ShareAttribute::Shared);
    });
codeBlock->setJITCode(makeRef(*jitCode));

Finally, the linked bytecode is ready to execute with a call to JITCode::execute. The function is as follows:

ALWAYS_INLINE JSValue JITCode::execute(VM* vm, ProtoCallFrame* protoCallFrame)
{
    //... code truncated for brevity

    void* entryAddress;
    entryAddress = addressForCall(MustCheckArity).executableAddress();
    JSValue result = JSValue::decode(vmEntryToJavaScript(entryAddress, vm, protoCallFrame));
    return scope.exception() ? jsNull() : result;
}

The key function in the snippet above is vmEntryToJavaScript which is a thunk defined in the LowLevelInterpreter.asm. The WebKit blog JavaScriptCore CSI: A Crash Site Investigation Story describes the thunk as follows:

vmEntryToJavaScript is implemented in LLInt assembly using the doVMEntry macro (see LowLevelInterpreter.asm and LowLevelInterpreter64.asm). The JavaScript VM enters all LLInt or JIT code via doVMEntry, and it will exit either via the end of doVMEntry (for normal returns), or via _handleUncaughtException (for exits due to uncaught exceptions).

At this point, execution transfers to the LLInt which now has a reference to the CodeBlock and entryAddress to begin execution from.

Implementation

Before proceeding any further into the details of vmEntryToJavaScript and doVMEntry, it will be worth the reader’s time to understand the custom assembly that the LLInt is written in. The LLInt is generated using what is referred to as offlineasm assembly. offlineasm is written in Ruby and can be found under JavaScriptCore/offlineasm. The LLInt itself is defined in LowLevelInterpreter.asm and LowLevelInterpreter64.asm.

The machine code generated as part of the LLInt is located in LLIntAssembly.h, which can be found under the <webkit-folder>/WebKitBuild/Debug/DerivedSources/JavaScriptCore/ directory. This header file is generated at compile time by invoking offlineasm/asm.rb. This build step is listed in JavaScriptCore/CMakeLists.txt:

# The build system will execute asm.rb every time LLIntOffsetsExtractor's mtime is newer than
# LLIntAssembly.h's mtime. The problem we have here is: asm.rb has some built-in optimization
# that generates a checksum of the LLIntOffsetsExtractor binary, if the checksum of the new
# LLIntOffsetsExtractor matches, no output is generated. To make this target consistent and avoid
# running this command for every build, we artificially update LLIntAssembly.h's mtime (using touch)
# after every asm.rb run.
if (MSVC AND NOT ENABLE_C_LOOP)
    #... truncated for brevity
else ()
    set(LLIntOutput LLIntAssembly.h)
endif ()

add_custom_command(
    OUTPUT ${JavaScriptCore_DERIVED_SOURCES_DIR}/${LLIntOutput}
    MAIN_DEPENDENCY ${JSCCORE_DIR}/offlineasm/asm.rb
    DEPENDS LLIntOffsetsExtractor ${LLINT_ASM} ${OFFLINE_ASM} ${JavaScriptCore_DERIVED_SOURCES_DIR}/InitBytecodes.asm ${JavaScriptCore_DERIVED_SOURCES_DIR}/InitWasm.asm
    COMMAND ${CMAKE_COMMAND} -E env CMAKE_CXX_COMPILER_ID=${CMAKE_CXX_COMPILER_ID} GCC_OFFLINEASM_SOURCE_MAP=${GCC_OFFLINEASM_SOURCE_MAP} ${RUBY_EXECUTABLE} ${JAVASCRIPTCORE_DIR}/offlineasm/asm.rb -I${JavaScriptCore_DERIVED_SOURCES_DIR}/ ${JAVASCRIPTCORE_DIR}/llint/LowLevelInterpreter.asm $<TARGET_FILE:LLIntOffsetsExtractor> ${JavaScriptCore_DERIVED_SOURCES_DIR}/${LLIntOutput} ${OFFLINE_ASM_ARGS}
    COMMAND ${CMAKE_COMMAND} -E touch_nocreate ${JavaScriptCore_DERIVED_SOURCES_DIR}/${LLIntOutput}
    WORKING_DIRECTORY ${JavaScriptCore_DERIVED_SOURCES_DIR}
    VERBATIM)

# The explanation for not making LLIntAssembly.h part of the OBJECT_DEPENDS property of some of
# the .cpp files below is similar to the one in the previous comment. However, since these .cpp
# files are used to build JavaScriptCore itself, we can just add LLIntAssembly.h to JavaScript_HEADERS
# since it is used in the add_library() call at the end of this file.
if (MSVC AND NOT ENABLE_C_LOOP)
    #... truncated for brevity
else ()
    # As there's poor toolchain support for using `.file` directives in
    # inline asm (i.e. there's no way to avoid clashes with the `.file`
    # directives generated by the C code in the compilation unit), we
    # introduce a postprocessing pass for the asm that gets assembled into
    # an object file. We only need to do this for LowLevelInterpreter.cpp
    # and cmake doesn't allow us to introduce a compiler wrapper for a
    # single source file, so we need to create a separate target for it.
    add_library(LowLevelInterpreterLib OBJECT llint/LowLevelInterpreter.cpp
        ${JavaScriptCore_DERIVED_SOURCES_DIR}/${LLIntOutput})
endif ()

As the snippet above indicates, this generated header file is included in llint/LowLevelInterpreter.cpp, which embeds the interpreter into JavaScriptCore.

// This works around a bug in GDB where, if the compilation unit
// doesn't have any address range information, its line table won't
// even be consulted. Emit {before,after}_llint_asm so that the code
// emitted in the top level inline asm statement is within functions
// visible to the compiler. This way, GDB can resolve a PC in the
// llint asm code to this compilation unit and the successfully look
// up the line number information.
DEBUGGER_ANNOTATION_MARKER(before_llint_asm)

// This is a file generated by offlineasm, which contains all of the assembly code
// for the interpreter, as compiled from LowLevelInterpreter.asm.
#include "LLIntAssembly.h"

DEBUGGER_ANNOTATION_MARKER(after_llint_asm)

The offlineasm compilation at a high-level functions as follows:

  1. asm.rb is invoked by supplying LowLevelInterpreter.asm and a target backend (i.e. cpu architecture) as input.
  2. The .asm files are lexed and parsed by the offlineasm parser defined in parser.rb.
  3. Successful parsing generates an Abstract Syntax Tree (AST), the schema for which is defined in ast.rb.
  4. The generated AST is then transformed (see transform.rb) before it is lowered to the target backend.
  5. The nodes of the transformed AST are then traversed and machine code is emitted for each node. The machine code to be emitted for each target backend is defined in its own Ruby file. For example, the machine code for x86 is defined in x86.rb.
  6. The machine code emitted for the target backend is written to LLIntAssembly.h.

This process of offlineasm compilation is very similar to the way JavaScriptCore generates bytecodes from supplied JavaScript source code. In this case, however, the machine code is generated from the offlineasm assembly. A list of all offlineasm instructions and registers can be found in instructions.rb and registers.rb respectively. offlineasm supports multiple cpu architectures, which are referred to as backends. The various supported backends are listed in backends.rb.
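As a rough sketch of that lex → parse → lower flow, the toy below parses a single offlineasm-style instruction into a node and lowers it with a per-backend table, loosely analogous to x86.rb and friends. The lowering rules, node shape and names here are made up for illustration and are far simpler than the real offlineasm pipeline.

```ruby
# A toy "AST node": one instruction with its opcode and operand list.
Node = Struct.new(:opcode, :operands)

# Lex/parse one line of offlineasm-style source into a Node.
def parse(line)
  opcode, rest = line.split(" ", 2)
  Node.new(opcode, rest.split(",").map(&:strip))
end

# One lowering table per backend, loosely analogous to x86.rb, arm64.rb,
# etc. The mappings below are invented for illustration.
X86_LOWERING = {
  "addp" => ->(ops) { "addq #{ops[0]}, #{ops[1]}" },
  "move" => ->(ops) { "movq #{ops[0]}, #{ops[1]}" },
}

# Walk a node and emit "machine code" text for the chosen backend.
def lower(node, backend)
  backend.fetch(node.opcode).call(node.operands)
end
```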

The reader may be wondering why the LLInt is written in offlineasm rather than C/C++, which is what pretty much the rest of the engine is written in. A good discussion of this matter can be found in the How Profiled Execution Works section of the WebKit blogpost3, which explains the trade-offs between using a custom assembly and C/C++. The blog also describes two key features of offlineasm:

  • Portable assembly with our own mnemonics and register names that match the way we do portable assembly in our JIT. Some high-level mnemonics require lowering. Offlineasm reserves some scratch registers to use for lowering.

  • The macro construct. It’s best to think of this as a lambda that takes some arguments and returns void. Then think of the portable assembly statements as print statements that output that assembly. So, the macros are executed for effect and that effect is to produce an assembly program. These are the execution semantics of offlineasm at compile time.

Offlineasm

At this point the reader should know how the LLInt is implemented and where to find the machine code that’s generated for it. This section discusses the language itself and how to go about reading it. The developer comments at the start of LowLevelInterpreter.asm provide an introduction to the language and are definitely worth reading. This section will highlight the various constructs of the offlineasm language and provide examples from the codebase.

Macros

Most instructions are grouped as macros, which according to the developer comments are lambda expressions.

A “macro” is a lambda expression, which may be either anonymous or named. But this has caveats. “macro” can take zero or more arguments, which may be macros or any valid operands, but it can only return code. But you can do Turing-complete things via continuation passing style: “macro foo (a, b) b(a, a) end foo(foo, foo)”. Actually, don’t do that, since you’ll just crash the assembler.

The following snippet is an example of the dispatch macro.

macro dispatch(advanceReg)
    addp advanceReg, PC
    nextInstruction()
end

The macro above takes one argument, advanceReg. The macro body contains two statements: the first is the addp instruction, which takes the two operands advanceReg and PC; the second is a call to the macro nextInstruction().
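Conceptually, such a macro behaves like a lambda whose only effect is to emit assembly text. The toy emitter below models dispatch and nextInstruction as Ruby lambdas that append instructions to an output program; the emitted jmp line is invented for illustration and is not offlineasm’s real dispatch sequence.

```ruby
# Accumulates the "assembly program" produced by executing the macros.
$output = []

def emit(line)
  $output << line
end

# nextInstruction modelled as a lambda that "prints" an (invented)
# indirect jump to the next opcode's handler.
next_instruction = -> { emit("jmp [PB, PC, 8]") }

# dispatch(advanceReg) from the post: it "prints" an addp instruction,
# then invokes the next_instruction macro.
dispatch = lambda do |advance_reg|
  emit("addp #{advance_reg}, PC")
  next_instruction.call
end

# Executing the macro at "compile time" produces the assembly program.
dispatch.call("t0")
```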

Another important aspect to consider about macros is the scoping of arguments. The developer comments have the following to say on this matter:

Arguments to macros follow lexical scoping rather than dynamic scoping. Const’s also follow lexical scoping and may override (hide) arguments or other consts. All variables (arguments and constants) can be bound to operands. Additionally, arguments (but not constants) can be bound to macros.

Macros are not always named and can exist as anonymous macros. The snippet below is an example of an anonymous macro being used in llint_program_prologue, which is the glue code that allows the LLInt to find the entry point to the linked bytecode in the supplied code block:

op(llint_program_prologue, macro ()
    prologue(notFunctionCodeBlockGetter, notFunctionCodeBlockSetter, _llint_entry_osr, _llint_trace_prologue)
    dispatch(0)
end)

Instructions

The instructions in offlineasm generally follow GNU Assembler syntax. The developer comments for instructions are as follows:

Mostly gas-style operand ordering. The last operand tends to be the destination. So “a := b” is written as “mov b, a”. But unlike gas, comparisons are in-order, so “if (a < b)” is written as “bilt a, b, …”.

In the snippet below, the move instruction takes two operands, lr and destinationRegister. The value in lr is moved to destinationRegister:

move lr, destinationRegister

Some instructions will also contain postfixes which provide additional information on the behaviour of the instruction. The various postfixes that can be added to instructions are documented in the developer comment below:

“b” = byte, “h” = 16-bit word, “i” = 32-bit word, “p” = pointer. For 32-bit, “i” and “p” are interchangeable except when an op supports one but not the other.

In the snippet below, the add instruction is postfixed with p, indicating that this is a pointer addition operation where the value of advanceReg is added to PC.

macro dispatch(advanceReg)
    addp advanceReg, PC
    nextInstruction()
end

Operands

Instructions take one or more operands. A note on operands for instructions and macros from the developer comments is as follows:

In general, valid operands for macro invocations and instructions are registers (eg “t0”), addresses (eg “4[t0]”), base-index addresses (eg “7[t0, t1, 2]”), absolute addresses (eg “0xa0000000[]”), or labels (eg “_foo” or “.foo”). Macro invocations can also take anonymous macros as operands. Instructions cannot take anonymous macros.

The following snippet shows some of the various operand types in use (i.e. registers, addresses, base-index addresses and labels):

.copyLoop:
    if ARM64 and not ADDRESS64
        subi MachineRegisterSize, temp2
        loadq [sp, temp2, 1], temp3
        storeq temp3, [temp1, temp2, 1]
        btinz temp2, .copyLoop
    else
        subi PtrSize, temp2
        loadp [sp, temp2, 1], temp3
        storep temp3, [temp1, temp2, 1]
        btinz temp2, .copyLoop
    end

    move temp1, sp
    jmp callee, callPtrTag
end

Registers

Some notes on the various registers in use by offlineasm, reproduced from the developer comments:

cfr and sp hold the call frame and (native) stack pointer respectively. They are callee-save registers, and guaranteed to be distinct from all other registers on all architectures.

t0, t1, t2, t3, t4, and optionally t5, t6, and t7 are temporary registers that can get trashed on calls, and are pairwise distinct registers. t4 holds the JS program counter, so use with caution in opcodes (actually, don’t use it in opcodes at all, except as PC).

r0 and r1 are the platform’s customary return registers, and thus are two distinct registers

a0, a1, a2 and a3 are the platform’s customary argument registers, and thus are pairwise distinct registers. Be mindful that:

  • On X86, there are no argument registers. a0 and a1 are edx and ecx following the fastcall convention, but you should still use the stack to pass your arguments. The cCall2 and cCall4 macros do this for you.

There are additional assumptions and platform specific details about some of these registers that the reader is welcome to explore.

Labels

Labels are much like goto labels in C/C++, and the developer notes on labels have the following to say:

Labels must have names that begin with either “_” or “.”. A “.” label is local and gets renamed before code gen to minimize namespace pollution. A “_” label is an extern symbol (i.e. “.globl”). The “_” may or may not be removed during code gen depending on whether the asm conventions for C name mangling on the target platform mandate a “_” prefix.

The snippet below shows an example of local labels (i.e. .afterHandlingTraps, .handleTraps, etc.) in use:

llintOp(op_check_traps, OpCheckTraps, macro (unused, unused, dispatch)
    loadp CodeBlock[cfr], t1
    loadp CodeBlock::m_vm[t1], t1
    loadb VM::m_traps+VMTraps::m_needTrapHandling[t1], t0
    btpnz t0, .handleTraps
.afterHandlingTraps:
    dispatch()
.handleTraps:
    callTrapHandler(.throwHandler)
    jmp .afterHandlingTraps
.throwHandler:
    jmp _llint_throw_from_slow_path_trampoline
end)
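The renaming of local labels before code gen can be modelled with a small mangler that prefixes each “.” label with the enclosing scope. The scope__label naming scheme below is hypothetical, not offlineasm’s actual renaming format.

```ruby
# Rename every local (".") label in the given lines by prefixing it with
# the enclosing scope, so two macros can both use ".handleTraps" without
# their generated labels colliding. (Illustrative naming scheme only.)
def mangle_labels(lines, scope)
  lines.map { |line| line.gsub(/\.(\w+)/) { "#{scope}__#{$1}" } }
end
```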

An example of global labels (i.e. “_” labels) is shown in the snippet below:

if C_LOOP or C_LOOP_WIN
    _llint_vm_entry_to_javascript:
else
    global _vmEntryToJavaScript
    _vmEntryToJavaScript:
end
    doVMEntry(makeJavaScriptCall)

Global labels have global scope and can be referenced anywhere in the assembly, whereas local labels are scoped to a macro and can only be referenced within the macro that defines them.

Conditional Statements

Another interesting construct in the previous snippet is the if statement. The developer comments on if statements are as follows:

An “if” is a conditional on settings. Any identifier supplied in the predicate of an “if” is assumed to be a #define that is available during code gen. So you can’t use “if” for computation in a macro, but you can use it to select different pieces of code for different platforms.

The snippet below shows an example of an if statement:

if C_LOOP or C_LOOP_WIN or ARMv7 or ARM64 or ARM64E or MIPS
   # In C_LOOP or C_LOOP_WIN case, we're only preserving the bytecode vPC.
   move lr, destinationRegister
elsif X86 or X86_WIN or X86_64 or X86_64_WIN
    pop destinationRegister
else
    error
end

The predicates within the if statements, i.e. C_LOOP, ARM64, X86, etc are defined in the JavaScriptCore codebase and effectively perform the same function as #ifdef statements in C/C++.
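The build-time nature of these predicates can be modelled by evaluating a settings table while emitting code, so that only one branch’s instructions ever reach the output. This mirrors the lr/pop example above; the settings hash and branch bodies are illustrative.

```ruby
# Build-time settings, analogous to offlineasm's backend #defines.
SETTINGS = { "X86_64" => true, "ARM64" => false, "C_LOOP" => false }

# Evaluate the `if` at "assembly time": only the selected branch's
# instruction is ever emitted, the others never reach LLIntAssembly.h.
def select_branch(settings)
  if settings["C_LOOP"] || settings["ARM64"]
    "move lr, destinationRegister"
  elsif settings["X86_64"]
    "pop destinationRegister"
  else
    raise "error" # mirrors offlineasm's `error` directive
  end
end
```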

Const Expressions

Const expressions allow offlineasm to define constant values to be used by the assembly or reference values implemented by the JIT ABI. The ABI references are translated to offsets at compile time by the offlineasm interpreter. An example of const declarations is shown in the snippet below:

# These declarations must match interpreter/JSStack.h.

const PtrSize = constexpr (sizeof(void*))
const MachineRegisterSize = constexpr (sizeof(CPURegister))
const SlotSize = constexpr (sizeof(Register))

if JSVALUE64
    const CallFrameHeaderSlots = 5
else
    const CallFrameHeaderSlots = 4
    const CallFrameAlignSlots = 1
end

The values PtrSize, MachineRegisterSize and SlotSize are determined at compile time when the relevant expressions are evaluated. The values of CPURegister and Register are defined in stdlib.h for the target architecture. The CallFrameHeaderSlots and CallFrameAlignSlots are constant values that are referenced in LowLevelInterpreter.asm.
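A sketch of this arrangement: a lookup table stands in for the constants baked into the LLIntOffsetsExtractor binary, and name resolution substitutes them when code is emitted. The values shown assume a 64-bit target and are purely illustrative.

```ruby
# Stand-in for the sizes/offsets the C++ compiler computes and bakes
# into LLIntOffsetsExtractor; asm.rb reads them back at build time.
# Values assume a 64-bit target (illustrative only).
CONSTEXPR_VALUES = {
  "sizeof(void*)" => 8,
  "sizeof(CPURegister)" => 8,
  "sizeof(Register)" => 8,
}

# Resolve a `constexpr (...)` expression to its baked-in value.
def resolve_const(expr)
  CONSTEXPR_VALUES.fetch(expr)
end

# Mirrors the `if JSVALUE64` block: the constant's value is chosen by a
# build-time setting, not at runtime.
def call_frame_header_slots(jsvalue64)
  jsvalue64 ? 5 : 4
end
```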

Tracing Execution

JavaScriptCore provides two commandline flags to enable tracing execution within the LLInt: traceLLIntExecution and traceLLIntSlowPath. However, in order to use these flags, one needs to enable LLInt tracing in the LLInt configuration. This is achieved by setting LLINT_TRACING in LLIntCommon.h:

// Enables LLINT tracing.
// - Prints every instruction executed if Options::traceLLIntExecution() is enabled.
// - Prints some information for some of the more subtle slow paths if
//   Options::traceLLIntSlowPath() is enabled.
#define LLINT_TRACING 1

The two flags can now be added to the run configuration in launch.json or passed on the commandline. Here’s what launch.json should look like:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/jsc",
            "args": ["--reportBytecodeCompileTimes=true", "--dumpGeneratedBytecodes=true", "--useJIT=false", "--traceLLIntExecution=true","--traceLLIntSlowPath=true", "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/test.js"],

//... truncated for brevity

        }
    ]
}

Let’s quickly revisit our test script:

$ cat test.js
let x = 10;
let y = 20;
let z = x + y;

and the bytecodes generated for it:

<global>#AmfQ2h:[0x7fffee3bc000->0x7fffeedcb848, NoneGlobal, 96]: 18 instructions (0 16-bit instructions, 0 32-bit instructions, 11 instructions with metadata); 216 bytes (120 metadata bytes); 1 parameter(s); 12 callee register(s); 6 variable(s); scope at loc4

bb#1
[   0] enter              
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps        
[   7] mov                loc6, Undefined(const0)
[  10] resolve_scope      loc7, loc4, 0, GlobalProperty, 0
[  17] put_to_scope       loc7, 0, Int32: 10(const1), 1048576<DoNotThrowIfNotFound|GlobalProperty|Initialization|NotStrictMode>, 0, 0
[  25] resolve_scope      loc7, loc4, 1, GlobalProperty, 0
[  32] put_to_scope       loc7, 1, Int32: 20(const2), 1048576<DoNotThrowIfNotFound|GlobalProperty|Initialization|NotStrictMode>, 0, 0
[  40] resolve_scope      loc7, loc4, 2, GlobalProperty, 0
[  47] resolve_scope      loc8, loc4, 0, GlobalProperty, 0
[  54] get_from_scope     loc9, loc8, 0, 2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, 0, 0
[  62] mov                loc8, loc9
[  65] resolve_scope      loc9, loc4, 1, GlobalProperty, 0
[  72] get_from_scope     loc10, loc9, 1, 2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, 0, 0
[  80] add                loc8, loc8, loc10, OperandTypes(126, 126)
[  86] put_to_scope       loc7, 2, loc8, 1048576<DoNotThrowIfNotFound|GlobalProperty|Initialization|NotStrictMode>, 0, 0
[  94] end                loc6
Successors: [ ]


Identifiers:
  id0 = x
  id1 = y
  id2 = z

Constants:
   k0 = Undefined
   k1 = Int32: 10: in source as integer
   k2 = Int32: 20: in source as integer

With LLInt tracing enabled and the --traceLLIntExecution=true flag passed to the jsc shell on the commandline, the execution trace for each bytecode is dumped to stdout:

<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: in prologue of <global>#AmfQ2h:[0x7fffee3bc000->0x7fffeedcb848, LLIntGlobal, 96]
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#0, op_enter, pc = 0x7fffeedf49c0
Frame will eventually return to 0x7ffff4e61403
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#1, op_get_scope, pc = 0x7fffeedf49c1
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#3, op_mov, pc = 0x7fffeedf49c3
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#6, op_check_traps, pc = 0x7fffeedf49c6
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#7, op_mov, pc = 0x7fffeedf49c7
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#10, op_resolve_scope, pc = 0x7fffeedf49ca
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#17, op_put_to_scope, pc = 0x7fffeedf49d1
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#25, op_resolve_scope, pc = 0x7fffeedf49d9
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#32, op_put_to_scope, pc = 0x7fffeedf49e0
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#40, op_resolve_scope, pc = 0x7fffeedf49e8
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#47, op_resolve_scope, pc = 0x7fffeedf49ef
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#54, op_get_from_scope, pc = 0x7fffeedf49f6
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#62, op_mov, pc = 0x7fffeedf49fe
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#65, op_resolve_scope, pc = 0x7fffeedf4a01
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#72, op_get_from_scope, pc = 0x7fffeedf4a08
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#80, op_add, pc = 0x7fffeedf4a10
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#86, op_put_to_scope, pc = 0x7fffeedf4a16
<0x7fffeedff000> 0x7fffee3bc000 / 0x7fffffffcb60: executing bc#94, op_end, pc = 0x7fffeedf4a1e

Let’s return to the discussion of vmEntryToJavaScript that was left off in the Recap section. As stated previously, this is the entry point into the LLInt. It is defined under a global label within LowLevelInterpreter.asm as follows:

# ... asm truncated for brevity 

    global _vmEntryToJavaScript
    _vmEntryToJavaScript:

    doVMEntry(makeJavaScriptCall)

This effectively calls the macro doVMEntry with the macro makeJavaScriptCall passed as an argument. These two macros are defined in LowLevelInterpreter64.asm.

The macro doVMEntry performs a number of actions before it calls the macro makeJavaScriptCall: setting up the function prologue, saving register state, checking stack pointer alignment, adding a VMEntryRecord and setting up the stack with arguments for the call to makeJavaScriptCall. A truncated assembly snippet is shown below:

macro doVMEntry(makeCall)
    functionPrologue()
    pushCalleeSaves()

    const entry = a0
    const vm = a1
    const protoCallFrame = a2

    vmEntryRecord(cfr, sp)

    checkStackPointerAlignment(t4, 0xbad0dc01)

    //... assembly truncated for brevity

    checkStackPointerAlignment(extraTempReg, 0xbad0dc02)

    makeCall(entry, protoCallFrame, t3, t4)     <-- call to makeJavaScriptCall which initiates bytecode execution

    checkStackPointerAlignment(t2, 0xbad0dc03)

    vmEntryRecord(cfr, t4)

    //... assembly truncated for brevity

    subp cfr, CalleeRegisterSaveSize, sp

    popCalleeSaves()
    functionEpilogue()
    ret

//... assembly truncated for brevity

end
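The checkStackPointerAlignment calls above verify 16-byte stack alignment and bail with a distinctive tag (0xbad0dc01, 0xbad0dc02, …) on failure, which makes the failing site easy to identify in a crash. A minimal model of that check, with the crash replaced by a Ruby exception:

```ruby
# Verify the stack pointer is 16-byte aligned; on failure, fail loudly
# with the caller-supplied tag so the crashing site is identifiable.
# (Toy model: the real macro crashes the process rather than raising.)
def check_stack_pointer_alignment(sp, tag)
  raise format("stack misaligned, tag=%#x", tag) unless (sp & 0xf).zero?
  sp
end
```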

When the call to makeJavaScriptCall returns, doVMEntry once again checks stack alignment, updates the VMEntryRecord, restores saved registers and invokes the function epilogue macro before returning control to its caller. makeCall in doVMEntry invokes makeJavaScriptCall, which is defined as follows:

# a0, a2, t3, t4
macro makeJavaScriptCall(entry, protoCallFrame, temp1, temp2)
    addp 16, sp
    //... assembly truncated for brevity
    call entry, JSEntryPtrTag

    subp 16, sp
end

The call parameter entry here refers to the glue code llint_program_prologue that was set during the LLInt setup stage. This glue code is defined in LowLevelInterpreter.asm as follows:

op(llint_program_prologue, macro ()
    prologue(notFunctionCodeBlockGetter, notFunctionCodeBlockSetter, _llint_entry_osr, _llint_trace_prologue)
    dispatch(0)
end)

This glue code when compiled by offlineasm gets emitted in LLIntAssembly.h, a truncated snippet of which is shown below:

OFFLINE_ASM_GLUE_LABEL(llint_program_prologue)
".loc 1 1346\n"
    // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter.asm:1346
    // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter.asm:777
".loc 1 777\n"
    "\tpush %rbp\n"
".loc 1 783\n"
    "\tmovq %rsp, %rbp\n"                                    // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter.asm:783

    //... code truncated for brevity

At runtime this resolves to the address of <llint_op_enter_wide32+5>. This can also be seen by setting a breakpoint within JITCode::execute and inspecting the value at entryAddress. The screenshot below shows the value of entryAddress at runtime and the instruction dump at that address.

entry-address

Another handy feature of vscode is the Allow Breakpoints Everywhere setting. Enabling it allows setting breakpoints directly in LowLevelInterpreter.asm and LowLevelInterpreter64.asm, which saves a bit of time compared to setting breakpoints in gdb to break in the LLInt.

allow-bp-everywhere

This would now allow source-level debugging in offlineasm source files. However, this isn’t a foolproof method as vscode is unable to resolve branching instructions that rely on indirect address resolution. A good example of this is the call instruction to entry in makeJavaScriptCall.

call entry, JSEntryPtrTag

The jump address for entry is stored in register rdi and a breakpoint would need to be set manually at the address pointed to by rdi. In the screenshot below, the debugger pauses execution at call entry, JSEntryPtrTag; from within gdb this allows listing the current instruction the debugger is stopped at and the instructions that execution would jump to:

instruction-level-debugging

Fortunately, the WebKit developers have added a line table to LLIntAssembly.h, which allows the debugger to cross-reference LowLevelInterpreter.asm and LowLevelInterpreter64.asm while stepping through instructions in gdb. The screenshot below is an example of what it would look like stepping through llint_op_enter_wide32:

line-table-example

With the locations in the offlineasm source files specified in the line table and the ability to set breakpoints directly at points of interest in the .asm files, one can continue debugging at the source-code level. The screenshot below shows an example of what that would look like in vscode:

asm-source-level-debug

At this point one should be able to enable execution tracing in the LLInt and set up breakpoints in offlineasm source files. Also discussed was how the jump to the llint_program_prologue glue code initiates the execution of bytecode.

This section will now discuss the LLInt execution loop, which iterates over the bytecodes and executes them sequentially. A high-level overview of this execution loop is as follows:

  1. Call dispatch with an argument specifying how far to advance the PC
  2. Increment the PC by that amount
  3. Look up the opcode map and fetch the address of the llint opcode label for the corresponding bytecode to be executed
  4. Jump to that llint opcode label, which is the start of the machine code implementing the bytecode
  5. Once execution has completed, repeat from step 1
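The loop above can be sketched as a toy interpreter. This is a minimal illustration only; the opcode numbers, widths and handlers here are hypothetical and far simpler than the LLInt's real dispatch:

```javascript
// A toy dispatch loop mirroring the five steps above. Opcode 0 is a no-op
// and opcode 1 is a hypothetical "mov immediate" (opcode, dst, imm).
const OP_WIDTHS = { 0: 1, 1: 3 };              // bytes consumed per opcode
const handlers = {
    0: (state) => {},                          // nop: nothing to execute
    1: (state) => {                            // mov: regs[dst] = imm
        state.regs[state.bytes[state.pc + 1]] = state.bytes[state.pc + 2];
    }
};

function run(bytes) {
    const state = { bytes, pc: 0, regs: [] };
    while (state.pc < bytes.length) {
        const opcode = state.bytes[state.pc];  // fetch the opcode at PC
        handlers[opcode](state);               // "jump" to its implementation
        state.pc += OP_WIDTHS[opcode];         // dispatch: advance PC and repeat
    }
    return state.regs;
}

console.log(run([0, 1, 0, 42])[0]); // 42
```

The real LLInt performs the same fetch/lookup/jump cycle in assembly, with _g_opcodeMap playing the role of the handlers table.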

Let’s look at this loop in more detail by examining the execution of the opcode mov, which is at bytecode bc#3.

[   3] mov                loc5, loc4

Begin by setting a breakpoint at the start of the macro dispatch definition in LowLevelInterpreter.asm. When the call to dispatch is made, advanceReg contains the value 0x2. PC currently points to bytecode bc#1 and is incremented by 0x2 to point to bytecode bc#3, which is the mov bytecode in our bytecode dump.

macro dispatch(advanceReg)
    addp advanceReg, PC
    nextInstruction()
end

With PC incremented and pointing to bc#3, a call to nextInstruction() is made. The macro nextInstruction looks up _g_opcodeMap to find the opcode implementation in the LLInt. Once the jmp in nextInstruction() is taken, execution control ends up in the LLInt assembly for llint_op_mov.

macro nextInstruction()
    loadb [PB, PC, 1], t0
    leap _g_opcodeMap, t1
    jmp [t1, t0, PtrSize], BytecodePtrTag
end

The bytecode opcodes are implemented in this section of LowLevelInterpreter64.asm and are referenced via the llint opcode labels defined in LLIntAssembly.h. An example of this is the mov opcode, which is referenced by the label op_mov:


llintOpWithReturn(op_mov, OpMov, macro (size, get, dispatch, return)
    get(m_src, t1)
    loadConstantOrVariable(size, t1, t2)
    return(t2)
end)

And the corresponding definition in the LLIntAssembly.h is as follows:

OFFLINE_ASM_OPCODE_LABEL(op_mov)
".loc 3 358\n"
    "\taddq %r13, %r8\n"                                     // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:358
".loc 3 368\n"
    "\tmovq %rbp, %rdi\n"                                    // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:368
".loc 3 369\n"
    "\tmovq %r8, %rsi\n"                                     // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:369
".loc 1 704\n"
    "\tmovq %rsp, %r8\n"                                     // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter.asm:704
    "\tandq $15, %r8\n"

//... truncated for brevity

".loc 3 513\n"
    "\tcmpq $16, %rsi\n"                                     // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:513
    "\tjge " LOCAL_LABEL_STRING(_offlineasm_llintOpWithReturn__llintOp__commonOp__fn__fn__makeReturn__fn__fn__loadConstantOrVariable__size__k__57_load__constant) "\n"
".loc 3 514\n"
    "\tmovq 0(%rbp, %rsi, 8), %rdx\n"                        // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:514
".loc 3 515\n"
    "\tjmp " LOCAL_LABEL_STRING(_offlineasm_llintOpWithReturn__llintOp__commonOp__fn__fn__makeReturn__fn__fn__loadConstantOrVariable__size__k__57_load__done) "\n" // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:515

  OFFLINE_ASM_LOCAL_LABEL(_offlineasm_llintOpWithReturn__llintOp__commonOp__fn__fn__makeReturn__fn__fn__loadConstantOrVariable__size__k__57_load__constant)
".loc 3 489\n"
    "\tmovq 16(%rbp), %rdx\n"                                // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:489
".loc 3 490\n"
    "\tmovq 176(%rdx), %rdx\n"                               // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:490
".loc 3 491\n"
    "\tmovq -128(%rdx, %rsi, 8), %rdx\n"                     // /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:491

  OFFLINE_ASM_LOCAL_LABEL(_offlineasm_llintOpWithReturn__llintOp__commonOp__fn__fn__makeReturn__fn__fn__loadConstantOrVariable__size__k__57_load__done)

//... truncated for brevity

One can also verify this by dumping the assembly from the debugger:

Dump of assembler code for function llint_op_mov:
   0x00007ffff4e65b4f <+0>: add    r8,r13
   0x00007ffff4e65b52 <+3>: mov    rdi,rbp
   0x00007ffff4e65b55 <+6>: mov    rsi,r8
   0x00007ffff4e65b58 <+9>: mov    r8,rsp
   0x00007ffff4e65b5b <+12>:    and    r8,0xf
   0x00007ffff4e65b5f <+16>:    test   r8,r8
   0x00007ffff4e65b62 <+19>:    je     0x7ffff4e65b6f <llint_op_mov+32>
   0x00007ffff4e65b64 <+21>:    movabs r8,0xbad0c002
   0x00007ffff4e65b6e <+31>:    int3   
   0x00007ffff4e65b6f <+32>:    call   0x7ffff5e89bd1 <JavaScript::LLInt::llint_trace(JavaScript::CallFrame*, JavaScript::Instruction const*)>
   0x00007ffff4e65b74 <+37>:    mov    r8,rax
   0x00007ffff4e65b77 <+40>:    sub    r8,r13
   0x00007ffff4e65b7a <+43>:    movsx  rsi,BYTE PTR [r13+r8*1+0x2]
   0x00007ffff4e65b80 <+49>:    cmp    rsi,0x10
   0x00007ffff4e65b84 <+53>:    jge    0x7ffff4e65b8d <llint_op_mov+62>
   0x00007ffff4e65b86 <+55>:    mov    rdx,QWORD PTR [rbp+rsi*8+0x0]
   0x00007ffff4e65b8b <+60>:    jmp    0x7ffff4e65b9d <llint_op_mov+78>
   0x00007ffff4e65b8d <+62>:    mov    rdx,QWORD PTR [rbp+0x10]
   0x00007ffff4e65b91 <+66>:    mov    rdx,QWORD PTR [rdx+0xb0]
   0x00007ffff4e65b98 <+73>:    mov    rdx,QWORD PTR [rdx+rsi*8-0x80]
   0x00007ffff4e65b9d <+78>:    movsx  rsi,BYTE PTR [r13+r8*1+0x1]
   0x00007ffff4e65ba3 <+84>:    mov    QWORD PTR [rbp+rsi*8+0x0],rdx
   0x00007ffff4e65ba8 <+89>:    add    r8,0x3
   0x00007ffff4e65bac <+93>:    movzx  eax,BYTE PTR [r13+r8*1+0x0]
   0x00007ffff4e65bb2 <+99>:    mov    rsi,QWORD PTR [rip+0x2e2264f]        # 0x7ffff7c88208
   0x00007ffff4e65bb9 <+106>:   jmp    QWORD PTR [rsi+rax*8]
   0x00007ffff4e65bbc <+109>:   int3   
   0x00007ffff4e65bbd <+110>:   int3   
   0x00007ffff4e65bbe <+111>:   add    al,0x2
   0x00007ffff4e65bc0 <+113>:   add    BYTE PTR [rax],al
End of assembler dump.

Fast Path/Slow Path

An important aspect of the LLInt is the concept and implementation of fast and slow paths. The LLInt, by design, is meant to execute code with as little latency as possible. The code it executes, as seen earlier in this blog post, is machine code (e.g. x86 assembly) that implements bytecode operations. However, when executing bytecode the LLInt needs to determine the types of the operands it receives with an opcode in order to pick the right execution path. For example, consider the following js code:

let x = 10; 
let y = 20;
let z = x+y;

When the LLInt executes the add opcode, it will check if the operands passed to it (i.e. x and y) are integers and, if so, it can implement the addition operation directly in machine code. Now consider the following js code:

let x = 10;
let y = {a : "Ten"};
let z = x+y; 

The LLInt can no longer perform the addition directly in machine code since the types of x and y are different. In this instance, the LLInt will take the slow path of execution, which is a call to C++ code that handles cases where there is an operand type mismatch. This is a simplified explanation of how fast and slow paths work, and the add opcode in particular has several other checks on its operands in addition to integer checks.
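Running the second snippet in any engine shows what the slow path ultimately has to produce: with an object operand, + falls back to the generic ToPrimitive and string-concatenation semantics rather than integer addition:

```javascript
let x = 10;
let y = { a: "Ten" };

// The object is converted via ToPrimitive (its toString here), so the +
// operator performs string concatenation instead of numeric addition.
let z = x + y;
console.log(z); // "10[object Object]"
```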

Additionally, when the LLInt needs to call into C++ code, it makes a call to a slow path which is essentially a trampoline into C++. If you’ve been debugging along, you may have noticed calls to callSlowPath or cCall2 while stepping through execution in the LLInt, most of which have been calls to the tracing function, which is implemented in C++.

Let’s now attempt to debug execution in a slow path. For this exercise the following js program is used:

let x = 10;
let y = "Ten";
x === y;

Which generates the following bytecode dump:

bb#1
[   0] enter              
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps        
[   7] mov                loc6, Undefined(const0)
[  10] resolve_scope      loc7, loc4, 0, GlobalProperty, 0
[  17] put_to_scope       loc7, 0, Int32: 10(const1), 1048576<DoNotThrowIfNotFound|GlobalProperty|Initialization|NotStrictMode>, 0, 0
[  25] resolve_scope      loc7, loc4, 1, GlobalProperty, 0
[  32] put_to_scope       loc7, 1, String (atomic),8Bit:(1),length:(3): Ten, StructureID: 22247(const2), 1048576<DoNotThrowIfNotFound|GlobalProperty|Initialization|NotStrictMode>, 0, 0
[  40] mov                loc6, Undefined(const0)
[  43] resolve_scope      loc7, loc4, 0, GlobalProperty, 0
[  50] get_from_scope     loc8, loc7, 0, 2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, 0, 0
[  58] mov                loc7, loc8
[  61] resolve_scope      loc8, loc4, 1, GlobalProperty, 0
[  68] get_from_scope     loc9, loc8, 1, 2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, 0, 0
[  76] stricteq           loc6, loc7, loc9
[  80] end                loc6
Successors: [ ]

Identifiers:
  id0 = x
  id1 = y

Constants:
   k0 = Undefined
   k1 = Int32: 10: in source as integer
   k2 = String (atomic),8Bit:(1),length:(3): Ten, StructureID: 55445

The bytecode of interest is the stricteq opcode at bc#76. This opcode is defined in LowLevelInterpreter64.asm as follows:

macro strictEqOp(opcodeName, opcodeStruct, createBoolean)
    llintOpWithReturn(op_%opcodeName%, opcodeStruct, macro (size, get, dispatch, return)
        get(m_rhs, t0)
        get(m_lhs, t2)
        loadConstantOrVariable(size, t0, t1)
        loadConstantOrVariable(size, t2, t0)

        # At a high level we do
        # If (left is Double || right is Double)
        #     goto slowPath;
        # result = (left == right);
        # if (result)
        #     goto done;
        # if (left is Cell || right is Cell)
        #     goto slowPath;
        # done:
        # return result;

        # This fragment implements (left is Double || right is Double), with a single branch instead of the 4 that would be naively required if we used branchIfInt32/branchIfNumber
        # The trick is that if a JSValue is an Int32, then adding 1<<49 to it will make it overflow, leaving all high bits at 0
        # If it is not a number at all, then 1<<49 will be its only high bit set
        # Leaving only doubles above or equal 1<<50.
        move t0, t2
        move t1, t3
        move LowestOfHighBits, t5
        addq t5, t2
        addq t5, t3
        orq t2, t3
        lshiftq 1, t5
        bqaeq t3, t5, .slow

        cqeq t0, t1, t5
        btqnz t5, t5, .done #is there a better way of checking t5 != 0 ?

        move t0, t2
        # This andq could be an 'or' if not for BigInt32 (since it makes it possible for a Cell to be strictEqual to a non-Cell)
        andq t1, t2
        btqz t2, notCellMask, .slow

    .done:
        createBoolean(t5)
        return(t5)

    .slow:
        callSlowPath(_slow_path_%opcodeName%)
        dispatch()
    end)
end
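The single-branch double check in the comments above can be sketched with BigInt arithmetic. The constants below follow the comments in the macro (doubles encoded as their IEEE-754 bit pattern plus 2^49) and assume an Int32 tag of 0xFFFE000000000000 in the high bits; treat this as an illustration of the trick, not the exact engine encoding:

```javascript
const MASK64 = (1n << 64n) - 1n;           // wrap to 64 bits like the CPU does
const LOWEST_OF_HIGH_BITS = 1n << 49n;     // LowestOfHighBits in the macro
const NUMBER_TAG = 0xFFFE000000000000n;    // assumed Int32 tag bits

// Hypothetical encode helpers, for illustration only.
function encodeInt32(i) {
    return NUMBER_TAG | BigInt(i >>> 0);   // tag | zero-extended 32-bit value
}
function encodeDouble(d) {
    const view = new DataView(new ArrayBuffer(8));
    view.setFloat64(0, d);
    return (view.getBigUint64(0) + LOWEST_OF_HIGH_BITS) & MASK64;
}

// Adding 2^49: Int32s overflow to small values, non-numbers gain 2^49 as
// their only high bit, and only doubles end up at or above 2^50.
function isDoubleEncoded(v) {
    return ((v + LOWEST_OF_HIGH_BITS) & MASK64) >= (1n << 50n);
}

console.log(isDoubleEncoded(encodeInt32(42)));   // false
console.log(isDoubleEncoded(encodeDouble(1.5))); // true
console.log(isDoubleEncoded(0x7n));              // false (a non-number immediate)
```

The assembly applies this to both operands at once by OR-ing the two adjusted values and taking a single branch against 1<<50.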

One can set a breakpoint at this macro definition and step through the execution of this opcode. There are two reasons to pick this particular opcode: one being that its execution paths are simple to follow and don’t introduce unnecessary complexity, and the second being that it comes with helpful developer comments to help the reader follow along.

Stepping through the execution, observe that the checks for numbers (i.e. integers and doubles) fail and execution control ends up in the section that checks for JSCell headers. The rhs of the stricteq operation passes the isCell check and as a result execution jumps to the label .slow which calls the slow path:

//... truncated for brevity
        
        move t0, t2
        # This andq could be an 'or' if not for BigInt32 (since it makes it possible for a Cell to be strictEqual to a non-Cell)
        andq t1, t2
        btqz t2, notCellMask, .slow

 //... truncated for brevity

    .slow:
        callSlowPath(_slow_path_%opcodeName%)

//... truncated for brevity

Stepping into the call to callSlowPath leads to the C++ implementation of _slow_path_stricteq defined in CommonSlowPaths.cpp:

JSC_DEFINE_COMMON_SLOW_PATH(slow_path_stricteq)
{
    BEGIN();
    auto bytecode = pc->as<OpStricteq>();
    RETURN(jsBoolean(JSValue::strictEqual(globalObject, GET_C(bytecode.m_lhs).jsValue(), GET_C(bytecode.m_rhs).jsValue())));
}

This stub function retrieves the operand values and passes them to the function JSValue::strictEqual, which is defined as follows:

inline bool JSValue::strictEqual(JSGlobalObject* globalObject, JSValue v1, JSValue v2)
{
    if (v1.isInt32() && v2.isInt32())
        return v1 == v2;

    if (v1.isNumber() && v2.isNumber())
        return v1.asNumber() == v2.asNumber();

#if USE(BIGINT32)
    if (v1.isHeapBigInt() && v2.isBigInt32())
        return v1.asHeapBigInt()->equalsToInt32(v2.bigInt32AsInt32());
    if (v1.isBigInt32() && v2.isHeapBigInt())
        return v2.asHeapBigInt()->equalsToInt32(v1.bigInt32AsInt32());
#endif

    if (v1.isCell() && v2.isCell())
        return strictEqualForCells(globalObject, v1.asCell(), v2.asCell());

    return v1 == v2;
}
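A few observable consequences of the ordering above can be checked directly from JavaScript: the number paths compare numeric values (so +0 and -0 compare equal and NaN never equals itself), while strings are cells whose contents are compared:

```javascript
console.log(0 === -0);        // true  (compared as numbers)
console.log(NaN === NaN);     // false (IEEE-754 comparison)
console.log("Ten" === "Ten"); // true  (cell contents compared)
console.log(10 === "Ten");    // false (an Int32 and a cell are never strictly equal)
```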

This function is responsible for performing all the various slower checks to determine if the lhs is strictly equal to the rhs. This concludes our discussion on the fast path/slow path pattern that’s common across all JIT tiers in JavaScriptCore.

Tiering Up

When bytecode has been executed a certain number of times in the LLInt, it gets profiled as warm code. Once an execution threshold is reached, the warm code is considered hot and the LLInt can tier up to a higher JIT tier; in this case JavaScriptCore would tier up into the Baseline JIT. The graph reproduced4 below shows a timeline of how the tiering up process functions:

pipeline-execution

The Control section of the WebKit blog3 describes the three main heuristics that are used by the various JIT tiers to determine thresholds for tiering up. These heuristics are execution counts for function calls and loop executions, exit counts which track the number of times a function compiled by an optimising JIT tier exits to a lower tier, and recompilation counts which keep track of the number of times a function is jettisoned to a lower tier.

The LLInt mainly uses execution counts to determine if a function or loop is hot and if the execution of this code should be tiered up. The execution counter in the LLInt utilises the following rules3 to calculate the threshold to tier up:

  • Each call to the function adds 15 points to the execution counter.
  • Each loop execution adds 1 point to the execution counter.
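Under these rules, the default threshold of 500 points is reached after ceil(500/15) = 34 function calls, or 500 loop iterations. A small sketch of the arithmetic:

```javascript
// Points accumulate per event until they cancel out the (negative) starting
// counter; the constants used below are the ones from the rules above.
function eventsUntilTierUp(threshold, pointsPerEvent) {
    let counter = -threshold;   // counter starts at -threshold
    let events = 0;
    while (counter < 0) {
        counter += pointsPerEvent;
        events++;
    }
    return events;
}

console.log(eventsUntilTierUp(500, 15)); // 34 function calls
console.log(eventsUntilTierUp(500, 1));  // 500 loop executions
```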

There are two threshold values used by the LLInt for execution counting. The static value of 500 points is used when no other information about the bytecode’s execution or JIT status has been captured. As the bytecode executes in the LLInt and tiers up and down, the engine generates a dynamic profile for the threshold value. The excerpt below3 describes how dynamic threshold counts are determined in the LLInt:

Over the years we’ve found ways to dynamically adjust these thresholds based on other sources of information, like:

  • Whether the function got JITed the last time we encountered it (according to our cache). Let’s call this wasJITed.
  • How big the function is. Let’s call this S. We use the number of bytecode opcodes plus operands as the size.
  • How many times it has been recompiled. Let’s call this R.
  • How much executable memory is available. Let’s use M to say how much executable memory we have total, and U is the amount we estimate that we would use (total) if we compiled this function.
  • Whether profiling is “full” enough.

We select the LLInt→Baseline threshold based on wasJITed. If we don’t know (the function wasn’t in the cache) then we use the basic threshold, 500. Otherwise, if the function wasJITed then we use 250 (to accelerate tier-up) otherwise we use 2000.
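The threshold selection in the excerpt can be condensed into a small helper. The constants (500, 250, 2000) come straight from the quoted text; `wasJITed` being undefined stands in for “the function wasn’t in the cache”:

```javascript
// Sketch of the LLInt→Baseline threshold selection described above.
function llintToBaselineThreshold(wasJITed) {
    if (wasJITed === undefined)
        return 500;               // not in the cache: basic threshold
    return wasJITed ? 250 : 2000; // accelerate tier-up, or back off
}

console.log(llintToBaselineThreshold(undefined)); // 500
console.log(llintToBaselineThreshold(true));      // 250
console.log(llintToBaselineThreshold(false));     // 2000
```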

The values of S, R, M and U aren’t used by the LLInt to calculate a dynamic threshold for tiering up but they will become relevant when exploring the optimising tiers later on in this blog series. The static execution counter thresholds are defined in OptionsList.h. The snippet below shows the values for LLInt→Baseline thresholds:

v(Int32, thresholdForJITAfterWarmUp, 500, Normal, nullptr) \
v(Int32, thresholdForJITSoon, 100, Normal, nullptr) \
\
//... code truncated for brevity
v(Int32, executionCounterIncrementForLoop, 1, Normal, nullptr) \
v(Int32, executionCounterIncrementForEntry, 15, Normal, nullptr) \

The execution counters that track these values are defined in ExecutionCounter.h. The snippet below shows the three key counters that are referenced and updated by the LLInt.

    // This counter is incremented by the JIT or LLInt. It starts out negative and is
    // counted up until it becomes non-negative. At the start of a counting period,
    // the threshold we wish to reach is m_totalCount + m_counter, in the sense that
    // we will add X to m_totalCount and subtract X from m_counter.
    int32_t m_counter;

    // Counts the total number of executions we have seen plus the ones we've set a
    // threshold for in m_counter. Because m_counter's threshold is negative, the
    // total number of actual executions can always be computed as m_totalCount +
    // m_counter.
    float m_totalCount;

    // This is the threshold we were originally targeting, without any correction for
    // the memory usage heuristics.
    int32_t m_activeThreshold;
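The bookkeeping described in these comments can be modelled in a few lines. This is a hypothetical model of the invariant only, not the engine’s code: setting a threshold X moves X from m_counter into m_totalCount, so m_totalCount + m_counter always equals the number of executions actually seen:

```javascript
const counter = { m_counter: 0, m_totalCount: 0 };

function setThreshold(c, x) {          // start a counting period targeting x
    c.m_totalCount += x;
    c.m_counter -= x;
}
function recordExecutions(c, n) {      // the JIT/LLInt counts m_counter up
    c.m_counter += n;
}
function executionsSeen(c) {           // invariant from the comments above
    return c.m_totalCount + c.m_counter;
}

setThreshold(counter, 500);
recordExecutions(counter, 510);        // e.g. 34 calls at 15 points each
console.log(counter.m_counter >= 0);   // true: threshold reached
console.log(executionsSeen(counter));  // 510
```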

Each CodeBlock parsed by the engine instantiates two ExecutionCounter objects. These are the m_llintExecuteCounter and the m_jitExecuteCounter. The m_llintExecuteCounter is most relevant for this blog post as it determines the threshold to tier up into the Baseline JIT.

    BaselineExecutionCounter m_llintExecuteCounter;

    BaselineExecutionCounter m_jitExecuteCounter;

With this understanding of how thresholds work, let’s trace the behaviour in the code base. To begin, enable the Baseline JIT in launch.json to allow the LLInt to tier up while ensuring that the optimising tiers remain disabled. This is done by removing the --useJIT=false flag and adding the --useDFGJIT=false flag to the commandline arguments. The launch.json should look as follows:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/jsc",
            "args": ["--reportCompileTimes=true", "--dumpGeneratedBytecodes=true", "--useDFGJIT=false", "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/test.js"],
            //... truncated for brevity
        }
    ]
}

In addition, add the --reportCompileTimes=true flag to log a notification to stdout when a CodeBlock is compiled by the Baseline JIT and other tiers. Now that the debugging environment has been updated, let’s create the following test script to trigger Baseline compilation:

$ cat test.js

function jitMe(x,y){
    return x+y;
}

let x = 1;

for(let y = 0; y < 300; y++){
    jitMe(x,y)
}

In the javascript program above, the goal is to execute the function jitMe over several iterations of the for-loop in order for it to be optimised by the Baseline JIT. The LLInt determines when a function/codeblock should be optimised with a call to the macro checkSwitchToJIT:

macro checkSwitchToJIT(increment, action)
    loadp CodeBlock[cfr], t0
    baddis increment, CodeBlock::m_llintExecuteCounter + BaselineExecutionCounter::m_counter[t0], .continue
    action()
    .continue:
end

Setting a breakpoint at this macro allows examining the counter values in the debugger. Pausing execution at this breakpoint and listing the instructions produces the output shown below:

Thread 1 "jsc" hit Breakpoint 3, llint_op_ret () at /home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter.asm:1273
1273        baddis increment, CodeBlock::m_llintExecuteCounter + BaselineExecutionCounter::m_counter[t0], .continue
-exec x/4i $rip
=> 0x7ffff4e71e1c <llint_op_ret+47>:    add    DWORD PTR [rax+0xe8],0xa
   0x7ffff4e71e23 <llint_op_ret+54>:    js     0x7ffff4e71e50 <llint_op_ret+99>
   0x7ffff4e71e25 <llint_op_ret+56>:    add    r8,r13
   0x7ffff4e71e28 <llint_op_ret+59>:    mov    rdi,rbp

The memory address pointed to by rax+0xe8 is the value of m_counter, which starts at -500 and is incremented here by 10 (0xa); together with the points added at the function prologue this makes up the 15 points per call. When this value becomes non-negative, it triggers Baseline optimisation with the call to action. Allowing the program to continue execution in our debugger and run to completion generates the following output:

<global>#CLzrku:[0x7fffae3c4000->0x7fffeedcb768, NoneGlobal, 116]: 28 instructions (0 16-bit instructions, 0 32-bit instructions, 10 instructions with metadata); 236 bytes (120 metadata bytes); 1 parameter(s); 18 callee register(s); 6 variable(s); scope at loc4

bb#1
[   0] enter              
[   1] get_scope          loc4
#... truncated for brevity

jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, NoneFunctionCall, 15]: 6 instructions (0 16-bit instructions, 0 32-bit instructions, 1 instructions with metadata); 135 bytes (120 metadata bytes); 3 parameter(s); 8 callee register(s); 6 variable(s); scope at loc4

bb#1
[   0] enter              
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps        
[   7] add                loc6, arg1, arg2, OperandTypes(126, 126)
[  13] ret                loc6
Successors: [ ]


Optimized jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15] with Baseline JIT into 960 bytes in 1.078797 ms.

As can be seen from the last line of the output above, the function jitMe has been optimised by the Baseline JIT. The next section will explore both compilation and execution in the Baseline JIT.

Baseline JIT

In a nutshell, the Baseline JIT is a template JIT, meaning it emits a canned sequence of machine code specific to each bytecode operation. There are two key factors that give the Baseline JIT a speed up over the LLInt3:

  • Removal of interpreter dispatch. Interpreter dispatch is the costliest part of interpretation, since the indirect branches used for selecting the implementation of an opcode are hard for the CPU to predict. This is the primary reason why Baseline is faster than LLInt.
  • Comprehensive support for polymorphic inline caching. It is possible to do sophisticated inline caching in an interpreter, but currently our best inline caching implementation is the one shared by the JITs.

The following sections will trace how the execution thresholds are reached, how execution transitions from LLInt to the Baseline JIT code via OSR (On Stack Replacement) and the assembly emitted by the Baseline JIT.

Implementation

The majority of the code for the Baseline JIT can be found under JavaScriptCore/jit. The JIT ABI is defined in JIT.h, which is a key item to review as part of the Baseline JIT as it defines the various optimised templates for opcodes.

The Baseline JIT templates call assemblers defined in JavaScriptCore/assembler to emit machine code for the target architecture. For example, the assemblers used to emit machine code for x86-64 can be found in MacroAssemblerX86_64.h and X86Assembler.h.

Tracing Execution

To enhance tracing in the Baseline JIT one can enable the --verboseOSR=true commandline flag in launch.json. This flag enables printing of useful information on the stages of optimisation from the LLInt to the Baseline JIT, the key statistic being the threshold counts for tiering up. Here’s an example of what the output with verboseOSR enabled would look like when executing our test script from the previous section:

#... truncated for brevity

Installing <global>#CLzrku:[0x7fffae3c4000->0x7fffeedcb768, LLIntGlobal, 116]
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, NoneFunctionCall, 15]: Optimizing after warm-up.
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, NoneFunctionCall, 15]: bytecode cost is 15.000000, scaling execution counter by 1.072115 * 1
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, NoneFunctionCall, 15]: 6 instructions (0 16-bit instructions, 0 32-bit instructions, 1 instructions with metadata); 135 bytes (120 metadata bytes); 3 parameter(s); 8 callee register(s); 6 variable(s); scope at loc4

bb#1
[   0] enter              
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps        
[   7] add                loc6, arg1, arg2, OperandTypes(126, 126)
[  13] ret                loc6
Successors: [ ]


Installing jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15]
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15]: Entered entry_osr_function_for_call with executeCounter = 500.001038/500.000000, 0
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15]: Entered replace with executeCounter = 100.000214/100.000000, 0
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15]: Entered replace with executeCounter = 105.000214/100.000000, 5
jitMe#AQcl4Q:[0x7fffae3c4130->0x7fffae3e5100, LLIntFunctionCall, 15]: Entered replace with executeCounter = 105.000214/100.000000, 5

#... truncated for brevity

Optimized jitMe#AQcl4Q:[0x7f85f7ac4130->0x7f85f7ae5100, LLIntFunctionCall, 15] with Baseline JIT into 960 bytes in 0.810440 ms.
jitMe#AQcl4Q:[0x7f85f7ac4130->0x7f85f7ae5100, LLIntFunctionCall, 15]: Entered replace with executeCounter = 105.000351/100.000000, 5
    JIT compilation successful.
Installing jitMe#AQcl4Q:[0x7f85f7ac4130->0x7f85f7ae5100, BaselineFunctionCall, 15]
    Code was already compiled.

Compiling the CodeBlock is a concurrent process and JavaScriptCore spawns a JITWorker thread to begin compiling the codeblock with the Baseline JIT. In the interest of simplifying the debugging process, disable concurrent compilation and force compilation to occur on the main jsc thread. To do this, add the --useConcurrentJIT=false flag to launch.json or on the commandline.

Additionally, JavaScriptCore provides two useful flags that allow adjusting the JIT compilation threshold counters: --thresholdForJITSoon and --thresholdForJITAfterWarmUp. Adding the flag --thresholdForJITAfterWarmUp=10 reduces the static threshold count to initiate Baseline JIT optimisation from the default JITAfterWarmUp value of 500 to 10. If the engine determines that the codeblock was JIT compiled previously, it will use the default JITSoon threshold of 100, which --thresholdForJITSoon=10 likewise reduces to 10.

Our launch.json should now look as follows:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/jsc",
            "args": ["--reportCompileTimes=true", "--dumpGeneratedBytecodes=true", "--useDFGJIT=false", "--verboseOSR=true", "--useConcurrentJIT=false", "--thresholdForJITAfterWarmUp=10", "--thresholdForJITSoon=10", "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/test.js"],
            //... truncated for brevity
        }
    ]
}

With these additional flags, let’s now attempt to trace the optimisation of the following test program:

$ cat test.js

for(let x = 0; x < 5; x++){
    let y = x+10;
}

The bytecodes generated for this program are listed below:

<global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, NoneGlobal, 43]: 16 instructions (0 16-bit instructions, 0 32-bit instructions, 2 instructions with metadata); 163 bytes (120 metadata bytes); 1 parameter(s); 10 callee register(s); 6 variable(s); scope at loc4

bb#1
[   0] enter              
[   1] get_scope          loc4
[   3] mov                loc5, loc4
[   6] check_traps        
[   7] mov                loc6, Undefined(const0)
[  10] mov                loc6, Undefined(const0)
[  13] mov                loc7, <JSValue()>(const1)
[  16] mov                loc7, Int32: 0(const2)
[  19] jnless             loc7, Int32: 5(const3), 22(->41)
Successors: [ #3 #2 ]

bb#2
[  23] loop_hint          
[  24] check_traps        
[  25] mov                loc8, <JSValue()>(const1)
[  28] add                loc8, loc7, Int32: 10(const4), OperandTypes(126, 3)
[  34] inc                loc7
[  37] jless              loc7, Int32: 5(const3), -14(->23)
Successors: [ #2 #3 ]

bb#3
[  41] end                loc6
Successors: [ ]


Constants:
   k0 = Undefined
   k1 = <JSValue()>
   k2 = Int32: 0: in source as integer
   k3 = Int32: 5: in source as integer
   k4 = Int32: 10: in source as integer

The opcode loop_hint in basic block bb#2 is responsible for incrementing the JIT threshold counters, initiating compilation by the Baseline JIT if the execution threshold is breached and performing OSR entry. The loop_hint opcode is defined in LowLevelInterpreter.asm and essentially calls the macro checkSwitchToJITForLoop to determine if an OSR is required.

macro checkSwitchToJITForLoop()
    checkSwitchToJIT(
        1,
        macro()
            storePC()
            prepareStateForCCall()
            move cfr, a0
            move PC, a1
            cCall2(_llint_loop_osr)
            btpz r0, .recover
            move r1, sp
            jmp r0, JSEntryPtrTag
        .recover:
            loadPC()
        end)
end
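
Conceptually, the threshold check performed by these macros can be modeled as a simple execution counter that trips once code is detected as hot. The sketch below is an illustrative model only; the names and threshold value are hypothetical and greatly simplified compared to JSC's actual counter logic:

```python
# Simplified model of the LLInt's tier-up check: each loop iteration bumps an
# execution counter, and once the counter crosses the threshold the engine
# attempts Baseline JIT compilation and OSR entry. Names and values here are
# illustrative only.
class ExecutionCounter:
    def __init__(self, threshold):
        self.count = 0
        self.threshold = threshold  # e.g. set via --thresholdForJITAfterWarmUp

    def check_switch_to_jit(self, increment):
        """Returns True when the hot-code threshold is crossed."""
        self.count += increment
        return self.count >= self.threshold

counter = ExecutionCounter(threshold=10)
osr_at = None
for iteration in range(20):              # model of the interpreted loop
    if counter.check_switch_to_jit(1) and osr_at is None:
        osr_at = iteration               # here the LLInt would call _llint_loop_osr

print(osr_at)  # 9 -> the counter trips on the 10th increment
```

With the --thresholdForJITAfterWarmUp=10 flag used earlier, the tenth execution of loop_hint would similarly trip the counter and trigger the OSR attempt.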

The macro checkSwitchToJIT, as seen in the previous section, determines if the JIT threshold has been breached and performs a slow path call to _llint_loop_osr. This slow path, loop_osr, is defined in LLIntSlowPaths.cpp and is listed below:

LLINT_SLOW_PATH_DECL(loop_osr)
{
    //... code truncated for brevity
        
    auto loopOSREntryBytecodeIndex = BytecodeIndex(codeBlock->bytecodeOffset(pc));

    //... code truncated for brevity
    
    if (!jitCompileAndSetHeuristics(vm, codeBlock, loopOSREntryBytecodeIndex))  <-- Compilation with Baseline JIT
        LLINT_RETURN_TWO(nullptr, nullptr);
    
    //... code truncated for brevity

    const JITCodeMap& codeMap = codeBlock->jitCodeMap();
    CodeLocationLabel<JSEntryPtrTag> codeLocation = codeMap.find(loopOSREntryBytecodeIndex); <-- Retrieve location of the compiled code
    ASSERT(codeLocation);

    void* jumpTarget = codeLocation.executableAddress();
    ASSERT(jumpTarget);
    
    LLINT_RETURN_TWO(jumpTarget, callFrame->topOfFrame()); <-- Perform OSR to the location of the compiled code
}

As the truncated snippet above indicates, the function will first compile codeBlock with the call to jitCompileAndSetHeuristics and, if the compilation succeeds, jump to the target address of the compiled code and resume execution. In addition to loop_osr there are additional flavours of OSR supported by the LLInt. These are entry_osr, entry_osr_function_for_call, entry_osr_function_for_construct, entry_osr_function_for_call_arityCheck and entry_osr_function_for_construct_arityCheck, which essentially perform the same function as loop_osr and are defined in LLIntSlowPaths.cpp.

Compilation

Let’s now examine how codeblock compilation works by stepping through the function jitCompileAndSetHeuristics:

inline bool jitCompileAndSetHeuristics(VM& vm, CodeBlock* codeBlock, BytecodeIndex loopOSREntryBytecodeIndex = BytecodeIndex(0))
{
    //... code truncated for brevity
    
    JITWorklist::ensureGlobalWorklist().poll(vm);
    
    switch (codeBlock->jitType()) {
    case JITType::BaselineJIT: {
        dataLogLnIf(Options::verboseOSR(), "    Code was already compiled.");
        codeBlock->jitSoon();
        return true;
    }
    case JITType::InterpreterThunk: {
        JITWorklist::ensureGlobalWorklist().compileLater(codeBlock, loopOSREntryBytecodeIndex);
        return codeBlock->jitType() == JITType::BaselineJIT;
    }
    default:
        dataLog("Unexpected code block in LLInt: ", *codeBlock, "\n");
        RELEASE_ASSERT_NOT_REACHED();
        return false;
    }
}

The function performs a simple check to determine if the supplied codeBlock needs to be JIT compiled and, if compilation is required, schedules it with the call to JITWorklist::compileLater:

JITWorklist::ensureGlobalWorklist().compileLater(codeBlock, loopOSREntryBytecodeIndex);
return codeBlock->jitType() == JITType::BaselineJIT; 

Since concurrent JIT is disabled (from adding the --useConcurrentJIT=false flag), the function JITWorklist::compileLater calls Plan::compileNow to initiate compilation on the main jsc thread:

static void compileNow(CodeBlock* codeBlock, BytecodeIndex loopOSREntryBytecodeIndex)
{
    Plan plan(codeBlock, loopOSREntryBytecodeIndex);
    plan.compileInThread();
    plan.finalize();
}

The function Plan::compileInThread ends up calling JIT::compileWithoutLinking, which compiles the codeblock by utilising the MacroAssemblers to emit machine code for each bytecode in the instruction stream. The function compileWithoutLinking is listed below with unimportant code truncated:

void JIT::compileWithoutLinking(JITCompilationEffort effort)
{
    //... code truncated for brevity
    
    m_pcToCodeOriginMapBuilder.appendItem(label(), CodeOrigin(BytecodeIndex(0)));

    //... code truncated for brevity

    emitFunctionPrologue();
    emitPutToCallFrameHeader(m_codeBlock, CallFrameSlot::codeBlock);

    //... code truncated for brevity

    move(regT1, stackPointerRegister);
    checkStackPointerAlignment();

    emitSaveCalleeSaves();
    emitMaterializeTagCheckRegisters();

    //... code truncated for brevity

    privateCompileMainPass();
    privateCompileLinkPass();
    privateCompileSlowCases();
    
    //... code truncated for brevity

    m_bytecodeIndex = BytecodeIndex(0);

    //... code truncated for brevity
    
    privateCompileExceptionHandlers();
    
    //... code truncated for brevity

    m_pcToCodeOriginMapBuilder.appendItem(label(), PCToCodeOriginMapBuilder::defaultCodeOrigin());

    m_linkBuffer = std::unique_ptr<LinkBuffer>(new LinkBuffer(*this, m_codeBlock, effort));

    //... code truncated for brevity
}

The first few function calls, from emitFunctionPrologue() up until emitMaterializeTagCheckRegisters(), emit machine code for the stack management routines included in the Baseline JIT compiled code.

A handy setting to enable in the codebase, to allow tracing of the various compilation passes, is the JITInternal::verbose flag.

namespace JITInternal {
static constexpr const bool verbose = true;
}

With this flag enabled, each bytecode being compiled is logged to stdout. The output should look similar to the snippet below:

Compiling <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, NoneGlobal, 43]
Baseline JIT emitting code for bc#0 at offset 168
At 0: 0
Baseline JIT emitting code for bc#1 at offset 294
At 1: 0
Baseline JIT emitting code for bc#3 at offset 306
At 3: 0
Baseline JIT emitting code for bc#6 at offset 314
//... truncated for brevity 

The first interesting function call is privateCompileMainPass().

void JIT::privateCompileMainPass()
{
    //... truncated for brevity

    auto& instructions = m_codeBlock->instructions();
    unsigned instructionCount = m_codeBlock->instructions().size();

    m_callLinkInfoIndex = 0;

    VM& vm = m_codeBlock->vm();
    BytecodeIndex startBytecodeIndex(0);
    
    //... code truncated for brevity

    m_bytecodeCountHavingSlowCase = 0;
    for (m_bytecodeIndex = BytecodeIndex(0); m_bytecodeIndex.offset() < instructionCount; ) {
        unsigned previousSlowCasesSize = m_slowCases.size();
        if (m_bytecodeIndex == startBytecodeIndex && startBytecodeIndex.offset() > 0) {
            // We've proven all bytecode instructions up until here are unreachable.
            // Let's ensure that by crashing if it's ever hit.
            breakpoint();
        }

        //... code truncated for brevity
        const Instruction* currentInstruction = instructions.at(m_bytecodeIndex).ptr();

        //... code truncated for brevity
        
        OpcodeID opcodeID = currentInstruction->opcodeID();

        //... code truncated for brevity

        unsigned bytecodeOffset = m_bytecodeIndex.offset();
        
        //... code truncated for brevity

        switch (opcodeID) {
        
        //... code truncated for brevity

        DEFINE_OP(op_del_by_id)
        DEFINE_OP(op_del_by_val)
        DEFINE_OP(op_div)
        DEFINE_OP(op_end)
        DEFINE_OP(op_enter)
        DEFINE_OP(op_get_scope)
        
        //... code truncated for brevity

        DEFINE_OP(op_lshift)
        DEFINE_OP(op_mod)
        DEFINE_OP(op_mov)
        //... code truncated for brevity
        default:
            RELEASE_ASSERT_NOT_REACHED();
        }

        //... code truncated for brevity
    }
}

The function loops over the bytecodes and calls the relevant JIT emitter for each opcode. For example, the first bytecode to be evaluated is op_enter which, via the switch case DEFINE_OP(op_enter), calls the function JIT::emit_op_enter. Let's trace the mov opcode at bc#3, which moves the value in loc4 into loc5:

[   3] mov                loc5, loc4

Setting a breakpoint at DEFINE_OP(op_mov) and stepping into the function call leads to JIT::emit_op_mov:

void JIT::emit_op_mov(const Instruction* currentInstruction)
{
    auto bytecode = currentInstruction->as<OpMov>();
    VirtualRegister dst = bytecode.m_dst;
    VirtualRegister src = bytecode.m_src;

    if (src.isConstant()) {
        JSValue value = m_codeBlock->getConstant(src);
        if (!value.isNumber())
            store64(TrustedImm64(JSValue::encode(value)), addressFor(dst));
        else
            store64(Imm64(JSValue::encode(value)), addressFor(dst));
        return;
    }

    load64(addressFor(src), regT0);
    store64(regT0, addressFor(dst));
}

The functions load64 and store64 are defined in assembler/MacroAssemblerX86_64.h. They call an assembler that is responsible for emitting machine code for the operation. Let's examine the following load64 call:

void load64(ImplicitAddress address, RegisterID dest)
{
    m_assembler.movq_mr(address.offset, address.base, dest);
}

The function movq_mr is defined in X86Assembler.h as follows:

void movq_mr(int offset, RegisterID base, RegisterID dst)
{
    m_formatter.oneByteOp64(OP_MOV_GvEv, dst, base, offset);
}

The function oneByteOp64, listed below, finally generates the machine code that gets written to an instruction buffer:

void oneByteOp64(OneByteOpcodeID opcode, int reg, RegisterID base, int offset)
{
    SingleInstructionBufferWriter writer(m_buffer);
    writer.emitRexW(reg, 0, base);
    writer.putByteUnchecked(opcode);
    writer.memoryModRM(reg, base, offset);
}
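
To make the emitted bytes concrete, the following standalone sketch (purely illustrative, not JSC code) reproduces the byte layout oneByteOp64 produces for the disp8 addressing form: a REX.W prefix, the one-byte opcode, and a ModRM byte followed by the displacement. It only models general-purpose registers and omits the SIB-byte cases the real assembler handles:

```python
# Minimal model of X86Assembler's oneByteOp64 for the [base + disp8] form.
# Registers are numbered as in x86-64: rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5...
OP_MOV_GvEv = 0x8B  # mov r64, r/m64 (the opcode used by movq_mr)

def one_byte_op64(opcode, reg, base, offset):
    assert -128 <= offset <= 127 and base not in (4, 12)  # skip SIB cases
    rex_w = 0x48 | ((reg >> 3) << 2) | (base >> 3)        # REX.W plus R/B extension bits
    modrm = (0b01 << 6) | ((reg & 7) << 3) | (base & 7)   # mod=01: [base + disp8]
    return bytes([rex_w, opcode, modrm, offset & 0xFF])

# load64(0x10(%rbp), rax)  ->  movq 0x10(%rbp), %rax
code = one_byte_op64(OP_MOV_GvEv, reg=0, base=5, offset=0x10)
print(code.hex())  # 488b4510
```

Encoding load64 of 0x10(%rbp) into rax this way yields the bytes 48 8b 45 10, i.e. movq 0x10(%rbp), %rax.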

In this fashion, each bytecode is processed by the function JIT::privateCompileMainPass to emit Baseline JIT optimised machine code. The second function to consider is JIT::privateCompileLinkPass, which is responsible for adjusting the jump table to ensure the optimised bytecodes reach the right execution branches (e.g. labels):

void JIT::privateCompileLinkPass()
{
    unsigned jmpTableCount = m_jmpTable.size();
    for (unsigned i = 0; i < jmpTableCount; ++i)
        m_jmpTable[i].from.linkTo(m_labels[m_jmpTable[i].toBytecodeOffset], this);
    m_jmpTable.clear();
}
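
The relinking above amounts to classic back-patching: during the main pass each jump records the offset of its placeholder displacement and the bytecode label it targets, and once every label has a machine-code offset the placeholders are overwritten with real relative displacements. The sketch below is an illustrative model of that idea, not JSC's actual jump-table data structures:

```python
# Back-patching model of privateCompileLinkPass: each jump recorded in the
# jump table gets its 4-byte displacement patched once the target label's
# machine-code offset is known. rel32 is relative to the end of the jump.
import struct

code = bytearray(b"\x90" * 8 + b"\xe9\x00\x00\x00\x00" + b"\x90" * 3)
jmp_table = [{"from": 9, "to_bytecode": 2}]   # disp bytes start at offset 9
labels = {2: 16}                              # bytecode index 2 -> code offset 16

for entry in jmp_table:
    disp_offset = entry["from"]
    target = labels[entry["to_bytecode"]]
    rel32 = target - (disp_offset + 4)        # displacement from next instruction
    code[disp_offset:disp_offset + 4] = struct.pack("<i", rel32)

print(code[9:13].hex())  # 03000000 -> the jmp lands 3 bytes past itself
```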

Once the jump table has been re-linked appropriately, the next function of note to be called is JIT::privateCompileSlowCases.

Fast Path/Slow Path

As seen in previous sections when reviewing the LLInt, some opcodes define two types of execution paths: a fast path and a slow path. When compiling bytecodes, the Baseline JIT performs additional optimisations on opcodes that implement fast and slow paths. This compilation phase is performed by the call to JIT::privateCompileSlowCases:

void JIT::privateCompileSlowCases()
{
    m_getByIdIndex = 0;
    m_getByValIndex = 0;
    m_getByIdWithThisIndex = 0;
    m_putByIdIndex = 0;
    m_inByIdIndex = 0;
    m_delByValIndex = 0;
    m_delByIdIndex = 0;
    m_instanceOfIndex = 0;
    m_byValInstructionIndex = 0;
    m_callLinkInfoIndex = 0;

    //... code truncated for brevity
    unsigned bytecodeCountHavingSlowCase = 0;
    for (Vector<SlowCaseEntry>::iterator iter = m_slowCases.begin(); iter != m_slowCases.end();) {
        m_bytecodeIndex = iter->to;

        //... code truncated for brevity

        BytecodeIndex firstTo = m_bytecodeIndex;

        const Instruction* currentInstruction = m_codeBlock->instructions().at(m_bytecodeIndex).ptr();
        
        //... code truncated for brevity

        switch (currentInstruction->opcodeID()) {
        DEFINE_SLOWCASE_OP(op_add)
        DEFINE_SLOWCASE_OP(op_call)
        //... code truncated for brevity
        DEFINE_SLOWCASE_OP(op_jstricteq)
        case op_put_by_val_direct:
        DEFINE_SLOWCASE_OP(op_put_by_val)
        DEFINE_SLOWCASE_OP(op_del_by_val)
        DEFINE_SLOWCASE_OP(op_del_by_id)
        DEFINE_SLOWCASE_OP(op_sub)
        DEFINE_SLOWCASE_OP(op_has_indexed_property)
        DEFINE_SLOWCASE_OP(op_get_from_scope)
        DEFINE_SLOWCASE_OP(op_put_to_scope)

        //... code truncated for brevity
        default:
            RELEASE_ASSERT_NOT_REACHED();
        }

        //... code truncated for brevity

        emitJumpSlowToHot(jump(), 0);
        ++bytecodeCountHavingSlowCase;
    }

    //... code truncated for brevity
}

The function iterates over bytecodes that implement a slow path, emits machine code for each of these opcodes, and updates the jump table as required. Tracing the execution of emitted machine code for opcodes that implement a slow path is left to the reader as an exercise.
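
The fast-path/slow-path split itself can be illustrated with a small model. For an opcode like op_add, the emitted fast path handles the speculated common case (two Int32 operands with no overflow) inline, and anything else bails out to a runtime call; the sketch below mirrors only the shape of that design, not the emitted machine code:

```python
# Illustrative model of the fast-path/slow-path split for an opcode like
# op_add: the fast path handles the common case (two Int32 operands, no
# overflow) inline, and everything else falls through to a slow-path call
# into the runtime.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def slow_path_add(lhs, rhs):
    # Runtime call: handles doubles, strings, objects with valueOf, etc.
    return lhs + rhs

def baseline_add(lhs, rhs):
    if isinstance(lhs, int) and isinstance(rhs, int):   # speculation check
        result = lhs + rhs
        if INT32_MIN <= result <= INT32_MAX:            # overflow check
            return result                               # fast path
    return slow_path_add(lhs, rhs)                      # jump to slow case

print(baseline_add(3, 10))          # 13 (fast path)
print(baseline_add(2**31 - 1, 1))   # 2147483648 (overflow -> slow path)
print(baseline_add("a", "b"))       # ab (non-int operands -> slow path)
```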

Linking

Once the codeblock has been successfully compiled, the next step is to link the machine code emitted by the assembler in order for the LLInt to OSR into the Baseline JIT optimised code. This begins by generating a LinkBuffer for the codeblock:

m_linkBuffer = std::unique_ptr<LinkBuffer>(new LinkBuffer(*this, m_codeBlock, effort));

The developer comments have the following to note about LinkBuffer:

LinkBuffer:

This class assists in linking code generated by the macro assembler, once code generation has been completed, and the code has been copied to its final location in memory. At this time pointers to labels within the code may be resolved, and relative offsets to external addresses may be fixed.

Specifically:

  • Jump objects may be linked to external targets,
  • The address of Jump objects may be taken, such that it can later be relinked.
  • The return address of a Call may be acquired.
  • The address of a Label pointing into the code may be resolved.
  • The value referenced by a DataLabel may be set.

Initialisation of the LinkBuffer eventually leads to a call to LinkBuffer::linkCode, which is listed below:

void LinkBuffer::linkCode(MacroAssembler& macroAssembler, JITCompilationEffort effort)
{
    //... code truncated for brevity
    allocate(macroAssembler, effort);
    if (!m_didAllocate)
        return;
    ASSERT(m_code);
    AssemblerBuffer& buffer = macroAssembler.m_assembler.buffer();
    void* code = m_code.dataLocation();
    //... code truncated for brevity

    performJITMemcpy(code, buffer.data(), buffer.codeSize());

    //.. code truncated for brevity
    m_linkTasks = WTFMove(macroAssembler.m_linkTasks);
}

The key function in the snippet above is performJITMemcpy, which is a wrapper around memcpy. The emitted, unlinked machine code stored in the assembler buffer is copied over to the LinkBuffer, which is pointed to by code:

performJITMemcpy(code, buffer.data(), buffer.codeSize());

Once the LinkBuffer has been populated, the machine code is linked with the call to JIT::link which is invoked from JITWorklist::Plan::finalize:

CompilationResult JIT::link()
{
    LinkBuffer& patchBuffer = *m_linkBuffer;
    
    if (patchBuffer.didFailToAllocate())
        return CompilationFailed;

    // Translate vPC offsets into addresses in JIT generated code, for switch tables.
    for (auto& record : m_switches) {
        //... code to handle switch cases truncated for brevity
    }

    for (size_t i = 0; i < m_codeBlock->numberOfExceptionHandlers(); ++i) {
        //... code to handle exception handlers truncated for brevity
    }

    for (auto& record : m_calls) {
        if (record.callee)
            patchBuffer.link(record.from, record.callee);
    }
    
    finalizeInlineCaches(m_getByIds, patchBuffer);
    finalizeInlineCaches(m_getByVals, patchBuffer);
    finalizeInlineCaches(m_getByIdsWithThis, patchBuffer);
    finalizeInlineCaches(m_putByIds, patchBuffer);
    finalizeInlineCaches(m_delByIds, patchBuffer);
    finalizeInlineCaches(m_delByVals, patchBuffer);
    finalizeInlineCaches(m_inByIds, patchBuffer);
    finalizeInlineCaches(m_instanceOfs, patchBuffer);

    //... code truncated for brevity

    {
        JITCodeMapBuilder jitCodeMapBuilder;
        for (unsigned bytecodeOffset = 0; bytecodeOffset < m_labels.size(); ++bytecodeOffset) {
            if (m_labels[bytecodeOffset].isSet())
                jitCodeMapBuilder.append(BytecodeIndex(bytecodeOffset), patchBuffer.locationOf<JSEntryPtrTag>(m_labels[bytecodeOffset]));
        }
        m_codeBlock->setJITCodeMap(jitCodeMapBuilder.finalize());
    }

    MacroAssemblerCodePtr<JSEntryPtrTag> withArityCheck = patchBuffer.locationOf<JSEntryPtrTag>(m_arityCheck);

    //... code truncated for brevity

    CodeRef<JSEntryPtrTag> result = FINALIZE_CODE(
        patchBuffer, JSEntryPtrTag,
        "Baseline JIT code for %s", toCString(CodeBlockWithJITType(m_codeBlock, JITType::BaselineJIT)).data());

    //... code truncated for brevity

    m_codeBlock->setJITCode(
        adoptRef(*new DirectJITCode(result, withArityCheck, JITType::BaselineJIT)));

    if (JITInternal::verbose)
        dataLogF("JIT generated code for %p at [%p, %p).\n", m_codeBlock, result.executableMemory()->start().untaggedPtr(), result.executableMemory()->end().untaggedPtr());

    return CompilationSuccessful;
}

Once linking has completed, a reference pointer to the JITed code is set in the codeblock. The Baseline JIT optimised code is now ready to be executed. The call stack at this point should resemble the snippet below:

libJavaScriptCore.so.1!JSC::JIT::link(JSC::JIT * const this) (/home/amar/workspace/WebKit/Source/JavaScriptCore/jit/JIT.cpp:970)
libJavaScriptCore.so.1!JSC::JITWorklist::Plan::finalize(JSC::JITWorklist::Plan * const this) (/home/amar/workspace/WebKit/Source/JavaScriptCore/jit/JITWorklist.cpp:55)
libJavaScriptCore.so.1!JSC::JITWorklist::Plan::compileNow(JSC::CodeBlock * codeBlock, JSC::BytecodeIndex loopOSREntryBytecodeIndex) (/home/amar/workspace/WebKit/Source/JavaScriptCore/jit/JITWorklist.cpp:87)
libJavaScriptCore.so.1!JSC::JITWorklist::compileLater(JSC::JITWorklist * const this, JSC::CodeBlock * codeBlock, JSC::BytecodeIndex loopOSREntryBytecodeIndex) (/home/amar/workspace/WebKit/Source/JavaScriptCore/jit/JITWorklist.cpp:238)
libJavaScriptCore.so.1!JSC::LLInt::jitCompileAndSetHeuristics(JSC::VM & vm, JSC::CodeBlock * codeBlock, JSC::BytecodeIndex loopOSREntryBytecodeIndex) (/home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LLIntSlowPaths.cpp:386)
libJavaScriptCore.so.1!JSC::LLInt::llint_loop_osr(JSC::CallFrame * callFrame, const JSC::Instruction * pc) (/home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LLIntSlowPaths.cpp:481)
libJavaScriptCore.so.1!llint_op_loop_hint() (/home/amar/workspace/WebKit/Source/JavaScriptCore/llint/LowLevelInterpreter64.asm:97)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)

One can dump the disassembly of the JITed code by adding the --dumpDisassembly=true flag to launch.json or to the commandline. The disassembly for the compiled bytecodes printed to stdout would appear as follows:

Generated Baseline JIT code for <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43], instructions size = 43
   Source: for(let x = 0; x < 5; x++){ let y = x+10; }
   Code at [0x7fffaecff680, 0x7fffaecfffc0):
          0x7fffaecff680: nop 
          0x7fffaecff681: push %rbp
          0x7fffaecff682: mov %rsp, %rbp
          0x7fffaecff685: mov $0x7fffae3c4000, %r11
          #... truncated for brevity
    [   0] enter              
          0x7fffaecff728: push %rax
          0x7fffaecff729: mov $0x7ffff4f69936, %rax
          0x7fffaecff733: push %rcx
          #... truncated for brevity
    [   1] get_scope          loc4
          0x7fffaecff7d4: push %rax
          0x7fffaecff7d5: mov $0x7ffff4f69936, %rax
          0x7fffaecff7df: push %rcx
          #... truncated for brevity
    [  23] loop_hint          
          0x7fffaecff9a8: push %rax
          0x7fffaecff9a9: mov $0x7ffff4f69936, %rax
          0x7fffaecff9b3: push %rcx
          0x7fffaecff9b4: mov $0x7ffff4f6a88b, %rcx
          0x7fffaecff9be: push %rdx
          0x7fffaecff9bf: mov $0x7ffff4f63e72, %rdx
          0x7fffaecff9c9: push %rbx
          0x7fffaecff9ca: mov $0x7fffeedfe628, %rbx
          0x7fffaecff9d4: call *%rax
    (End Of Main Path)
    (S) [   6] check_traps        
          0x7fffaecffb70: push %rax
          0x7fffaecffb71: mov $0x7ffff4f69936, %rax
          0x7fffaecffb7b: push %rcx
          0x7fffaecffb7c: mov $0x7ffff4f6a88b, %rcx
          #... truncated for brevity
    (S) [  28] add                loc8, loc7, Int32: 10(const4), OperandTypes(126, 3)
          0x7fffaecffd0d: push %rax
          0x7fffaecffd0e: mov $0x7ffff4f69936, %rax
          0x7fffaecffd18: push %rcx
          #... truncated for brevity
    (S) [  34] inc                loc7
          0x7fffaecffd99: push %rax
          0x7fffaecffd9a: mov $0x7ffff4f69936, %rax
          0x7fffaecffda4: push %rcx
          #... truncated for brevity
    (S) [  37] jless              loc7, Int32: 5(const3), -14(->23)
          0x7fffaecffe10: push %rax
          0x7fffaecffe11: mov $0x7ffff4f69936, %rax
          0x7fffaecffe1b: push %rcx
          0x7fffaecffe1c: mov $0x7ffff4f6a88b, %rcx
          0x7fffaecffe26: push %rdx
          #... truncated for brevity
    (End Of Slow Path)
          0x7fffaecffec5: mov $0x7fffae3c4000, %rdi
          0x7fffaecffecf: mov $0x0, 0x24(%rbp)
          0x7fffaecffed6: mov $0x7fffaeb09fd8, %r11
          #... truncated for brevity
JIT generated code for 0x7fffae3c4000 at [0x7fffaecff680, 0x7fffaecfffc0).

JIT Execution

Stepping back up the call stack, execution returns to llint_loop_osr. The compiled and linked machine code is now referenced by codeBlock, and to complete the OSR to the JITed code, the engine needs to retrieve the executableAddress of the JITed code and have the LLInt jump to this address and continue execution:

if (!jitCompileAndSetHeuristics(vm, codeBlock, loopOSREntryBytecodeIndex))
        LLINT_RETURN_TWO(nullptr, nullptr);
    
    //... code truncated for brevity

    const JITCodeMap& codeMap = codeBlock->jitCodeMap();
    CodeLocationLabel<JSEntryPtrTag> codeLocation = codeMap.find(loopOSREntryBytecodeIndex);

    void* jumpTarget = codeLocation.executableAddress();
    
    LLINT_RETURN_TWO(jumpTarget, callFrame->topOfFrame());

The snippet above demonstrates how the engine first retrieves a codeMap for the compiled codeBlock and then looks up the codeLocation of the bytecode index at which OSR entry is to be performed. In our example, this is the loop_hint bytecode index. The executable address is retrieved from codeLocation and, along with the callFrame, is passed to LLINT_RETURN_TWO. The call to LLINT_RETURN_TWO returns execution back to checkSwitchToJITForLoop:

macro checkSwitchToJITForLoop()
    checkSwitchToJIT(
        1,
        macro()
            storePC()
            prepareStateForCCall()
            move cfr, a0
            move PC, a1
            cCall2(_llint_loop_osr)
            btpz r0, .recover      <-- execution returns here after Baseline JIT compilation
            move r1, sp
            jmp r0, JSEntryPtrTag  <-- jump to JITed code
        .recover:
            loadPC()
        end)
end

Setting a breakpoint at jmp r0, JSEntryPtrTag and inspecting the value of r0 (which is rax) shows that it contains the executable address of our JITed code; more specifically, the executable address of the start of the compiled loop_hint bytecode.

jump-to-baseline-jit

One can trace execution of the compiled code in the Baseline JIT by adding the "--traceBaselineJITExecution=true" flag to launch.json or on the commandline. This forces the Baseline JIT to insert tracing probes in the compiled code, which print to stdout:

Optimized <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, LLIntGlobal, 43] with Baseline JIT into 2368 bytes in 1.509752 ms.
JIT generated code for 0x7fffae3c4000 at [0x7fffaecff680, 0x7fffaecfffc0).
    JIT compilation successful.
Installing <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [23] op_loop_hint cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [24] op_check_traps cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [25] op_mov cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [28] op_add cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [34] op_inc cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [37] op_jless cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]
JIT [41] op_end cfr 0x7fffffffcb40 @ <global>#DETOqr:[0x7fffae3c4000->0x7fffeedcb768, BaselineGlobal, 43]

Profiling Tiers

As bytecode is executed in the LLInt and the Baseline JIT, the two tiers gather profile data on the bytecodes being executed, which is then propagated to the higher optimising JIT tiers (i.e. DFG and FTL). Profiling aids speculative compilation3, which is a key aspect of JIT engine optimisation. The Profiling section3 of the WebKit blog does an excellent job of describing the need for profiling, the philosophy behind JavaScriptCore's profiling goals, and the details of the profiling implementation in the two profiling tiers. It is recommended that the reader first familiarise themselves with it before continuing with the rest of this section.

profiling_tiers

To re-iterate, the key design goals of profiling in JavaScriptCore are summarised in the excerpt below3:

  • Profiling needs to focus on noting counterexamples to whatever speculations we want to do. We don’t want to speculate if profiling tells us that the counterexample ever happened, since if it ever happened, then the effective value of this speculation is probably negative. This means that we are not interested in collecting probability distributions. We just want to know if the bad thing ever happened.
  • Profiling needs to run for a long time. It’s common to wish for JIT compilers to compile hot functions sooner. One reason why we don’t is that we need about 3-4 “nines” of confidence that the counterexamples didn’t happen. Recall that our threshold for tiering up into the DFG is about 1000 executions. That’s probably not a coincidence.

Profiling Sources

The profiling tiers gather profiling data from six main sources, listed as follows3:

  • Case Flags – support branch speculation
  • Case Counts – support branch speculation
  • Value Profiling – type inference of values
  • Inline Caches – type inference of object structure
  • Watchpoints – support heap speculation
  • Exit Flags – support speculation backoff

Case Flags aid branch speculation: the engine uses bit flags to record whether a branch was ever taken and speculates accordingly. Case Counts are a legacy version of Case Flags where, instead of setting a bit flag, the engine counted the number of times a branch was taken. This approach proved less optimal (details of which are discussed in the WebKit blog3) and, with a few exceptions, has largely been made obsolete. Tracing the add opcode is a good exercise in exploring how Case Flags function.
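
The difference between the two approaches can be sketched in a few lines; this is an illustrative model, not JSC's data structures. A sticky one-bit flag answers the only question the speculative compiler cares about, namely whether the counterexample ever happened, at lower cost than a full count:

```python
# Illustrative contrast between Case Counts (legacy) and Case Flags: the
# speculative compiler only wants to know whether the "bad" branch was EVER
# taken, so a sticky one-bit flag is enough; a full count adds bookkeeping
# without adding useful information.
class CaseCountProfile:          # legacy approach: count every taken branch
    def __init__(self):
        self.taken = 0
    def record(self):
        self.taken += 1

class CaseFlagProfile:           # current approach: a sticky bit
    def __init__(self):
        self.ever_taken = False
    def record(self):
        self.ever_taken = True

counts, flag = CaseCountProfile(), CaseFlagProfile()
for _ in range(3):               # the counterexample happens a few times
    counts.record()
    flag.record()

print(counts.taken, flag.ever_taken)  # 3 True -> the flag alone already
                                      # tells us not to speculate here
```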

Value profiling is the most common profiling method used by JavaScriptCore. Since JavaScript is a dynamic language, JavaScriptCore needs a way to infer the runtime types of the raw values (e.g. integers, doubles, strings, objects, etc.) being used by the program. To do this, JavaScriptCore encodes raw values using a process called NaN-boxing, which allows the engine to infer the type of data being operated on. This encoding mechanism is documented in JSCJSValue.h.
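
A simplified sketch of that encoding scheme, modeled on the 64-bit layout described in the comments of JSCJSValue.h, is shown below. The constant names mirror those comments, but the code itself is an illustration rather than the engine's implementation:

```python
# Simplified sketch of JavaScriptCore's 64-bit JSValue encoding (NaN-boxing):
# Int32s carry a NumberTag in the high bits, doubles are shifted up by
# DoubleEncodeOffset into otherwise-unused NaN space, and pointers keep
# their top 16 bits clear.
import struct

NumberTag          = 0xFFFE_0000_0000_0000   # high bits marking a number
DoubleEncodeOffset = 1 << 49                 # shifts doubles into NaN space

def encode_int32(v):
    return NumberTag | (v & 0xFFFF_FFFF)

def encode_double(d):
    bits, = struct.unpack("<Q", struct.pack("<d", d))
    return (bits + DoubleEncodeOffset) & 0xFFFF_FFFF_FFFF_FFFF

def decode(bits):
    if (bits & NumberTag) == NumberTag:                    # Int32
        v = bits & 0xFFFF_FFFF
        return v - 0x1_0000_0000 if v >= 0x8000_0000 else v
    if bits & NumberTag:                                   # Double
        raw = (bits - DoubleEncodeOffset) & 0xFFFF_FFFF_FFFF_FFFF
        return struct.unpack("<d", struct.pack("<Q", raw))[0]
    return ("cell", bits)                                  # pointer / special value

print(decode(encode_int32(-7)))    # -7
print(decode(encode_double(1.5)))  # 1.5
```

Because the tag lives entirely in the high bits, a single mask-and-compare is enough for the engine to distinguish an Int32 from a double or a cell pointer.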

Inline Caches play a huge role not only in speculative compilation but also in optimising property accesses and function calls. They allow the engine to infer the type of a value based on the operation that was performed (e.g. a property access or a function call). Discussing inline caches properly would take a blog post in itself, and indeed the WebKit blog3 devotes a fair chunk of its length to Inline Caches and the speculation they allow JavaScriptCore to perform. It is highly recommended that the reader take the time to explore Inline Caches as an exercise.
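
The core trick can nonetheless be modeled in a few lines: the first (slow) property lookup records the object's structure and the property's slot offset at the access site, so subsequent accesses on objects of the same shape reduce to a compare plus an indexed load. The sketch below is a heavily simplified, hypothetical model of a monomorphic get_by_id cache, not JSC's implementation:

```python
# Minimal model of a monomorphic inline cache for property access: the cache
# remembers (structure id -> slot offset) for one shape, turning repeated
# lookups on same-shaped objects into a single compare + indexed load.
class JSObject:
    def __init__(self, structure_id, slots):
        self.structure_id = structure_id   # identifies the object's shape
        self.slots = slots                 # property storage, indexed by offset

class GetByIdIC:
    def __init__(self, prop, layout):
        self.prop = prop
        self.layout = layout               # structure_id -> {name: offset}
        self.cached_structure = None
        self.cached_offset = None
        self.hits = self.misses = 0

    def get(self, obj):
        if obj.structure_id == self.cached_structure:  # fast path
            self.hits += 1
            return obj.slots[self.cached_offset]
        self.misses += 1                               # slow path: full lookup
        offset = self.layout[obj.structure_id][self.prop]
        self.cached_structure, self.cached_offset = obj.structure_id, offset
        return obj.slots[offset]

ic = GetByIdIC("x", layout={1: {"x": 0, "y": 1}})
objs = [JSObject(1, [i, i * 2]) for i in range(100)]   # same shape every time
values = [ic.get(o) for o in objs]
print(ic.misses, ic.hits)  # 1 99 -> only the first access took the slow path
```

The miss/hit counts also hint at why inline caches double as a profiling source: a cache that stays monomorphic tells the optimising tiers the access site only ever sees one object shape.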

Both Watchpoints and Exit Flags become more relevant when exploring the optimising tiers in later parts of this blog series. For the moment it is sufficient to briefly describe them as defined in the webkit blog3:

A watchpoint in JavaScriptCore is nothing more than a mechanism for registering for notification that something happened. Most watchpoints are engineered to trigger only the first time that something bad happens; after that, the watchpoint just remembers that the bad thing had ever happened.

watchpoints let inline caches and the speculative compilers fold certain parts of the heap’s state to constants by getting a notification when things change.

This post previously touched upon exit counts in the Tiering Up section of the LLInt; Exit Flags are functionally similar in the sense that they record why an OSR exit occurred. The WebKit blog summarises Exit Flags as follows:

The exit flags are a check on the rest of the profiler. They are telling the compiler that the profiler had been wrong here before, and as such, shouldn’t be trusted anymore for this code location.

Given the length of this blog post, a conscious decision was made to avoid deep dives into each of these profiling methods. To reiterate a point made earlier, it is recommended that the reader utilise the WebKit blog3 and the debugging methodology demonstrated in this post to explore the various profiling methods themselves.

JavaScriptCore Profiler

The jsc shell provides a commandline option to record profiling information for the code being executed, which can then be parsed into a human readable format. This can be achieved by adding -p <file path to store profile data> to the commandline args. The profiler captures data from the sources mentioned in the previous section and stores it as serialised JSON. This profile data can then be parsed using the display-profiler-output script, which allows examining various aspects of the profiled information5 6.

Let’s enable profiling in our debugging environment by adding the -p flag to the commandline arguments. Note this can also be set up by adding the --enableProfiler=true commandline flag and setting the environment variable JavaScript_PROFILER_PATH to the output path for the profile data. Updating launch.json to enable profile data gathering should look similar to the snippet below:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/jsc",
            "args": [ "--useDFGJIT=false", "-p", "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/profile_data.json", "/home/amar/workspace/WebKit/WebKitBuild/Debug/bin/test.js"],
            //... truncated for brevity
        }
    ]
}

Let’s now use the following test script to capture some profiling information; it will generate value-profiling data for the add function:

$ cat test.js

function add(x,y){
    return x+y;
}

let x = 1;

for(let y = 0; y < 1000; y++){
    add(x,y)
}

In the snippet above, the add function takes two arguments, x and y. This function is called several times in the for loop to trigger Baseline JIT optimisation. From static inspection of the js code above, we can determine that the arguments x and y are going to be Int32 values. Now let's review the profile data collected for this code with display-profiler-output:

~/workspace/WebKit$ ./Tools/Scripts/display-profiler-output ./WebKitBuild/Debug/bin/profile_data.json

   CodeBlock    #Instr  Source Counts         Machine Counts      #Compil  Inlines     #Exits      Last Opts    Source                                                                            
                       Base/DFG/FTL/FTLOSR  Base/DFG/FTL/FTLOSR                       Src/Total   Get/Put/Call
  add#DzQYpR      15      967/0/0/0             967/0/0/0               1    0/0       0             0/0/0     function add(x,y){ return x+y; }
<global>#D5ICt4  116      506/0/0/0             506/0/0/0               1    0/0       0             0/0/0     function add(x,y){ return x+y; } let x = 1; for(let y = 0; y < 1000; y++){ add(x,y) }
> 

This prints statistics on the execution counts of the various codeblocks that were generated, the number of compilations that occurred, exit counts, etc. It also drops into a prompt that allows us to query additional information on the gathered profile data. The help command lists the supported options:

> help
summary (s)     Print a summary of code block execution rates.
full (f)        Same as summary, but prints more information.
source          Show the source for a code block.
bytecode (b)    Show the bytecode for a code block, with counts.
profiling (p)   Show the (internal) profiling data for a code block.
log (l)         List the compilations, exits, and jettisons involving this code block.
events (e)      List of events involving this code block.
display (d)     Display details for a code block.
inlines         Show all inlining stacks that the code block was on.
counts          Set whether to show counts for 'bytecode' and 'display'.
sort            Set how to sort compilations before display.
help (h)        Print this message.
quit (q)        Quit.
> 

To output the profiling data collected for a codeblock, use the profiling (p) command:

> p DzQYpR
Compilation add#DzQYpR-1-Baseline:
      arg0: predicting OtherObj
      arg1: predicting BoolInt32
      arg2: predicting NonBoolInt32
        [   0] enter              
        [   1] get_scope          loc4
        [   3] mov                loc5, loc4
        [   6] check_traps        
        [   7] add                loc6, arg1, arg2, OperandTypes(126, 126)
        [  13] ret                loc6
> 

arg1 and arg2 represent the argument variables x and y in our test script (arg0 holds the this value). From the snippet above, one can see the profile information gathered for the two arguments: JavaScriptCore has predicted that arg1 is a BoolInt32, i.e. an Int32 whose value is 0 or 1 (x is always 1), and that arg2 is a NonBoolInt32, an Int32 with any other value.
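Internally, each speculated type is a bit in a bitmask, and merging two observed speculations is a bitwise OR (mergeSpeculations() in SpeculatedType.h is exactly that). The sketch below models this in JavaScript; the bit positions are made up for illustration and do not match the real header:

```javascript
// Illustrative bit positions; the real masks are defined in SpeculatedType.h.
const SpecBoolInt32    = 1 << 0; // definitely an Int32 with value 0 or 1
const SpecNonBoolInt32 = 1 << 1; // definitely an Int32, but not 0 or 1
const SpecInt32Only    = SpecBoolInt32 | SpecNonBoolInt32;

// Merging two speculations is a bitwise OR, as in the real engine.
function mergeSpeculations(a, b) {
    return a | b;
}

// A value profile that has observed both 1 and 42 widens its
// prediction from BoolInt32 to Int32Only.
let prediction = SpecBoolInt32;
prediction = mergeSpeculations(prediction, SpecNonBoolInt32);
console.log(prediction === SpecInt32Only); // true
```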

Let’s now modify the script to supply a different mix of argument types to the add function:

$ cat test.js

function add(x,y){
    return x+y;
}

let x = 1.0;

for(let y = 0.1; y < 1000; y++){
    add(x,y)
}

The arguments x and y are now passed floating-point values. Let’s regenerate the profiling data and examine the predicted types:

~/workspace/WebKit$ ./Tools/Scripts/display-profiler-output ./WebKitBuild/Debug/bin/profile_data.json
   CodeBlock    #Instr  Source Counts       Machine Counts      #Compil  Inlines        #Exits     Last Opts    Source                                                                                           
                       Base/DFG/FTL/FTLOSR Base/DFG/FTL/FTLOSR                        Src/Total   Get/Put/Call
  add#DzQYpR      15      967/0/0/0             967/0/0/0              1       0/0       0          0/0/0        function add(x,y){ return x+y; }
<global>#AYjpyE  116      506/0/0/0             506/0/0/0              1       0/0       0          0/0/0        function add(x,y){ return x+y; } let x = 1.0; for(let y = 0.1; y < 1000; y++){ add(x,y) }
> p DzQYpR
Compilation add#DzQYpR-1-Baseline:
      arg0: predicting OtherObj
      arg1: predicting AnyIntAsDouble
      arg2: predicting NonIntAsDouble
        [   0] enter              
        [   1] get_scope          loc4
        [   3] mov                loc5, loc4
        [   6] check_traps        
        [   7] add                loc6, arg1, arg2, OperandTypes(126, 126)
        [  13] ret                loc6
> 
 

As seen in the output above, the predicted types for x and y are now AnyIntAsDouble and NonIntAsDouble, i.e. doubles that do and do not hold integral values respectively. The definitions for the various speculated types can be found in SpeculatedType.h, which is generated at compile time and is located under <WebKit build directory>/Debug/DerivedSources/ForwardingHeaders/JavaScriptCore/SpeculatedType.h. The table below7 shows the unique speculation types used in JavaScriptCore:

speculation-types
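To tie the table back to the two profiling runs above, here is a hedged sketch loosely modelled on the engine's speculationFromValue(). In JavaScript we cannot tell whether the engine boxed a number as an Int32 or a Double (that tag lives in the JSValue encoding), so the sketch takes it as an explicit flag; the function is hypothetical and only covers the numeric cases seen in the two test scripts:

```javascript
// Hypothetical classifier for numeric values, mirroring the speculation
// names from the table. storedAsDouble stands in for the JSValue tag,
// which plain JS code cannot observe.
function speculationFromNumber(value, storedAsDouble) {
    if (!storedAsDouble)
        return (value === 0 || value === 1) ? "BoolInt32" : "NonBoolInt32";
    return Number.isInteger(value) ? "AnyIntAsDouble" : "NonIntAsDouble";
}

console.log(speculationFromNumber(1, false));  // BoolInt32      (let x = 1)
console.log(speculationFromNumber(42, false)); // NonBoolInt32   (y in the first loop)
console.log(speculationFromNumber(1.0, true)); // AnyIntAsDouble (let x = 1.0)
console.log(speculationFromNumber(0.1, true)); // NonIntAsDouble (y in the second loop)
```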

Conclusion

This post explored the LLInt and the Baseline JIT by diving into the details of their implementation and tracing the execution of bytecode in the two tiers. The post also provided an overview of offlineasm assembly and how to read the LLInt implementation that is written in it. With an understanding of how the LLInt is constructed and how execution in the LLInt works, the post discussed OSR, the mechanism that allows execution to transition from a lower tier to a higher JIT tier. Finally, the post concluded by briefly touching on profiling bytecode and the various sources the engine uses to gather profiling data.

Part III of this blog series dives into the details of the Data Flow Graph (DFG). The DFG is the first of two optimising compilers used by JavaScriptCore, and the post will explore the process of tiering up from the Baseline JIT into the DFG, the DFG IR and the DFG compiler pipeline.

We hope you’ve found this post informative. If you have questions, spot something that’s incorrect or have suggestions on improving this writeup, do reach out to the author @amarekano or @Zon8Research on Twitter. We are more than happy to discuss this at length with anyone interested in this subject and would love to hear your thoughts on it.

Appendix