Review of run-time performance post the LFS changes

LFS moves code and read-only strings, which are represented using various Lua VM record structures, into flash memory. The ESP8266 hardware only supports word-aligned data access from flash, so 8-bit and 16-bit accesses raise an exception that is handled in software by our unaligned-access exception handler. This is functionally transparent to the executing code, though handling such exceptions has a run-time cost. The purpose of this review was to ensure that the performance impact of using LFS is acceptable.
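
To illustrate what such a handler has to do for each faulting load, here is a minimal host-side sketch in C. It is not the firmware's handler, which works on the Xtensa exception frame; the emu_* names are hypothetical, and little-endian byte order is assumed, as on the ESP8266.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fetch the aligned 32-bit word containing address a (a single aligned load). */
static uint32_t load_word(uintptr_t a) {
    uint32_t w;
    memcpy(&w, (const void *)(a & ~(uintptr_t)3), sizeof w);
    return w;
}

/* Emulates l8ui: unsigned byte load. Assumes little-endian byte order. */
static uint32_t emu_l8ui(const void *p) {
    uintptr_t a = (uintptr_t)p;
    return (load_word(a) >> ((a & 3u) * 8u)) & 0xffu;
}

/* Emulates l16ui: unsigned halfword load from a 2-byte-aligned address. */
static uint32_t emu_l16ui(const void *p) {
    uintptr_t a = (uintptr_t)p;
    assert((a & 1u) == 0);              /* halfwords never straddle a word */
    return (load_word(a) >> ((a & 2u) * 8u)) & 0xffffu;
}

/* Emulates l16si: as l16ui, then sign-extend from 16 to 32 bits. */
static int32_t emu_l16si(const void *p) {
    int32_t v = (int32_t)emu_l16ui(p);
    return v >= 0x8000 ? v - 0x10000 : v;
}
```

Each emulated load costs one aligned word fetch plus a shift and mask. On the real hardware the exception entry and exit are added on top of that, which is where the run-time cost comes from.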

  • The elements moved into RO flash are in Lua structures: Proto, TValue, TString and stringtable. Nearly all of the fields in these structures are 32-bit aligned. None are 16-bit, and there are a few byte-sized fields. (Grepping an object dump of the liblua.a library and visually reviewing the 16-bit hits confirms this.) Similarly, 16-bit constants are rarely used in Lua code and modules, so 95+% of software exceptions are handling execution of the l8ui instruction accessing data in the mapped flash address space. I have therefore tweaked the exception handler code to optimise the l8ui path, whilst still giving good performance for l16ui and l16si instruction execution.
  • I initially thought that the main hotspots would be with:
    • The 8-bit tt field in the GCObject. However, GCObjects are referenced through a TValue, which also holds a 32-bit copy of this field, and nearly all of the RO access paths use this TValue copy.
    • The corresponding marked field in the GCObject, which is checked during GC (mainly to short-circuit any attempted GC of RO objects, which aren’t collectable), so I will replace this access by a macro which generates an aligned access for Xtensa targets.
    • The string and memory functions (such as c_strlen). These map onto ROM-based code which is already optimised for efficient flash access. For example, see Pfalcon’s ROM disassembly starting at offset 0x4000bf4c for the strlen() code. These routines fetch the data in 32-bit chunks into a register and then test each of the 4 bytes in the register separately. This approach generates no exceptions so long as the string operands are word-aligned. One of the changes introduced in the LFS patch, as recommended by @pjsg, was to go from -Os to -O2 optimisation. The only code generation impact here is to remove packing of strings and make all string allocations word-aligned. This has a small impact (~1%) on code size, but has a major performance benefit in removing nearly all unaligned software exceptions.
  • The typical cost of GC is roughly halved with LFS, since a large percentage of persistent GCobjects are moved into RO memory and out of the scope of GC sweeps.
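
The word-at-a-time approach used by the ROM string routines can be sketched in C as follows. This is an illustration only and not the ROM code: strlen_aligned is a hypothetical name, little-endian byte order is assumed, and the string's storage must be word-aligned and padded to a whole word, which is what the -O2 string alignment provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: strlen using only aligned 32-bit loads.
   s must be word-aligned and its storage padded to a multiple of 4 bytes. */
static size_t strlen_aligned(const char *s) {
    assert(((uintptr_t)s & 3u) == 0);
    size_t n = 0;
    for (;;) {
        uint32_t v;
        memcpy(&v, s + n, sizeof v);            /* one aligned word fetch */
        /* Test each of the 4 bytes in the register, lowest address first
           (little-endian assumption). */
        if ((v & 0x000000ffu) == 0) return n;
        if ((v & 0x0000ff00u) == 0) return n + 1;
        if ((v & 0x00ff0000u) == 0) return n + 2;
        if ((v & 0xff000000u) == 0) return n + 3;
        n += 4;
    }
}
```

Because every access is a full aligned word, scanning a string held in flash this way never touches the l8ui path and so never enters the exception handler.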

So the bottom line is that the LFS release actually represents an overall performance improvement (i) because of the switch to -O2 and (ii) because of the GC savings. Good news.

load_non_32_wide_handler

I played around with the handler whilst checking the disassembled code. The code path for l8ui exceptions is now 5 instructions shorter, at the cost of a couple of extra instructions on the l16 paths. This is maybe not really worth doing, but not worth discarding either now that I have done it.

Avoiding l8ui exceptions

I have added the following macro to lobject.h:

#define GET_BYTE_FN(name,wo,bo) \
static inline lu_byte get ## name(void *o) { \
  lu_byte res;  /* extract named field */ \
  asm ("l32i  %0, %1, " #wo "; extui %0, %0, " #bo ", 8;" : "=r"(res) : "r"(o) : );\
  return res; }

and so GET_BYTE_FN(marked,4,8) creates the inline access function getmarked(), which gets the marked field without raising an exception on the Xtensa compiles. (The host version just returns (o)->name.)
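
For readers without an Xtensa toolchain, the same l32i plus extui idea can be expressed in portable C. This is a sketch under stated assumptions: DemoHeader and the DEMO_/demo_ names are hypothetical stand-ins and not the definitions in lobject.h, the layout mimics a 32-bit target where a 4-byte link precedes tt and marked, and little-endian byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the common GC header on a 32-bit target. */
typedef struct {
    uint32_t next;    /* stands in for the 32-bit link pointer */
    uint8_t  tt;      /* word offset 4, bit offset 0 */
    uint8_t  marked;  /* word offset 4, bit offset 8 */
    uint16_t pad;     /* fills the rest of the word */
} DemoHeader;

static_assert(offsetof(DemoHeader, tt) == 4, "tt must be byte 0 of word 1");
static_assert(offsetof(DemoHeader, marked) == 5, "marked must be byte 1 of word 1");

/* wo = byte offset of the containing word, bo = bit offset within that word. */
#define DEMO_GET_BYTE_FN(name, wo, bo) \
static inline uint8_t demo_get_##name(const void *o) { \
    uint32_t w;                               /* aligned word load, like l32i */ \
    memcpy(&w, (const unsigned char *)o + (wo), sizeof w); \
    return (uint8_t)((w >> (bo)) & 0xffu);    /* 8-bit extract, like extui */ \
}

DEMO_GET_BYTE_FN(tt, 4, 0)
DEMO_GET_BYTE_FN(marked, 4, 8)
```

The word offset 4 and bit offset 8 used for marked fall straight out of this layout: it is the second byte of the second word.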

Using these accessors for tt and marked inside the macros that already wrap these field references, and then analysing the generated code, gives the following stats:

Pattern             Without       With          Delta
extui a, b, 0, 8    69            84            +21
extui a, b, 8, 8    0             11            +11
l8ui                585           576           -9
lua_rotable_base    0x4026ff48    0x4026ff98    80 bytes

So we have 33 locations in the code where extui extracts avoid S/W exceptions, at a cost of an 80-byte code increase in the firmware.

The easiest way to do this analysis is to use the Xtensa toolchain's objdump to disassemble individual object files or even object libraries. As the standard make creates a single liblua.a containing all of the object files for the Lua VM, this is the sort of command that I run to get these figures:

xtensa-lx106-elf-objdump -d app/lua/.output/eagle/debug/lib/liblua.a \
 | grep "extui.*, 8, 8"
