LSS - Under the hood

There are four background threads responsible for writing segments, flushing the log, checkpointing the store (setting the point from which a recovery scan is required) and cleaning partially filled segments to avoid fragmentation.

The Recoverable Packet Map (RPM) is an 8-level hierarchical map keyed by 64-bit OID, used to record the locations of blobs in the store. The Segment Utilisation Table (SUT) records the utilisation of every segment in the LSS. Both of these data structures are only updated on disk during a checkpoint. Checkpoints are performed after writing 64MB to the store.
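The checkpoint trigger can be sketched as follows. This is a minimal illustration, not the actual CEDA implementation; the class and method names are hypothetical, and the real checkpoint would flush the RPM and SUT to disk.

```python
CHECKPOINT_INTERVAL = 64 * 1024 * 1024  # checkpoint after 64MB written to the log

class StoreSketch:
    """Hypothetical sketch: the RPM and SUT live in memory and are only
    written to disk when a checkpoint is performed."""
    def __init__(self):
        self.bytes_since_checkpoint = 0
        self.checkpoints = 0

    def append_to_log(self, num_bytes):
        self.bytes_since_checkpoint += num_bytes
        if self.bytes_since_checkpoint >= CHECKPOINT_INTERVAL:
            self.checkpoint()

    def checkpoint(self):
        # In the real store this would flush the RPM and SUT to disk
        # and record the new position from which a recovery scan starts.
        self.checkpoints += 1
        self.bytes_since_checkpoint = 0

store = StoreSketch()
for _ in range(48):
    store.append_to_log(4 * 1024 * 1024)  # write 48 segments of 4MB each
```

Writing 192MB in 4MB segments triggers a checkpoint every 16 segments under this model.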

No need for Write Ahead Logging

In most conventional DB systems, data is read and written in pages. Atomicity of transactions is achieved by the technique of Write Ahead Logging (WAL): changes to data pages must first be recorded and flushed in a separate log file, which can be scanned during recovery to undo/redo partially completed transactions, ensuring the database recovers to a consistent state when the server is restarted.

Unfortunately non-volatile storage devices have write caches which can reorder writes, defeating the WAL assumption. For that reason it is actually unsafe for products like SQL Server or Oracle to use stock hardware without disabling the write cache (but that is rarely done, because write performance would drop by a factor of 10 or more - maybe even 100).

By contrast, the CEDA LSS uses a far more resilient and efficient system for crash recovery that is almost independent of the order of writes, and therefore allows crash recovery to be performed without disabling the write cache on the hard disk.

The CEDA LSS achieves excellent write performance by treating the log itself as the data! Therefore all writing occurs at the end of the log, allowing for continuous writing at the sustained write rate of the storage device.

The log is divided into relatively large pieces (called segments), perhaps 4MB each. Reading and writing at this coarse granularity minimises disk seek overheads.

Segments in the LSS File

The underlying LSS file consists of a root block followed by a linear list of segments. The segment size can be chosen at the time the store is created. Typically a segment size of about 4MB is appropriate. The ideal size depends on how well the data tends to be clustered on disk, and the product of latency and bandwidth. It represents a tradeoff between sustained I/O and random access.

Serial elements are written one after the other to a segment with only a 9 byte header on each serial element. The space overhead of the CEDA LSS is very low in practice.

If the end of the segment is encountered while writing a serial element it can be continued in a fresh segment.
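A sketch of how serial elements might be packed into segments, including continuation into a fresh segment when one fills up. The particular 9-byte header layout shown (a 1-byte flags field plus an 8-byte length) is an assumption for illustration; the actual CEDA header format is not described here.

```python
import struct

SEGMENT_SIZE = 4 * 1024 * 1024
HEADER_SIZE = 9  # each serial element piece carries a 9 byte header

def pack_elements(payloads, segment_size=SEGMENT_SIZE):
    """Append serial elements one after the other into segments.
    An element that hits the end of a segment is continued in a
    fresh segment; a hypothetical flag bit marks continuation pieces."""
    segments = [bytearray()]
    for payload in payloads:
        remaining = payload
        first = True
        while remaining:
            seg = segments[-1]
            free = segment_size - len(seg) - HEADER_SIZE
            if free <= 0:
                segments.append(bytearray())  # start a fresh segment
                continue
            piece, remaining = remaining[:free], remaining[free:]
            flags = 0 if first else 1  # hypothetical continuation flag
            seg += struct.pack("<BQ", flags, len(piece)) + piece
            first = False
    return segments

# Tiny segments make the continuation behaviour visible.
segs = pack_elements([b"x" * 100, b"y" * 50], segment_size=128)
```

With 128-byte segments, the second element above is split: its first 10 bytes fill segment 0 and the remaining 40 bytes continue in segment 1.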

Forward chaining of segments

Segments are forward chained to allow for a crash recovery scan if the store is not closed properly.
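Forward chaining might be modelled like this: each segment records which segment was written next, so a recovery scan can walk the chain from the last checkpointed position. The data layout and names here are hypothetical.

```python
def recovery_scan(segments, start):
    """Walk the forward chain of segments from the last checkpointed
    position, collecting the serial elements found along the way.
    `segments` maps segment index -> (next segment index, elements)."""
    replayed = []
    index = start
    while index is not None:
        next_index, elements = segments[index]
        replayed.extend(elements)
        index = next_index
    return replayed

# Hypothetical chain of segments written after the last checkpoint:
# segment 3 was written first, then 7, then 2.
chain = {3: (7, ["a"]), 7: (2, ["b", "c"]), 2: (None, ["d"])}
found = recovery_scan(chain, 3)
```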

The segment cleaner

Segments are cleaned automatically when their space utilisation falls below a threshold. If segment utilisation falls below 80%, the remaining serial elements are copied to the end of the log and the segment is recycled (put into a free segment list).
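The cleaning policy can be sketched as follows. This is a simplified model, assuming the SUT tracks live bytes per segment; the function and parameter names are hypothetical.

```python
SEGMENT_SIZE = 4 * 1024 * 1024
CLEAN_THRESHOLD = 0.80  # clean segments whose utilisation drops below 80%

def clean_segments(sut, live_elements, log_append, free_list):
    """sut: segment index -> live bytes (as tracked by the SUT).
    live_elements: segment index -> surviving serial elements.
    Segments below the threshold have their live elements copied to
    the end of the log, then go onto the free segment list."""
    for seg, live_bytes in list(sut.items()):
        if live_bytes / SEGMENT_SIZE < CLEAN_THRESHOLD:
            for element in live_elements[seg]:
                log_append(element)   # relocate survivors to the log tail
            del sut[seg]
            free_list.append(seg)     # recycle the segment

# Segment 0 is 90% utilised (kept); segment 1 is 50% utilised (cleaned).
sut = {0: int(0.9 * SEGMENT_SIZE), 1: int(0.5 * SEGMENT_SIZE)}
live = {0: ["keep"], 1: ["move1", "move2"]}
tail, free = [], []
clean_segments(sut, live, tail.append, free)
```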