Kyoto Cabinet ingestion rate measurement

Kyoto Cabinet is a high-performance database storage engine. A database is a simple data file organized as either a hash table or a B+ tree, containing key-value pairs which are variable-length byte sequences.
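
As a minimal illustration of the API (not part of the original measurements; the file name test.kch is arbitrary), a C++ program that stores and retrieves a single record with a HashDB looks roughly like this:


#include <kchashdb.h>
#include <iostream>
#include <string>

using namespace kyotocabinet;

int main()
{
    HashDB db;
    // Create the database file if it doesn't exist, and open it for writing
    if (!db.open("test.kch", HashDB::OWRITER | HashDB::OCREATE))
    {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
    }

    db.set("key", "value");     // store a key-value pair

    std::string value;
    if (db.get("key", &value))  // look the key up again
    {
        std::cout << value << std::endl;
    }

    db.close();
    return 0;
}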

The performance measurements have been made on an Intel Core i7-4700MQ 2.4GHz laptop with 16GB RAM, running Windows 10 64 bit.

Writing to a 4GB RAM disk was measured. The RAM disk is useful for measuring the CPU load, because the storage device itself is not the bottleneck.

The time to write 1M, 10M or 100M records (i.e. key-value pairs) with sequentially increasing 64-bit keys has been measured. The size of the mapped value is 4, 8, 16, 32, ..., or 8192 bytes. A limit is imposed on the mapped value size so that the database fits on the 4GB RAM disk. The data is written using 1000 transactions.

After writing, the database is closed and then reopened, and all the records are read back with sequentially increasing keys.

It might be expected that the Kyoto HashDB will perform badly when the amount of data is too large to fit in the memory cache, because the hash function randomises the otherwise sequential key access.

Building the Kyoto library

Kyoto Cabinet version 1.2.76 was built using Microsoft Visual Studio 2015. The VCmakefile was modified to support an x64 build under VS2015 with dynamic CRT libs (i.e. using the /MD compiler switch instead of /MT):


VCPATH = C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC
SDKPATH = C:\Program Files (x86)\Windows Kits\8.1
WINKITS_10 = C:\Program Files (x86)\Windows Kits\10
WINKITS_VER = 10.0.10240.0

CLFLAGS = /nologo \
  /I "$(VCPATH)\Include" /I "$(VCPATH)\PlatformSDK\Include" /I "$(SDKPATH)\Include" \
  /I "." \
  /DNDEBUG /D_CRT_SECURE_NO_WARNINGS \
  /O2 /EHsc /W3 /wd4244 /wd4351 /wd4800 /MD

LIBFLAGS = /nologo \
  /libpath:"$(VCPATH)\lib\amd64" /libpath:"$(WINKITS_10)\Lib\$(WINKITS_VER)\ucrt\x64" /libpath:"$(SDKPATH)\Lib\winv6.3\um\x64" \
  /libpath:"."
LINKFLAGS = /nologo \
  /libpath:"$(VCPATH)\lib\amd64" /libpath:"$(WINKITS_10)\Lib\$(WINKITS_VER)\ucrt\x64" /libpath:"$(SDKPATH)\Lib\winv6.3\um\x64" \
  /libpath:"."

Test code

HashDB database tuning

The following C++ code has been used to tune the HashDB for a given number of records. Earlier results without tuning suggest this is worth doing to achieve better performance.


void tuneDB(HashDB& db, int numRecords)
{
    db.tune_options(HashDB::TLINEAR);  // linear collision chaining
    db.tune_buckets(2*numRecords);     // about twice as many buckets as records
    db.tune_map(512*1024*1024);        // 512MB internal memory-mapped region
}

TreeDB database tuning

The following C++ code has been used to tune the TreeDB for a given number of records:


void tuneDB(TreeDB& db, int numRecords)
{
    db.tune_options(TreeDB::TLINEAR);  // linear collision chaining
    db.tune_buckets(numRecords / 10);  // records are grouped into pages, so fewer buckets are needed
    db.tune_map(512*1024*1024);        // 512MB internal memory-mapped region
}

Code to write key-value pairs with sequential keys to a database

The following C++ code has been used to insert key-value pairs ("records") whose keys are sequentially increasing 64-bit numbers. The records are inserted using a given number of transactions, each of which inserts a given number of records. For simplicity the error handling code is not shown.


void write(const char* path, int numTxns, int numRecordsPerTxn, int mappedValueSize)
{
    HashDB db;
    int numRecords = numTxns * numRecordsPerTxn;
    tuneDB(db, numRecords);
    db.open(path, HashDB::OWRITER | HashDB::OCREATE);
    int64_t id = 0;
    std::vector<char> buffer(mappedValueSize);   // zero-filled mapped value
    for (int t = 0; t < numTxns; ++t)
    {
        db.begin_transaction();
        for (int r = 0; r < numRecordsPerTxn; ++r)
        {
            // the key is the raw 8 byte binary representation of the id
            db.set( (const char*)&id, sizeof(id), buffer.data(), buffer.size());
            ++id;
        }
        db.end_transaction();   // commits the transaction
    }
    db.close();
}
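
For example (the path below is hypothetical, not taken from the original test setup), the first HashDB row in the results corresponds to a call along these lines:


// 1,000 transactions x 1,000 records per transaction = 1M records,
// each with an 8 byte key and a 4 byte mapped value
write("R:\\bench.kch", 1000, 1000, 4);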

Code to read key-value pairs with sequential keys from a database

The following C++ code has been used to read all the records with sequentially increasing keys.


void read(const char* path, int numTxns, int numRecordsPerTxn, int mappedValueSize)
{
    HashDB db;
    int numRecords = numTxns * numRecordsPerTxn;
    tuneDB(db, numRecords);
    db.open(path, HashDB::OWRITER);
    std::vector<char> buffer(mappedValueSize);
    for (int64_t id = 0; id < numRecords; ++id)
    {
        // look up each key in turn; the mapped value is copied into the buffer
        db.get((const char*)&id, sizeof(id), buffer.data(), buffer.size());
    }
    db.close();
}
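
The timing code is not shown in the original listings. A minimal sketch of the kind of harness that could produce the times and rates reported below, assuming the write() and read() functions above, is:


#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical harness: times write() and read() and derives the record and
// byte rates reported in the result tables (kops/s = 1000 records/s,
// MB/s = 2^20 bytes/s).
void runBenchmark(const char* path, int numTxns, int numRecordsPerTxn, int mappedValueSize)
{
    using Clock = std::chrono::steady_clock;
    const double numRecords = double(numTxns) * numRecordsPerTxn;
    const double totalBytes = numRecords * (sizeof(int64_t) + mappedValueSize);

    Clock::time_point t0 = Clock::now();
    write(path, numTxns, numRecordsPerTxn, mappedValueSize);
    const double writeSecs = std::chrono::duration<double>(Clock::now() - t0).count();

    t0 = Clock::now();
    read(path, numTxns, numRecordsPerTxn, mappedValueSize);
    const double readSecs = std::chrono::duration<double>(Clock::now() - t0).count();

    std::printf("write: %.0f kops/s, %.1f MB/s\n",
                numRecords / writeSecs / 1000.0,
                totalBytes / writeSecs / (1024.0 * 1024.0));
    std::printf("read:  %.0f kops/s, %.1f MB/s\n",
                numRecords / readSecs / 1000.0,
                totalBytes / readSecs / (1024.0 * 1024.0));
}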

Results

It would also be good to do comparisons on the pair of SSDs in RAID0 in the test machine. Unfortunately it appears they have become heavily fragmented over time, such that the maximum write rate is now only about 300-400 MB/s (it was about 850 MB/s originally).

The write and read rates are given both in records per second (the unit kops/s means 1000 records per second) and in bytes per second (the unit MB/s means 2^20 bytes per second). For example, in the first HashDB row below, writing 12,000,000 bytes in 2.22 s corresponds to 12,000,000 / 2.22 / 2^20 ≈ 5.2 MB/s.

HashDB with RAM disk

num txns | num records per txn | mapped value size (bytes) | total data (bytes) | database size (bytes) | space overhead per record (bytes) | write data time (s) | read data time (s) | write record rate (kops/s) | read record rate (kops/s) | effective write rate (MB/s) | effective read rate (MB/s)
1,000 | 1,000 | 4 | 12,000,000 | 36,589,240 | 24.6 | 2.22 | 0.36 | 450 | 2,815 | 5.2 | 32.2
1,000 | 1,000 | 8 | 16,000,000 | 44,589,240 | 28.6 | 2.22 | 0.37 | 450 | 2,735 | 6.9 | 41.7
1,000 | 1,000 | 16 | 24,000,000 | 52,589,240 | 28.6 | 2.19 | 0.37 | 456 | 2,680 | 10.4 | 61.3
1,000 | 1,000 | 32 | 40,000,000 | 68,589,240 | 28.6 | 2.20 | 0.47 | 455 | 2,123 | 17.4 | 81.0
1,000 | 1,000 | 64 | 72,000,000 | 100,589,240 | 28.6 | 2.20 | 0.48 | 454 | 2,094 | 31.2 | 143.8
1,000 | 1,000 | 128 | 136,000,000 | 164,589,240 | 28.6 | 2.26 | 0.52 | 442 | 1,917 | 57.3 | 248.6
1,000 | 1,000 | 256 | 264,000,000 | 292,589,240 | 28.6 | 2.35 | 0.57 | 425 | 1,752 | 106.9 | 441.1
1,000 | 1,000 | 512 | 520,000,000 | 548,589,240 | 28.6 | 2.57 | 0.78 | 390 | 1,275 | 193.2 | 632.1
1,000 | 1,000 | 1,024 | 1,032,000,000 | 1,060,589,240 | 28.6 | 4.17 | 3.07 | 240 | 326 | 235.8 | 320.6
1,000 | 1,000 | 2,048 | 2,056,000,000 | 2,084,589,240 | 28.6 | 5.90 | 4.56 | 169 | 219 | 332.2 | 429.6
1,000 | 1,000 | 4,096 | 4,104,000,000 | 4,132,589,240 | 28.6 | 8.27 | 5.76 | 121 | 173 | 473.4 | 679.1
1,000 | 10,000 | 4 | 120,000,000 | 382,365,640 | 26.2 | 20.84 | 3.66 | 480 | 2,732 | 5.5 | 31.3
1,000 | 10,000 | 8 | 160,000,000 | 462,365,640 | 30.2 | 20.97 | 3.74 | 477 | 2,674 | 7.3 | 40.8
1,000 | 10,000 | 16 | 240,000,000 | 542,365,640 | 30.2 | 21.16 | 4.20 | 473 | 2,383 | 10.8 | 54.6
1,000 | 10,000 | 32 | 400,000,000 | 702,365,640 | 30.2 | 26.66 | 18.06 | 375 | 554 | 14.3 | 21.1
1,000 | 10,000 | 64 | 720,000,000 | 1,022,365,640 | 30.2 | 32.91 | 28.23 | 304 | 354 | 20.9 | 24.3
1,000 | 10,000 | 128 | 1,360,000,000 | 1,662,365,640 | 30.2 | 39.39 | 35.56 | 254 | 281 | 32.9 | 36.5
1,000 | 10,000 | 256 | 2,640,000,000 | 2,942,365,640 | 30.2 | 44.47 | 40.63 | 225 | 246 | 56.6 | 62.0
1,000 | 100,000 | 4 | 1,200,000,000 | 3,754,364,736 | 25.5 | 1,005.80 | 542.76 | 99 | 184 | 1.1 | 2.1

Note that the performance drops when the database size exceeds the cache (the 512MB memory map). For example, for 4 byte mapped values the write rate dropped from about 5.5 MB/s to about 1.1 MB/s when going from 10M to 100M records.

TreeDB with RAM disk

num txns | num records per txn | mapped value size (bytes) | total data (bytes) | database size (bytes) | space overhead per record (bytes) | write data time (s) | read data time (s) | write record rate (kops/s) | read record rate (kops/s) | effective write rate (MB/s) | effective read rate (MB/s)
1,000 | 1,000 | 4 | 12,000,000 | 37,548,544 | 25.6 | 8.13 | 1.01 | 123 | 987 | 1.4 | 11.3
1,000 | 1,000 | 8 | 16,000,000 | 34,189,056 | 18.2 | 7.43 | 1.00 | 135 | 996 | 2.1 | 15.2
1,000 | 1,000 | 16 | 24,000,000 | 48,890,880 | 24.9 | 6.41 | 0.98 | 156 | 1,020 | 3.6 | 23.3
1,000 | 1,000 | 32 | 40,000,000 | 74,410,752 | 34.4 | 8.14 | 1.10 | 123 | 906 | 4.7 | 34.6
1,000 | 1,000 | 64 | 72,000,000 | 187,328,256 | 115.3 | 9.18 | 4.44 | 109 | 225 | 7.5 | 15.5
1,000 | 1,000 | 128 | 136,000,000 | 295,424,512 | 159.4 | 13.07 | 6.45 | 77 | 155 | 9.9 | 20.1
1,000 | 1,000 | 256 | 264,000,000 | 567,815,680 | 303.8 | 17.77 | 9.73 | 56 | 103 | 14.2 | 25.9
1,000 | 1,000 | 512 | 520,000,000 | 2,054,669,824 | 1,534.7 | 30.04 | 16.09 | 33 | 62 | 16.5 | 30.8
1,000 | 10,000 | 4 | 120,000,000 | 439,591,680 | 32.0 | 165.68 | 231.13 | 60 | 43 | 0.7 | 0.5
1,000 | 10,000 | 8 | 160,000,000 | 335,977,216 | 17.6 | 176.95 | 232.51 | 57 | 43 | 0.9 | 0.7
1,000 | 10,000 | 16 | 240,000,000 | 2,128,638,976 | 188.9 | 261.68 | 298.52 | 38 | 33 | 0.9 | 0.8
1,000 | 10,000 | 32 | 400,000,000 | 999,380,736 | 59.9 | 276.85 | 305.10 | 36 | 33 | 1.4 | 1.3
1,000 | 10,000 | 64 | 720,000,000 | 2,531,212,800 | 181.1 | 394.35 | 283.12 | 25 | 35 | 1.7 | 2.4
1,000 | 50,000 | 4 | 600,000,000 | 2,605,504,768 | 40.1 | 2,980.22 | 2,522.90 | 17 | 20 | 0.2 | 0.2

These measurements suggest the TreeDB doesn't perform well when the amount of data exceeds the memory cache.

CEDA LSS with RAM disk

num txns | num records per txn | mapped value size (bytes) | total data (bytes) | database size (bytes) | space overhead per record (bytes) | write data time (s) | read data time (s) | write record rate (kops/s) | read record rate (kops/s) | effective write rate (MB/s) | effective read rate (MB/s)
1,000 | 1,000 | 4 | 12,000,000 | 25,690,112 | 13.7 | 0.46 | 0.40 | 2,191 | 2,481 | 25.1 | 28.4
1,000 | 1,000 | 8 | 16,000,000 | 29,360,128 | 13.4 | 0.46 | 0.41 | 2,157 | 2,450 | 32.9 | 37.4
1,000 | 1,000 | 16 | 24,000,000 | 37,748,736 | 13.8 | 0.47 | 0.41 | 2,139 | 2,438 | 49.0 | 55.8
1,000 | 1,000 | 32 | 40,000,000 | 53,477,376 | 13.5 | 0.51 | 0.42 | 1,980 | 2,370 | 75.5 | 90.4
1,000 | 1,000 | 64 | 72,000,000 | 85,458,944 | 13.5 | 0.53 | 0.45 | 1,870 | 2,209 | 128.4 | 151.7
1,000 | 1,000 | 128 | 136,000,000 | 149,422,080 | 13.4 | 0.60 | 0.49 | 1,679 | 2,024 | 217.8 | 262.5
1,000 | 1,000 | 256 | 264,000,000 | 277,348,352 | 13.4 | 0.81 | 0.60 | 1,235 | 1,658 | 311.0 | 417.5
1,000 | 1,000 | 512 | 520,000,000 | 533,725,184 | 13.7 | 1.16 | 0.83 | 862 | 1,200 | 427.4 | 595.2
1,000 | 1,000 | 1,024 | 1,032,000,000 | 1,045,430,272 | 13.4 | 1.53 | 0.96 | 655 | 1,037 | 644.3 | 1,020.2
1,000 | 1,000 | 2,048 | 2,056,000,000 | 2,069,364,736 | 13.4 | 2.28 | 1.22 | 439 | 821 | 861.5 | 1,609.2
1,000 | 1,000 | 4,096 | 4,104,000,000 | 4,117,757,952 | 13.8 | 3.80 | 1.75 | 263 | 572 | 1,029.9 | 2,239.3
1,000 | 10,000 | 4 | 120,000,000 | 251,658,240 | 13.2 | 4.61 | 4.32 | 2,167 | 2,314 | 24.8 | 26.5
1,000 | 10,000 | 8 | 160,000,000 | 292,028,416 | 13.2 | 4.68 | 4.17 | 2,139 | 2,399 | 32.6 | 36.6
1,000 | 10,000 | 16 | 240,000,000 | 371,720,192 | 13.2 | 4.73 | 4.38 | 2,116 | 2,284 | 48.4 | 52.3
1,000 | 10,000 | 32 | 400,000,000 | 531,628,032 | 13.2 | 4.92 | 4.50 | 2,031 | 2,224 | 77.5 | 84.8
1,000 | 10,000 | 64 | 720,000,000 | 851,968,000 | 13.2 | 5.18 | 4.47 | 1,929 | 2,236 | 132.5 | 153.5
1,000 | 10,000 | 128 | 1,360,000,000 | 1,491,599,360 | 13.2 | 5.63 | 4.76 | 1,776 | 2,101 | 230.4 | 272.5
1,000 | 10,000 | 256 | 2,640,000,000 | 2,771,910,656 | 13.2 | 7.12 | 5.43 | 1,404 | 1,842 | 353.4 | 463.6
1,000 | 100,000 | 4 | 1,200,000,000 | 2,515,009,536 | 13.2 | 45.81 | 41.96 | 2,183 | 2,383 | 25.0 | 27.3
1,000 | 100,000 | 8 | 1,600,000,000 | 2,915,041,280 | 13.2 | 46.38 | 41.81 | 2,156 | 2,392 | 32.9 | 36.5
1,000 | 100,000 | 16 | 2,400,000,000 | 3,715,104,768 | 13.2 | 46.81 | 42.20 | 2,136 | 2,370 | 48.9 | 54.2

The CEDA LSS scales very well in this test, even though the database size well exceeds the size of the LSS segment cache. For 4, 8 and 16 byte mapped values the write and read rates are essentially independent of the number of records, from 1M to 100M:

num records | 4 byte value write rate (MB/s) | 4 byte value read rate (MB/s) | 8 byte value write rate (MB/s) | 8 byte value read rate (MB/s) | 16 byte value write rate (MB/s) | 16 byte value read rate (MB/s)
1,000,000 | 25.1 | 28.4 | 32.9 | 37.4 | 49.0 | 55.8
10,000,000 | 24.8 | 26.5 | 32.6 | 36.6 | 48.4 | 52.3
100,000,000 | 25.0 | 27.3 | 32.9 | 36.5 | 48.9 | 54.2

CEDA B+Tree with RAM disk

num txns | num records per txn | mapped value size (bytes) | total data (bytes) | database size (bytes) | space overhead per record (bytes) | write data time (s) | read data time (s) | write record rate (kops/s) | read record rate (kops/s) | effective write rate (MB/s) | effective read rate (MB/s)
1,000 | 1,000 | 4 | 12,000,000 | 12,582,912 | 0.6 | 0.20 | 0.11 | 5,111 | 9,003 | 58.5 | 103.0
1,000 | 1,000 | 8 | 16,000,000 | 16,252,928 | 0.3 | 0.20 | 0.11 | 5,020 | 8,723 | 76.6 | 133.1
1,000 | 1,000 | 16 | 24,000,000 | 24,641,536 | 0.6 | 0.21 | 0.12 | 4,679 | 8,017 | 107.1 | 183.5
1,000 | 1,000 | 32 | 40,000,000 | 40,370,176 | 0.4 | 0.25 | 0.15 | 4,034 | 6,823 | 153.9 | 260.3
1,000 | 1,000 | 64 | 72,000,000 | 72,351,744 | 0.4 | 0.33 | 0.19 | 3,034 | 5,351 | 208.3 | 367.4
1,000 | 1,000 | 128 | 136,000,000 | 136,314,880 | 0.3 | 0.45 | 0.27 | 2,240 | 3,728 | 290.5 | 483.5
1,000 | 1,000 | 256 | 264,000,000 | 264,241,152 | 0.2 | 0.68 | 0.41 | 1,468 | 2,454 | 369.7 | 617.8
1,000 | 1,000 | 512 | 520,000,000 | 520,617,984 | 0.6 | 1.15 | 0.72 | 870 | 1,382 | 431.3 | 685.1
1,000 | 1,000 | 1,024 | 1,032,000,000 | 1,032,323,072 | 0.3 | 2.28 | 1.29 | 438 | 775 | 431.4 | 762.4
1,000 | 1,000 | 2,048 | 2,056,000,000 | 2,056,257,536 | 0.3 | 4.42 | 2.62 | 226 | 382 | 443.5 | 748.7
1,000 | 1,000 | 4,096 | 4,104,000,000 | 4,104,650,752 | 0.7 | 9.00 | 6.68 | 111 | 150 | 435.0 | 585.8
1,000 | 10,000 | 4 | 120,000,000 | 121,110,528 | 0.1 | 2.27 | 1.49 | 4,409 | 6,722 | 50.5 | 76.9
1,000 | 10,000 | 8 | 160,000,000 | 160,956,416 | 0.1 | 2.33 | 1.53 | 4,292 | 6,537 | 65.5 | 99.8
1,000 | 10,000 | 16 | 240,000,000 | 240,648,192 | 0.1 | 2.47 | 1.62 | 4,047 | 6,182 | 92.6 | 141.5
1,000 | 10,000 | 32 | 400,000,000 | 401,080,320 | 0.1 | 2.82 | 2.06 | 3,548 | 4,844 | 135.3 | 184.8
1,000 | 10,000 | 64 | 720,000,000 | 720,896,000 | 0.1 | 3.45 | 2.20 | 2,897 | 4,556 | 198.9 | 312.8
1,000 | 10,000 | 128 | 1,360,000,000 | 1,361,051,648 | 0.1 | 4.77 | 2.94 | 2,096 | 3,400 | 271.8 | 440.9
1,000 | 10,000 | 256 | 2,640,000,000 | 2,640,838,656 | 0.1 | 9.03 | 4.80 | 1,108 | 2,085 | 278.9 | 525.0
1,000 | 100,000 | 4 | 1,200,000,000 | 1,205,862,400 | 0.1 | 29.01 | 20.78 | 3,447 | 4,812 | 39.4 | 55.1
1,000 | 100,000 | 8 | 1,600,000,000 | 1,605,894,144 | 0.1 | 28.82 | 21.36 | 3,470 | 4,682 | 53.0 | 71.5
1,000 | 100,000 | 16 | 2,400,000,000 | 2,405,957,632 | 0.1 | 40.68 | 33.56 | 2,458 | 2,980 | 56.3 | 68.2
1,000 | 100,000 | 32 | 4,000,000,000 | 4,006,084,608 | 0.1 | 40.93 | 26.43 | 2,443 | 3,784 | 93.2 | 144.3
1,000 | 300,000 | 4 | 3,600,000,000 | 3,617,062,912 | 0.1 | 87.10 | 55.47 | 3,444 | 5,408 | 39.4 | 61.9

Side by side comparisons on RAM disk for 1M records

The following comparisons are between the CEDA LSS, the CEDA B+Tree, the Kyoto HashDB and the Kyoto TreeDB on the same machine, built with the same compiler, for 1,000 transactions and 1,000,000 records in total, on the 4GB RAM disk.

Note that there is no data for Kyoto TreeDB for mapped values of 1024, 2048 and 4096 bytes because in those cases the 4GB RAM disk wasn't large enough for the database.

Write octets rate

Comparison of the effective write rate in MB/second (i.e. the rate at which real data in the keys and values is written).

[Plot: effective write rate, CEDA vs Kyoto HashDB/TreeDB]

Write records rate

Comparison of the rate at which records are written in units of 1000 records/second.

[Plot: write record rate, CEDA vs Kyoto HashDB/TreeDB]

Read octets rate

Comparison of the effective read rate in MB/second.

[Plot: effective read rate, CEDA vs Kyoto HashDB/TreeDB]

Read records rate

Comparison of the rate at which records are read in units of 1000 records/second.

[Plot: read record rate, CEDA vs Kyoto HashDB/TreeDB]

Space overhead

These plots show the average space overhead in bytes per key-value pair (i.e. (database size - total data) / number of records) for different sizes of the mapped value.

[Plot: disk space overhead per record, CEDA vs Kyoto HashDB/TreeDB]

Excluding the Kyoto TreeDB allows the space overheads of the other databases to be seen more clearly. The space overhead for the CEDA LSS is between 13 and 14 bytes per record, and less than one byte per record for the CEDA B+Tree.

[Plot: disk space overhead per record, CEDA vs Kyoto HashDB, excluding Kyoto TreeDB]