{"id":111190,"date":"2024-05-29T15:54:32","date_gmt":"2024-05-29T15:54:32","guid":{"rendered":"https:\/\/foojay.io\/?p=111190"},"modified":"2024-12-30T10:55:24","modified_gmt":"2024-12-30T10:55:24","slug":"indexing-all-of-wikipedia-on-a-laptop","status":"publish","type":"post","link":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/","title":{"rendered":"Indexing all of Wikipedia, on a laptop"},"content":{"rendered":"\n    <div class=\"article__table\">\n        <div class=\"article__table-header\">\n            <svg width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                <path d=\"M8 6H21\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                <path d=\"M8 12H21\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                <path d=\"M8 18H21\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                <path d=\"M3 6H3.01\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                <path d=\"M3 12H3.01\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                <path d=\"M3 18H3.01\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n            <\/svg>\n            Table of Contents\n            <svg class=\"chevron\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                <path d=\"M18 15L12 9L6 15\" stroke=\"#3562E5\" stroke-width=\"1.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\/>\n            <\/svg>\n        <\/div>\n        <div class=\"article__table-body\"><span><a href=\"#h2-0--ompression-parameters\">Compression parameters<\/a><\/span><span><a 
href=\"#h2-1--raph-ndex-uilder\">GraphIndexBuilder<\/a><\/span><span><a href=\"#h2-2--hronicle-ap-and-ow-ata\">Chronicle Map and RowData<\/a><\/span><span><a href=\"#h2-3--ngesting-the-data\">Ingesting the data<\/a><\/span><span><a href=\"#h2-4--oading-the-index-after-construction-\">Loading the index (after construction)<\/a><\/span><span><a href=\"#h2-5--erforming-a-search\">Performing a search<\/a><\/span><\/div><\/div><!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN\" \"http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd\">\n<?xml encoding=\"utf-8\" ?><html><body>\n\n\n\n<p>In November, <a target=\"_blank\" href=\"https:\/\/huggingface.co\/datasets\/Cohere\/wikipedia-2023-11-embed-multilingual-v3\">Cohere released a dataset 
containing all of Wikipedia<\/a>, chunked and embedded to vectors with <a target=\"_blank\" href=\"https:\/\/cohere.com\/blog\/introducing-embed-v3\">their multilingual-v3 model<\/a>. <\/p>\n\n\n\n<p>Computing this many embeddings yourself would cost in the neighborhood of $5000, so the public release of this dataset makes creating <a target=\"_blank\" href=\"https:\/\/www.datastax.com\/guides\/what-is-vector-search\">a semantic, vector-based index<\/a> of Wikipedia practical for an individual for the first time.<\/p>\n\n\n\n<p>Here&rsquo;s what we&rsquo;re building:&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/ydeHYk97v6Bza1GF0wbbHUEzxgCAJLfwbRcVnWvUP6QDPKKY5YQH00Dvi2n6VgkioW_PGqwckcCnQu9cJ2nOz2XSuL_27HNPAAbZdv2vXPOy_vUJ_Vcg-ii83E4jaqMycskzmzt8wBP1XsOYh5b7Cv4\" alt=\"\"><\/figure>\n\n\n\n<p>You can try searching the completed index <a target=\"_blank\" href=\"https:\/\/jvectordemo.com:8443\/\">on a public demo instance here<\/a>.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Why this is hard<\/h1>\n\n\n\n<p>Sure, the dataset is big (180GB for the English corpus), but that&rsquo;s not the obstacle per se.&nbsp; We&rsquo;ve been able to build full-text indexes on larger datasets for a long time.<\/p>\n\n\n\n<p>The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction.&nbsp; Larger datasets could be split into <a target=\"_blank\" href=\"https:\/\/stackoverflow.com\/questions\/2703432\/what-are-segments-in-lucene\">segments<\/a>, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall.&nbsp; (In their latest release, <a target=\"_blank\" 
href=\"https:\/\/www.elastic.co\/search-labs\/blog\/elasticsearch-lucene-vector-database-gains\">Lucene attempts to mitigate this by processing segments in parallel with multiple threads<\/a>, but obviously (1) this only gives you a constant factor of improvement before you run out of CPU cores and (2) this does not improve throughput.)<\/p>\n\n\n\n<p>Specifically, if you&rsquo;re indexing 1536-dimension vectors (the size of ada002 or openai-v3-small), then you can fit about 5M vectors and their associated edge lists in a 32GB index construction RAM budget.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/-BwVEUQqMIDekxlKXgiuOiQcycoM_fP3ncRjVNgRD7W7SzcsTigI4tjsmE-S4x35PIgEpwNxVioZxD50ah2PzQXuVCo22TXiI80EFpjpnCf4X-JjTPBb6FqVC4CJFdrYkoG6aLYxFNotM_MX_NoIpDk\" alt=\"\"><\/figure>\n\n\n\n<p><a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/jvector\/\">JVector<\/a>, the library that powers <a target=\"_blank\" href=\"https:\/\/www.datastax.com\/products\/datastax-astra\">DataStax Astra<\/a> vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors.&nbsp; This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Requirements<\/h1>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Linux or MacOS.&nbsp; It will not work on Windows because ChronicleMap, which we are going to use for the non-vector data, is limited to a 4GB size there.&nbsp; (If you are interested enough, you could shard the Map by vector id to keep each shard under 4GB and still have O(1) lookup times.)<\/li>\n\n\n\n<li>About 180GB of free space for the dataset, and 90GB for the completed index.<\/li>\n\n\n\n<li>Enough RAM to run a JVM with 36GB of heap space during construction (~28GB for the index, 8GB for GC 
headroom).<\/li>\n\n\n\n<li>Disable swap before building the index.&nbsp; Linux will aggressively try to cache the index being constructed to the point of swapping out parts of the JVM heap, which is obviously counterproductive.&nbsp; In my test, building with swap enabled was almost twice as slow as with it off.<\/li>\n<\/ol>\n\n\n\n<h1 class=\"wp-block-heading\">Building and searching the index<\/h1>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Check out the project:<br>$ git clone <a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/coherepedia-jvector\">https:\/\/github.com\/jbellis\/coherepedia-jvector<br><\/a>$ cd coherepedia-jvector<\/li>\n\n\n\n<li>Edit <em>config.properties<\/em> to set the locations for the dataset and the index.&nbsp;<\/li>\n\n\n\n<li>Run <em>pip install datasets<\/em>.&nbsp; (Setting up a <a target=\"_blank\" href=\"https:\/\/docs.python.org\/3\/library\/venv.html\">venv<\/a> or conda environment first is recommended but not strictly necessary.)<\/li>\n\n\n\n<li>Run <em>python download.py.&nbsp; <\/em>This downloads the 180 GB dataset to the location you configured.&nbsp; For me that took about half an hour.<\/li>\n\n\n\n<li>Run <em>.\/mvnw compile exec:exec@buildindex.<\/em>&nbsp; This took about 5 and a half hours on my machine (with an i9-12900 CPU).<\/li>\n\n\n\n<li>Run <em>.\/mvnw compile exec:exec@serve <\/em>and open a browser to <a target=\"_blank\" href=\"http:\/\/localhost:4567\/\">http:\/\/localhost:4567<\/a>.&nbsp; Search away!<\/li>\n<\/ol>\n\n\n\n<h1 class=\"wp-block-heading\">How it works<\/h1>\n\n\n\n<p>We&rsquo;re using <a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/jvector\">JVector<\/a> for the vector index and <a target=\"_blank\" href=\"https:\/\/github.com\/OpenHFT\/Chronicle-Map\">Chronicle Map<\/a> for the article data.&nbsp; There are <a target=\"_blank\" href=\"https:\/\/github.com\/OpenHFT\/Chronicle-Map\/issues\/533\">several<\/a> <a target=\"_blank\" 
href=\"https:\/\/github.com\/OpenHFT\/Chronicle-Map\/issues\/537\">things<\/a> I don&rsquo;t love about Chronicle Map, but nothing else touches it for simple disk-based key\/value performance.<\/p>\n\n\n\n<p>The full source of the index construction class is <a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/coherepedia-jvector\/blob\/master\/src\/main\/java\/io\/github\/jbellis\/BuildIndex.java\">here<\/a>.&nbsp; I&rsquo;ll walk through it piece by piece.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-0--ompression-parameters\">Compression parameters<\/h2>\n\n\n\n<p>JVector is based on the <a target=\"_blank\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node\/\">DiskANN<\/a> vector index design, which performs an initial search in memory using vectors lossily compressed with <a target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/similarity-search-product-quantization-b2a1a6397701\">product quantization (PQ)<\/a>, then reranks the results using high-resolution vectors from disk.&nbsp; However, while DiskANN stores full, uncompressed vectors to perform reranking, JVector is able to improve on that using <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2402.02044\">Locally-adaptive Vector Quantization (LVQ)<\/a> compression.<\/p>\n\n\n\n<p>To set this up, we&rsquo;ll first load some vectors into a RandomAccessVectorValues (RAVV) instance.&nbsp; RAVV is a JVector interface for a vector container; it could be List or Map based, in-memory or on-disk.&nbsp; In this case we&rsquo;ll use a simple List-backed RAVV.&nbsp; We&rsquo;ll compute the parameters for both compressions (k-means clustering for PQ, global mean for LVQ) from a single shard of the dataset.&nbsp; At about 110k rows, this is enough data to have a statistically valid sample.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh7-us.googleusercontent.com\/aN200b1SULDTwrg5inwDuNFKLCVyYstVuYOSXLqAos2D_psAoMp8V5CXjXDKCEKcCZc5JyM7U27qg7LPp14mfQh9nktRzXaXE4pteHFINO-HPS_xxW4ESxf1glxanb5gG2xoAmx1r2qaiReZXcFI--4\" alt=\"\"><\/figure>\n\n\n\n<p>Next, we compute the PQ compression codebook; we&rsquo;re compressing the vectors by a factor of 64, because the Cohere v3 embeddings can be PQ-compressed that much without losing accuracy, after reranking.&nbsp; <a target=\"_blank\" href=\"https:\/\/thenewstack.io\/why-vector-size-matters\/\">Binary Quantization only gives us 32x compression and is less accurate<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/PS0HlbtZNajjlTe9AFg1yoW7fvGyKeSpHGwfk3_k5dHs08QOkTphXeO03AO2Chx-mxw5lV2wD81xo3lNGB9raJojFYrg6z2-OTIA05fUfVHzpGIM12R-veeTPLirOhjGTvcM-Uch31c5SZmgGDbiIrM\" alt=\"\"><\/figure>\n\n\n\n<p>Finally, we need to set up LVQ.&nbsp; LVQ gives us 4x compression while losing no measurable accuracy over the full uncompressed vectors, resulting in both a smaller footprint on disk and faster searches.&nbsp; (I thank the vector search team at Intel Research for pointing this out to us.)<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/XheHrYXEE6j_GaROcmgI_0-OFJx9GJes1uVcEGDcYFUvi0Gu3ZqXgpqV38iMbxL25JvCmIcFRsxG8EoqZ2aT332JWYAwSeRHnKPzY-un5LO2eun1Eio0ZTya312IXv_AV1xJ88HUT6Fxb96uNtFokGU\" alt=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-1--raph-ndex-uilder\">GraphIndexBuilder<\/h2>\n\n\n\n<p>Next, we need to instantiate and configure our GraphIndexBuilder.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/veV6oVgpkDyr-WLPIMzzTtHD0q8MIT3sQxOauqdXwzXExFBQ2FD9btPpVXf-DTuk0OEJAVWpHf6IduBDIiyGSyDwdsEICTyoTjUocG7PgkxIRiMIpIRPpGjiSFoKm9Z-B0vOU4uYRtPsew1Oi3f_bis\" alt=\"\"><\/figure>\n\n\n\n<p>This instantiates a JVector GraphIndexBuilder and 
connects it to an OnDiskGraphIndexWriter.&nbsp; Through the BuildScoreProvider, it also tells the builder to use the PQ-compressed vectors list (which starts empty and will grow as we add vectors to the index) during construction.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-2--hronicle-ap-and-ow-ata\">Chronicle Map and RowData<\/h2>\n\n\n\n<p>We&rsquo;ll store article contents in RowData records.&nbsp; This content is what has been encoded as the corresponding vector in the dataset, and is what we want to return to the user in our search results.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/veVvO8QUrY_k_YGDwavo_dBaIoM5ZGGfaN5dowCroJAgJv-37JZIWq0jX78rY0R8g6wvRO1QxvTv-dMuEVMJRvmrvdbmLAHlBJqUd9yoyIXD0DADDlZQXyZcyLPcp-F4zAcRb1obXtvJO6d4oXTGD7M\" alt=\"\"><\/figure>\n\n\n\n<p>To turn the vector index&rsquo;s search results (a list of integer vector ids) into RowData, we store the RowData in a Map keyed by the vector id.&nbsp; This will be a lot of data, so we use <a target=\"_blank\" href=\"https:\/\/github.com\/OpenHFT\/Chronicle-Map\">ChronicleMap<\/a> to store this on disk with a minimal in-memory footprint.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/hFMjxcQstglWY0IjbgKqiHB9dk7KlKATQnBIBLZh_hGvdsuo6_UDQi8ydn3RA0ELYpJlng0HERqxUG1nmpj5HRNFPRhIHhhOtnC6vc7XHsIZnwI-fcyRK8gNnPeKpLUUQNVjGnK9EP1RHU6UPs0SLsw\" alt=\"\"><\/figure>\n\n\n\n<p>We need to tell ChronicleMap how large it&rsquo;s going to be, both in entry count and entry size.&nbsp; Undersizing these will cause it to crash (<a target=\"_blank\" href=\"https:\/\/github.com\/OpenHFT\/Chronicle-Map\/issues\/533\">my primary complaint<\/a> about ChronicleMap), so we deliberately use a high estimate.<\/p>\n\n\n\n<p>We <em>do not<\/em> need to explicitly tell ChronicleMap how to read and write RowData objects; instead, we just have RowData implement Serializable.&nbsp; While ChronicleMap supports 
custom de\/serialize code, it&rsquo;s perfectly happy to use simple out-of-the-box serialization and since profiling shows that&rsquo;s not a bottleneck for us we&rsquo;ll just leave it at that.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-3--ngesting-the-data\">Ingesting the data<\/h2>\n\n\n\n<p>We use Java&rsquo;s parallel Streams to process the shards in parallel.&nbsp; For each row in each shard, we<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Add it to <em>pqVectorsList<\/em><\/li>\n\n\n\n<li>Call <em>writer.writeInline<\/em> to add the LVQ-compressed vector to disk<\/li>\n\n\n\n<li>Call <em>builder.addGraphNode <\/em>&ndash; order is important because both (1) and (2) are used when we call addGraphNode<\/li>\n\n\n\n<li>Call <em>contentMap.put<\/em> with the article chunk data.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/5souUR9e_gbEEdcUwWvq8_cjleyFGglaQaCSV-XFkv-3Ij7cGYgd13UcyGdwIYE6Xw5zD4WiFSxGO1phrEK8w6UWx6BanZVWXQ4oBnkHdh6aEFB4DIllhK15HjZJ9iJyQKV5ts9QTLQqF3uufChXdPE\" alt=\"\"><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/_Z7fqvsbvQ2kXJY286iM-ysvQJCHwlaWgmBACBFZfSOscrSbGtYMkGOlKbA5cWuaB4-M_aN1Y6idM1pWEEvUvQJh-22d71eAxAR5hxZNwBy1dedc1DIApiKTfSpLQMsIIzN3yozbSPProh5TJT_TkvU\" alt=\"\"><\/figure>\n\n\n\n<p>You can look at <a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/coherepedia-jvector\/blob\/master\/src\/main\/java\/io\/github\/jbellis\/BuildIndex.java\">the full source<\/a> if you&rsquo;re curious about <em>forEachRow<\/em>, it&rsquo;s just standard &ldquo;pull data out of Arrow&rdquo; stuff.<\/p>\n\n\n\n<p>When the build completes, you should see files like this:<\/p>\n\n\n\n<p>$ ls -lh ~\/coherepedia<\/p>\n\n\n\n<p>-rw-rw-r-- 1 jonathan jonathan&nbsp; 48G May 20 15:53 coherepedia.ann<\/p>\n\n\n\n<p>-rw-rw-r-- 1 jonathan jonathan&nbsp; 36G May 20 18:05 
coherepedia.map<\/p>\n\n\n\n<p>-rw-rw-r-- 1 jonathan jonathan 2.5G May 20 15:53 coherepedia.pqv<\/p>\n\n\n\n<p>-rw-rw-r-- 1 jonathan jonathan 4.1K May 17 23:04 coherepedia.lvq<\/p>\n\n\n\n<p>-rw-rw-r-- 1 jonathan jonathan 1.1M May 17 23:04 coherepedia.pq<\/p>\n\n\n\n<p>These are respectively<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ANN: the vector index, containing the edge lists and LVQ-compressed vectors for reranking.<\/li>\n\n\n\n<li>MAP: the map containing article data indexed by vector id.<\/li>\n\n\n\n<li>PQV: PQ-compressed vectors, which are read into memory and used for the approximate search pass.<\/li>\n\n\n\n<li>LVQ: the LVQ global mean, used during construction.<\/li>\n\n\n\n<li>PQ: the PQ codebooks, used during construction.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-4--oading-the-index-after-construction-\">Loading the index (after construction)<\/h2>\n\n\n\n<p>The code for serving queries is found in the <a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/coherepedia-jvector\/blob\/master\/src\/main\/java\/io\/github\/jbellis\/WebSearch.java\">WebSearch<\/a> class.&nbsp; We&rsquo;re using Spark (<a target=\"_blank\" href=\"https:\/\/sparkjava.com\/\">the web framework<\/a>, not the big data engine) to serve a simple search form:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/a-y2F-t0K9ph-4-0ERwSLy7-xhLDMQZD1qz7FU8tDPvj6w1MUhkhznWEksElvPh_1twzn68B8nD6q6wheKlAqxUyyghNhPmxDEs69fYiKTKEtILwwuFhSPNmsDVhS395kDu3hlggzUQIKtG0S_PJxRw\" alt=\"\"><\/figure>\n\n\n\n<p>Construction needed a relatively large heap to keep the edge lists in memory.&nbsp; With that complete, we only need enough to keep the PQ-compressed vectors in memory; <em>exec@serve <\/em>is configured to use a 4GB heap.<\/p>\n\n\n\n<p>WebSearch (<a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/coherepedia-jvector\/blob\/master\/src\/main\/java\/io\/github\/jbellis\/WebSearch.java\">the 
class behind <em>exec@serve<\/em><\/a>) first has a static initializer to load the PQ vectors and open the ChronicleMap.&nbsp; We also create a reusable GraphSearcher instance:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/ofHaU8px5jnF0FCupz_mJt4CMc1Bg8Lul36DcScuviM3IPj8UnL7FKD-TMnUh3Lyn41n0Krn_FoooHNaJjf_112xF44SZk9BPe5O-74tuF8VwrmCVEeB571RGYJ-DILbeq4qGFN1MHZQaqxI6v-U8Bw\" alt=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h2-5--erforming-a-search\">Performing a search<\/h2>\n\n\n\n<p>Executing a search and turning it into RowData for the user looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/-bY_lTq_EAXUwfuP_MEsLYcuwwdx13wKCCYAAL83KaxSQ1x8VBbAjlbqGWCxL998vVAlBfEmOxTXZIRkJp8-uTLb0FLXrvCdGICWggC13UKXPCjBq42D5guoHk5IvShjzgpf1IvD2JYcQiIJnyDKedo\" alt=\"\"><\/figure>\n\n\n\n<p>There are four &ldquo;paragraphs&rdquo; of code here, containing<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The call to <em>getVectorEmbedding<\/em>.&nbsp; This calls Cohere&rsquo;s API to turn the search query (a String) into a vector embedding.<\/li>\n\n\n\n<li>Creating approximate and reranking score functions.&nbsp; Approximate scoring is done through our product quantization, and reranking is done with the LVQ vectors in the index.&nbsp; Since the LVQ vectors are encapsulated in the index itself, we never need to explicitly deal with LVQ decoding; the index does it for us.<\/li>\n\n\n\n<li>The call to <em>searcher.search <\/em>that actually does the query, and finally<\/li>\n\n\n\n<li>Retrieving the RowData associated with the top vector neighbors using <em>contentMap<\/em>.<\/li>\n<\/ol>\n\n\n\n<p>That&rsquo;s it!&nbsp; We&rsquo;ve indexed all of Wikipedia with high performance, parallel code in about 150 loc, and created a simple search server in another 100.<\/p>\n\n\n\n<p>On my machine, searches (which each run in a single 
thread) take about 50ms.&nbsp; We would expect it to take over twice as long if this were split across multiple segments.&nbsp; We would also expect it to lose significant accuracy if searches were performed only with compressed vectors without reranking.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Conclusion<\/h1>\n\n\n\n<p>Indexing the entirety of English Wikipedia on a laptop has become a practical reality thanks to recent advances in the JVector library that will be part of the imminent 3.0 release.&nbsp; (<a target=\"_blank\" href=\"https:\/\/github.com\/jbellis\/jvector\">Star the repo<\/a> and stand by!)&nbsp; This article demonstrates how to do exactly that using JVector in conjunction with Chronicle Map, while also showcasing the use of <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2402.02044\">LVQ<\/a> to reduce index size while preserving <a target=\"_blank\" href=\"https:\/\/thenewstack.io\/why-vector-size-matters\/\">accurate reranking<\/a>.<\/p>\n\n\n\n<p>To take advantage of the power of JVector alongside powerful indexing for non-vector data, rolled into a document api with support for realtime inserts, updates, and deletes, check out the <a target=\"_blank\" href=\"https:\/\/www.datastax.com\/products\/datastax-astra\">DataStax Astra<\/a> service.<\/p>\n\n\n\n<p>Enjoy hacking with JVector and Astra!<\/p>\n\n\n\n\n<\/body><\/html>\n","protected":false},"excerpt":{"rendered":"<p>Indexing the entirety of English Wikipedia on a laptop has become a practical reality thanks to recent advances in the JVector library that will be part of the imminent 3.0 
release.<\/p>\n","protected":false},"author":191,"featured_media":111191,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[565,198,220],"tags":[33,2014,35,403,2015,637],"class_list":["post-111190","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-datastax","category-performance","category-tools","tag-java","tag-jvector","tag-openjdk","tag-performance","tag-search","tag-wikipedia"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Indexing all of Wikipedia, on a laptop<\/title>\n<meta name=\"description\" content=\"foojay is the place for all OpenJDK Update Release Information. Learn More.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Indexing all of Wikipedia, on a laptop\" \/>\n<meta property=\"og:description\" content=\"foojay is the place for all OpenJDK Update Release Information. 
Learn More.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/\" \/>\n<meta property=\"og:site_name\" content=\"foojay\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-29T15:54:32+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-12-30T10:55:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1416\" \/>\n\t<meta property=\"og:image:height\" content=\"1600\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Jonathan Ellis\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jonathan Ellis\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/\"},\"author\":{\"name\":\"Jonathan Ellis\",\"@id\":\"https:\\\/\\\/foojay.io\\\/#\\\/schema\\\/person\\\/5b4166dfd492b4cb9fea3f917af84c8a\"},\"headline\":\"Indexing all of Wikipedia, on a 
laptop\",\"datePublished\":\"2024-05-29T15:54:32+00:00\",\"dateModified\":\"2024-12-30T10:55:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/\"},\"wordCount\":1698,\"commentCount\":8,\"publisher\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2024\\\/05\\\/wikiindex.png\",\"keywords\":[\"Java\",\"JVector\",\"OpenJDK\",\"performance\",\"Search\",\"wikipedia\"],\"articleSection\":[\"DataStax\",\"Performance\",\"Tools\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/\",\"url\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/\",\"name\":\"Indexing all of Wikipedia, on a laptop\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2024\\\/05\\\/wikiindex.png\",\"datePublished\":\"2024-05-29T15:54:32+00:00\",\"dateModified\":\"2024-12-30T10:55:24+00:00\",\"description\":\"foojay is the place for all OpenJDK Update Release Information. 
Learn More.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#primaryimage\",\"url\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2024\\\/05\\\/wikiindex.png\",\"contentUrl\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2024\\\/05\\\/wikiindex.png\",\"width\":1416,\"height\":1600},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/foojay.io\\\/today\\\/indexing-all-of-wikipedia-on-a-laptop\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/foojay.io\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Indexing all of Wikipedia, on a laptop\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/foojay.io\\\/#website\",\"url\":\"https:\\\/\\\/foojay.io\\\/\",\"name\":\"foojay\",\"description\":\"a place for friends of 
OpenJDK\",\"publisher\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/foojay.io\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/foojay.io\\\/#organization\",\"name\":\"foojay\",\"url\":\"https:\\\/\\\/foojay.io\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/foojay.io\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/cropped-Favicon.png\",\"contentUrl\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/cropped-Favicon.png\",\"width\":512,\"height\":512,\"caption\":\"foojay\"},\"image\":{\"@id\":\"https:\\\/\\\/foojay.io\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/foojay2020\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/foojay.io\\\/#\\\/schema\\\/person\\\/5b4166dfd492b4cb9fea3f917af84c8a\",\"name\":\"Jonathan Ellis\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/jbellis-headshot-96x96.png\",\"url\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/jbellis-headshot-96x96.png\",\"contentUrl\":\"https:\\\/\\\/foojay.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/jbellis-headshot-96x96.png\",\"caption\":\"Jonathan Ellis\"},\"description\":\"Jonathan is the founder of Brokk (https:\\\/\\\/brokk.ai). Brokk keeps LLMs on-task in million-line codebases by adding compiler-grade understanding of your code's structure and semantics. 
Jonathan is also the author of JVector, co-founder of DataStax, and the founding project chair of Apache Cassandra.\",\"sameAs\":[\"https:\\\/\\\/brokk.ai\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/jbellis\\\/\",\"https:\\\/\\\/x.com\\\/spyced\",\"https:\\\/\\\/www.youtube.com\\\/@Brokk_AI\"],\"url\":\"https:\\\/\\\/foojay.io\\\/today\\\/author\\\/jbellis\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Indexing all of Wikipedia, on a laptop","description":"foojay is the place for all OpenJDK Update Release Information. Learn More.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/","og_locale":"en_US","og_type":"article","og_title":"Indexing all of Wikipedia, on a laptop","og_description":"foojay is the place for all OpenJDK Update Release Information. Learn More.","og_url":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/","og_site_name":"foojay","article_published_time":"2024-05-29T15:54:32+00:00","article_modified_time":"2024-12-30T10:55:24+00:00","og_image":[{"width":1416,"height":1600,"url":"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png","type":"image\/png"}],"author":"Jonathan Ellis","twitter_misc":{"Written by":"Jonathan Ellis","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#article","isPartOf":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/"},"author":{"name":"Jonathan Ellis","@id":"https:\/\/foojay.io\/#\/schema\/person\/5b4166dfd492b4cb9fea3f917af84c8a"},"headline":"Indexing all of Wikipedia, on a laptop","datePublished":"2024-05-29T15:54:32+00:00","dateModified":"2024-12-30T10:55:24+00:00","mainEntityOfPage":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/"},"wordCount":1698,"commentCount":8,"publisher":{"@id":"https:\/\/foojay.io\/#organization"},"image":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#primaryimage"},"thumbnailUrl":"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png","keywords":["Java","JVector","OpenJDK","performance","Search","wikipedia"],"articleSection":["DataStax","Performance","Tools"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/","url":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/","name":"Indexing all of Wikipedia, on a laptop","isPartOf":{"@id":"https:\/\/foojay.io\/#website"},"primaryImageOfPage":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#primaryimage"},"image":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#primaryimage"},"thumbnailUrl":"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png","datePublished":"2024-05-29T15:54:32+00:00","dateModified":"2024-12-30T10:55:24+00:00","description":"foojay is the place for all OpenJDK Update Release Information. 
Learn More.","breadcrumb":{"@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#primaryimage","url":"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png","contentUrl":"https:\/\/foojay.io\/wp-content\/uploads\/2024\/05\/wikiindex.png","width":1416,"height":1600},{"@type":"BreadcrumbList","@id":"https:\/\/foojay.io\/today\/indexing-all-of-wikipedia-on-a-laptop\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/foojay.io\/"},{"@type":"ListItem","position":2,"name":"Indexing all of Wikipedia, on a laptop"}]},{"@type":"WebSite","@id":"https:\/\/foojay.io\/#website","url":"https:\/\/foojay.io\/","name":"foojay","description":"a place for friends of OpenJDK","publisher":{"@id":"https:\/\/foojay.io\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/foojay.io\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/foojay.io\/#organization","name":"foojay","url":"https:\/\/foojay.io\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/foojay.io\/#\/schema\/logo\/image\/","url":"https:\/\/foojay.io\/wp-content\/uploads\/2020\/04\/cropped-Favicon.png","contentUrl":"https:\/\/foojay.io\/wp-content\/uploads\/2020\/04\/cropped-Favicon.png","width":512,"height":512,"caption":"foojay"},"image":{"@id":"https:\/\/foojay.io\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/foojay2020"]},{"@type":"Person","@id":"https:\/\/foojay.io\/#\/schema\/person\/5b4166dfd492b4cb9fea3f917af84c8a","name":"Jonathan 
Ellis","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/foojay.io\/wp-content\/uploads\/2025\/05\/jbellis-headshot-96x96.png","url":"https:\/\/foojay.io\/wp-content\/uploads\/2025\/05\/jbellis-headshot-96x96.png","contentUrl":"https:\/\/foojay.io\/wp-content\/uploads\/2025\/05\/jbellis-headshot-96x96.png","caption":"Jonathan Ellis"},"description":"Jonathan is the founder of Brokk (https:\/\/brokk.ai). Brokk keeps LLMs on-task in million-line codebases by adding compiler-grade understanding of your code's structure and semantics. Jonathan is also the author of JVector, co-founder of DataStax, and the founding project chair of Apache Cassandra.","sameAs":["https:\/\/brokk.ai","https:\/\/www.linkedin.com\/in\/jbellis\/","https:\/\/x.com\/spyced","https:\/\/www.youtube.com\/@Brokk_AI"],"url":"https:\/\/foojay.io\/today\/author\/jbellis\/"}]}},"_links":{"self":[{"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/posts\/111190","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/users\/191"}],"replies":[{"embeddable":true,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/comments?post=111190"}],"version-history":[{"count":0,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/posts\/111190\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/media\/111191"}],"wp:attachment":[{"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/media?parent=111190"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/categories?post=111190"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/foojay.io\/wp-json\/wp\/v2\/tags?post=111190"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}