> B-Trees won’t get obsolete > > <reaons> Among the many reasons that b-trees ar...

hinkley · on March 8, 2025

Software architecture is dominated by a few inequalities that tend to remain in the same order most of the time, get closer or farther away from each other, and only occasionally reorder themselves because some big breakthrough in one kind of hardware has no equivalence in another and things get out of whack.

When they get out of whack we try some really crazy algorithms that make sense when the inequalities invert, and then disappear when order is restored.

One of the first times networking got faster than local storage, John Ousterhout (of Raft fame) and his group at Berkeley introduced a distributed operating system called Sprite (which featured heavily in the Distributed Computing class I took) that could not only run processes on different machines but could also do live process migration - in the 1980's. NICs getting faster than disk also brought us in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage.

B-trees fit well with the dominant equation, so they never really go away.

twotwotwo · on March 9, 2025

> in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage

If we're talking caches and Redis, one thought about why they haven't gone away--

In practice I think sharing is part of what keeps them network services, not only access to a cluster's RAM. As a dev you reach for something shared and 'shaped' like a cache, and the in-RAM aspect may be vestigial but still comes with the package.

We use Redis when e.g. tracking stats on active client IPs to see which are up to shenanigans, a maybe common use where you want cluster-wide stats. When caching, we may still want one DB query per expiry period app-wide, not one per webserver.

We would probably be fine 'caching' some stuff in the DB; Django's cache package has supported it for a while! I'm sure one of the disk-using successors of an in-RAM store could hit perf targets (DragonflyDB looks cool!). Just not a priority to change stuff that's working. (Corollary: I bet some folks unhappy with their Redis RAM needs are more than happy to try Dragonfly.)

hot_gril · on March 7, 2025

Also if you have roughly increasing row IDs, theoretically a btree is faster to insert into at a large scale, even if you don't care for prefix lookups or inequalities. And I think faster to read from.

I'm not sure if that's why, but in practice, Postgres doesn't care too much for hash indexes. It technically has them, but they're non-default, not recommended, previously poorly supported, etc. The default is btree (and serial pk).

cryptonector · on March 8, 2025

But tables in PG are heaps, so appending to them is even faster than in b-trees.

dspillett · on March 8, 2025

This is specifically taking about indexed structures though, so inserting data in a way that means it can be efficiently searched afterwards. Comparing heaps to b-trees, hash tables, and other ones structures in this context isn't really a Granny Smith Vs Golden Delicious style comparison.

Many (though not all) databases allow the base data to be a heap for those situations where a pure heap or a heap with small supporting indexes is more efficient (very wide data for instance, particularly in intermediate structures for ETL processes (a longer process acronym might better describe this, say, ELTEL)).

cryptonector · on March 11, 2025

A SQL table that has a `PRIMARY KEY` is (or can be) an index. Indeed, in SQLite3 if you use the `WITHOUT ROWID` feature you get exactly that: a covering b-tree index where the `PRIMARY KEY` columns are a prefix of the index key.

Also in SQLite3 tables w/o `WITHOUT ROWID` are still b-tree indices, but they are indexed by the row ID, or the `INTEGER PRIMARY KEY` column (which if there is one, is the same thing as a row ID).

I.e., tables can be indices, and therefore covering indices can be tables.

If you have a table that has no `PRIMARY KEY` then a heap makes some sense, but then again so does a b-tree indexing by row ID. But then again a table without a `PRIMARY KEY` actually makes little sense in a world of relational algebra where tables are sets (as they are supposed to be in SQL) because a table without a `PRIMARY KEY` implies allowing duplicate rows, and then you need a row ID to deal with those... A table without a `PRIMARY KEY` is always a table that has an implied `PRIMARY KEY` that is just a row ID.

dspillett · on March 11, 2025

> A table without a `PRIMARY KEY` is always a table that has an implied `PRIMARY KEY` that is just a row ID.

While it is true that there will be a surrogate key (the row ID), that doesn't imply there is any sort of index as there is in SQLite.

In SQL Server a heap table is not represented by a tree indexed by rowID, there will be one or more IAM (Index Allocation Map) pages which just list the pages used by that file. These are not even ordered lists. The internal RowID is actually a 8-byte encoding of the data's location in the files (two bytes for file, four bytes for page, two bytes for slot) for instance “1:560:0” which is file 1, page 560, first slot. This is what is stored in the IAM page and what any indexes point to for finding data that is not part of the index (i.e. not part of the index key and also not “included”). That isn't even necessarily where the data is, just where it was first put: if the row is updated in a way that it grows beyond what will fit in the page it shares with other rows, it goes elsewhere and a forwarding record in the original slot will point to the new location (this can become a long list of jumps to find the actual data, if a row grows many times in its existence, and you are unlucky wrt what free space is available on the current page each time it does).

Yes, this is more than a bit nasty for most uses. That is why most (almost all) tables in SQL Server should have a clustered index (at which point it is no longer a heap). Heaps have their uses where they are better than tables with clustered indexes, but those are few and far between (off the top of my head, some intermediate tables for ETL processes is all I can think of, technically very small (one-or-few page) tables too, but any measured efficiency difference there is so small either way that it is, or may as well be, statistical noise).

Having a primary key does not stop a heap being a head (in SQL Server) as the primary key does not have to be a clustering key. It usually is, but there are times when another key is a better choice and there can only be one clustered key (ignoring the wide index hack that makes things more-or-less behave like there are multiple clustering keys, at the expense of extra storage space and much slower insert/update operations).

cryptonector · on March 11, 2025

> While it is true that there will be a surrogate key (the row ID), that doesn't imply there is any sort of index as there is in SQLite.

In PG heaps it does, though it's not stable across `VACUUM`. From what you say the same is true of SQL Server. The point being that two duplicate rows in a table w/o a `PRIMARY KEY` will have different "row IDs" or "ctids" or whatever even though that row ID or ctid will not be stable across vacuuming though it should be stable across a single query.

(In PG all tables are heaps, always. You can have indices, and even covering indices, and your read queries might never hit the heap, but a heap will be there. IMO PG ought to support a b-tree instead of a heap for tables with `PRIMARY KEY`s -- it would be an improvement, and a performance optimization.)

hot_gril · on March 9, 2025

It was unclear of me to say "row IDs." I meant primary keys, not something like ctid https://www.postgresql.org/docs/current/ddl-system-columns.h...

cryptonector · on March 11, 2025

If you declare a table that has no `PRIMARY KEY` (and no `UNIQUE` constraints on `NOT NULL` columns) you're essentially allowing duplicate rows. Either the RDBMS treats such a table as having something of a primary key on all columns (but NULLs sneak duplicate rows back in!), or the RDBMS gives you a row ID like concept (ctid counts) to truly disambiguate duplicate rows.

avinassh · on March 8, 2025

> But tables in PG are heaps, so appending to them is even faster than in b-trees.

elaborate? how it is different, than say, MySQL?

cryptonector · on March 10, 2025

In SQLite3 tables are b-trees.