Log storage and search system for structured logging data in Rust. i.e a databas...

notduncansmith · on March 21, 2020

I have a similar goal with this project, a multi-tenant log storage and retrieval system: https://github.com/notduncansmith/loghive

The idea is to shard the data by logical domain and by time segment, so that queries only apply to relatively small and efficiently-read data, and to exploit the embarrassingly-parallel nature of the problem.

jpgvm · on March 21, 2020

Yeah splitting by time is incredibly important.

I'm implementing the segmenting based on time + a sharding key to group together records of the same domain on the same shard. By sharing the shard with other domains it prevents having an overly large number of segments per time interval allowing them to be bigger and get better index density. Which is in important factor in my indexer design which amortizes the cost of the indices over large number of rows.

Storing logs in sqlite is definitely a neat way to go for smaller scale stuff, hope you make something cool out of it.

For me though I have faced this logs problem at very large scale numerous times in my career and have tried all manner of commercial and OSS solutions and have yet to be satisfied so my project is definitely geared to solving the sort of problems you have when everything else just either stops working or costs more to run than your actual app.

Not saying it won't scale down very well. I think my software on a single machine should easily handle atleast 20k+ logs/second (prototype is much faster atm but lots of features need to be added) and be able to serve queries concurrently with that on just a few cores and ~8-16gb/ram for say at 200-400GB dataset.

I think the 3 deployment sizes I will optimize for are single node for demo and benchmark purposes, 3 node for realistic small deployment and 6-10 nodes for high volume logging environments like my $DAY_JOB.

jotto · on March 21, 2020

I'm definitely interested if you OSS, I wrote a little about it here: https://jonathanotto.com/linear-search-benchmark

jpgvm · on March 21, 2020

Linear search approaches fall down when you have a lot of data and you only want to select a very small portion of it.

A linear based approach can get you to about 1GB/s or so per core with Rust.

A medium-ish size startup probably logs around 200GB/day of logs if they aren't very tight on their log volume. If you only want to search the last 24 hours that is maybe ok, you can search that in ~10-20 seconds on a single machine.

However this quickly breaks down when a) your log volume is a multiple of this and/or you want to search more than just a few hours.

In which case you need some sort of index.

There are different approaching to indexing logs. The most common is full text search indexing using an engine like Lucene. Elasticsearch (from the ELK stack) and Solr explicitly use Lucene. Splunk uses their own indexing format but I'm pretty sure it's in a similar vein. Papertrail uses Clickhouse which probably means they are using some sort of data skipping indices and lots of linear searching.

Of these approaches Clickhouse is probably the best way to go. It combines fast linear search with distributed storage and data skipping indices that reduce the amount of data you need to scan. (especially if you filter by PRE WHERE clauses).

So why not go with Clickhouse? Clickhouse requires a schema. You can do various things like flatten your nested structured data into KV (not a problem if you are already using a flat system) and have a single column for all keys and the other column for values. This works but doesn't get great compression, makes filtering ineffective for the most part and you now have to operate a distributed database that requires Zookeeper for coordination.

The reason I am choosing to build my own is that logs require unique indexing characteristics. First and foremost the storage system needs to be fully schemaless. Secondly you need to retain none word characters. The standard Lucene tokenizers in Elastic strip important punctuation that you might want to match on when searching log data. Field dimensionality can be very high so you need a system that won't buckle with metadata overhead when there are crazy numbers of unique fields, same goes for cardinality.

TLDR: For big users you must have indices in order not to search 20TB of logs for a month. Current indices suck for logs. I write custom index that is hella fast for regex.

manish_gill · on March 22, 2020

My first thought was also Clickhouse and the problems with schema for dynamic log content.

Also, if this is open source, I will definitely be interested in checking it out/contributing etc.

hodgesrm · on March 23, 2020

Here's another thought--why not try to fix this in ClickHouse? It sounds like you are rebuilding Elastic.

jpgvm · on March 23, 2020

I considered that but it's harder than it sounds. Clickhouse s very strongly coupled to the idea of a schema and it also is very coupled to only using indices for data skipping.

If I was to make the changes I want to Clickhouse, i.e schemaless and full indexes per/segment then it wouldn't be Clickhouse anymore.

wang502 · on March 21, 2020

How is it different from https://www.honeycomb.io/?

jpgvm · on March 21, 2020

Honeycomb seems to be more of a general events database, ala Druid.

This is a more specialised system that makes stronger tradeoffs to achieve really high efficiency for logs data. Stuff like indexing a reduced alphabet and not optimising for pivots and other views that are important for generic event databases.

Additionally Honeycomb is a hosted service.

I definitely intend for this to run in your own environment, on your k8s cluster, VMs, bare metal - whatever makes sense for you. If I do run it as a hosted service it will come secondary to the primary on-premise distribution.

wilhow · on March 23, 2020

This is a real need right now. Humio has a fairly novel approach and it searches ~10TB data in subsecs.

jpgvm · on March 24, 2020

Humio looks interesting but it appears to be a linear search approach. This is fine as I commented elsewhere and their numbers match what I was able to achieve with my linear based prototypes.

The reason I rejected this approach is it gets very expensive for large data volumes if you want your queries to remain responsive.

Say you want to search a 100 TB dataset (not as large as it sounds when it comes to log data...). You can do about 1GB/s/core assuming you have the data on local disks that can scan fast enough and each of your machines have 16 cores that is 16GB/s/machine. Lets say your query target time is 60s (pretty slow tbh, incredibly generous I would say).

The math then plays out like this. You can scan 60*16GB for each node in your cluster, i.e 960GB. You need to have that ~TB of data on disks that can read at 16GB/s which means you need it evenly spread across 8-16 very high end SSDs.

On top of that you need 100 of these machines to complete this query in 60 seconds.

Now Humio goes on to say you can store all your persistent data in a cloud bucket. Which is a good strategy and something I am employing too but if you have no indices you actually need to scan it all which means you are limited by how fast you can reasonably fetch the objects across your cluster and hence the speed of your network interfaces.

Say if you are on GCP which has a relatively fast network that seems capable of around 25Gbit to GCS most of the time and you actually get peak performance all the time (pretty unrealistic but ok). Then to fully scan 100TB in 60s you would need over 500 machines. If you are able to use Humios tags to reduce this somewhat say by only searching for errors and that gets you down to 10% of your total logs that would represent a 90% speedup. Humio sort of has quasi indexing in this way similar to Prometheus however they don't help you when what you are looking for isn't tagged to stand out.

This is why indices. Yes - indices are hard, yes they can have bad worst cases if you aren't super careful. However lets consider indices for this query.

Say your logs are syslog + some extra structured data. You have things like application_name, facility, level, etc. You have 100TB of logs to search, you are looking for logs by your main application which is say 60% of your total log volume, you are looking for DNS resolutions errors that contain a specific string "error resolving".

With my prototype my indices are approximately 5% the size of the ingested data. The raw ingested data is also compressed really highly, lets assume equal to Humio (it's probably higher due to file format but not important).

So the thing that jumps out here is we now only need 5TB of indices on our machines to reasonably find needles in our 100TB of data. Additionally our indices are split by field so if we know we are searching the message field we only need to load those. Indices for columns with low cardinality are extremely small, those for high cardinality much larger but capped due to various log specific optimisations I am able to use. So lets assume the message field makes up the majority of our index, say 80%. That brings us to 4TB of data we need to scan.

Now usually 60s would be super slow for a query for a system with indices like this, usually you would try make sure that 4TB is in RAM and simply blow through it in <1s across a few machines. However for comparisons sake lets say 60s is still our query budget and we don't have completely stupid amounts of RAM.

So we need to be able to scan 4TB/60s ~= 66GB/s which given our previous machine calcs with local storage puts us at ~5 machines.

However we could likely do this with even less CPU assuming our storage is fast enough as unlike an unindexed system like Humio we aren't applying expensive search algorithms to every row, we are simply crunching a ton of bitset operations in this case a metric ton of bitwise AND.

Anyway this is a long rant. The reason why is that many people always say "Can't you just do this fast enough with linear search" and I always have to reply "it depends on how big the haystack is". This quantifies what is too big of a haystack in a reasonable way.

akavel · on March 21, 2020

Definitely interested if FOSS. Also ideally if it's possible to dynamically create & modify filter pipelines/complex searches in it.

jpgvm · on March 21, 2020

What do you envision by filter pipelines and searches?

The query syntax I was thinking of focuses mostly filtering on and selecting data with regular expressions or equality matching on text and equality and range queries on numbers.

i.e something like this:

index_name.filter(message ~= "(hot|cold) storage" && drives.status > 100).select(drives.name, message, drives.status)

Or something more SQL like, haven't really settled on a query language yet mostly just been working on the indexing and the lower level parts of the query machinery that work with the index.

akavel · on March 22, 2020

I will try to answer you on keybase, hopefully it will be easier to discuss details.

tixocloud · on March 22, 2020

Would really be interested in the project. We’re building a tool to monitor model performance and data patterns and could see use of this.

lukevp · on March 21, 2020

How does this differ from Grafana Loki? This is a tool designed specifically for log indexing and searching as well.

jpgvm · on March 21, 2020

It's similar in some ways except Loki went all out in abandoning indexing of the log lines themselves, a quote from the homepage:

> It does not index the contents of the logs, but rather a set of labels for each log stream.

This system is different because not only does it index the contents of your log messages it accelerates regex queries which are great for structured messages produced by machines.

Similar to Loki though it does index data by "labels", though not quite in the same way. Instead every field of a log is treated the same (except for time, it's special). Nested structures are flattened by path, i.e '{"message: "a", "structure": {"nest1": "b"}' is flattened to "message" and "structure.nest1". Under the hood it also stores column per type so if you later send a number at the same field you previously sent text it doesn't coerce like Elasticsearch. Instead depending on your query it will either search "structure.nest1"(text) or "structure.nest1"(number), i.e if you use a regex it will infer you are searching the textual version, if you attempt a range query it will infer number as the type.

Loki is definitely an interesting system and looks like it could handle great ingest volume but I don't think it could match my targets for sub-second queries on ~10TB of logs.

Thaxll · on March 22, 2020

[flagged]

dang · on March 22, 2020

Please don't be a jerk in response to someone sharing what they're working on. If people can't share that without being jumped on, this quickly becomes a shitty place for conversation. Probably you didn't mean it that way, but it's difficult to gauge intent on the internet, so the burden is on the commenter to dismabiguate it.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

https://news.ycombinator.com/newsguidelines.html

jpgvm · on March 22, 2020

Did you read anything I wrote?

I have used all of the existing solutions that can scale to the workloads I require. None of them are adequate, all have deficiencies either in cost, scalability, ergonomics or operability.

If you read the rest of my comments here it should be clear this is something I have thought about for some time and have a well reasoned architecture and completely different take on how this problem should be approached.

This is not about features, this is about fundamentally changing the storage architecture from layout of the log data itself, to the indices to the query engine, even the distributed system model.

Most of the ideas are stolen from battle tested systems like Druid which I have worked with extensively and other systems I respect like Pilosa, Ok Log (which was a project in a similar vein).

Have a little optimism, things can be better if we just sit down and make them so.