russellthehippo 2 days ago [-]
Hey HN, I built this. Honker adds cross-process NOTIFY/LISTEN to SQLite. You get push-style event delivery with single-digit-millisecond latency without a daemon/broker, using your existing SQLite file. A lot of pretty high-traffic applications are just Framework+SQLite+Litestream on a VPS now, so I wanted to bring a sixer to the "just use SQLite" party.
SQLite doesn't run a server like Postgres, so the trick is moving the polling source from interval queries on a SQLite connection to a lightweight stat(2) on the WAL file. Many small queries are efficient in SQLite (https://www.sqlite.org/np1queryprob.html) so this isn't really a huge upgrade, but the cross-language result is pretty interesting to me - this is language agnostic as all you do is listen to the WAL file and call SQLite functions.
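In pseudocode terms, the watcher loop is just this (a hypothetical Python sketch, not Honker's actual implementation; real code also has to handle checkpoints truncating the WAL and the file being replaced wholesale):

```python
import os
import time

def watch_wal(db_path, on_change, interval=0.001, ticks=None):
    """Fire on_change whenever the WAL's (size, mtime) tuple moves.

    Hypothetical sketch only, not Honker's actual code.
    """
    wal_path = db_path + "-wal"
    last = None
    n = 0
    while ticks is None or n < ticks:
        try:
            st = os.stat(wal_path)          # the lightweight stat(2) poll
            cur = (st.st_size, st.st_mtime_ns)
        except FileNotFoundError:           # WAL may not exist yet
            cur = None
        if last is not None and cur != last:
            on_change()                     # a commit (probably) happened
        last = cur
        time.sleep(interval)
        n += 1
```

Any language that can stat a file and run SQL gets the same behavior, which is the language-agnostic part.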
On top of the store/notify primitives, honker ships ephemeral pub/sub (like pg_notify), durable work queues with retries and dead-letter (like pg-boss/Oban), and event streams with per-consumer offsets. All three are rows in your app's existing .db file and can commit atomically with your business write. This is cool because a rollback drops both.
This used to be called litenotify/joblite, but I bought honker.dev as a joke for my gf and realized that every mq/task/worker tool has a silly name: Oban, pg-boss, Huey, RabbitMQ, Celery, Sidekiq, etc. Thus a silly goose got its name.
Honker waddles the same path as these giants and honks into the same void.
Hopefully it's either useful to you or is amusing. Standard alpha software warnings apply.
andersmurphy 2 days ago [-]
Is the main use case for this for languages that only have access to process based concurrency?
Struggling to see why you would otherwise need this in Java/Go/Clojure/C#: your SQLite has a single writer, so you can notify all threads that care about inserts/updates/changes, since your application manages the single writer (with a language-level concurrent queue) and knows when it's writing and what it has just written. So it always felt simpler/cleaner to get notification semantics that way.
Still fun to see people abuse WAL in creative ways. Cool to see a notify mechanism that works for languages that only have process based concurrency python/JS/TS/ruby. Nice work!
zbentley 1 days ago [-]
There's more process-based concurrency than you'd expect in shops that use those languages.
Cron jobs might need to coordinate with webservers. Even heavily threaded webservers might have some subprocesses/forking to manage connection pools and hot reloads and whatnot. Suid programs are process-separated from non-suid programs. Plenty of places are in the "permanent middle" of a migration from e.g. Java 7 to Java 11 and migrate by splitting traffic to multiple copies of the same app running on different versions of the runtime.
If you're heavily using SQLite for your DB already, you probably are reluctant to replace those situations with multiple servers coordinating around a central DB.
Nit:
> languages that only have process based concurrency python/JS/TS/ruby
Not true. There are tons and tons of threaded Python web frameworks/server harnesses, and there were even before GIL-removal efforts started. Just because gunicorn/multiprocessing are popular doesn't mean there aren't loads of huge deployments running threads (and not suffering for it much, because most web stacks are IO bound). Ruby's similar, though threads are less heavily-used than in Python. JS/TS as well: https://nodejs.org/api/worker_threads.html
russellthehippo 2 days ago [-]
I actually hadn’t thought about it this way. The killer app I was imagining was 1ms reactivity without SQL polling, plus messaging that commits atomically with business writes, “one db”, and no daemon.
But this is actually a great main benefit as well.
infogulch 2 days ago [-]
He mentions Litestream, maybe this also works for litestream read-only replicas which may be in completely different locations?
russellthehippo 2 days ago [-]
Whoa I really hadn’t considered this. Do a litestream read replica, trigger across machines with S3 as the broker essentially. But you’re still stuck with the litestream sync interval. Maybe interesting for cross server notify?
infogulch 2 days ago [-]
I guess the idea is to have all writes go through a central server with local read replicas for improved read perf. The default litestream sync interval is 1s. I bet many use-cases would be satisfied with a few seconds delay for cross-region notifications.
russellthehippo 2 days ago [-]
It's good for pubsub but not for claim/ack workflow unless you do If-None-Match CAS semantics on a separate filesystem which, actually, yeah that's probably fine. Feels heavy on S3 ops. But! you do save on inter-AZ networking, the Warpstream hypothesis.
ncruces 2 days ago [-]
Claims kill this, IMO.
Unless you have a single "reader", don't mind the delay, and don't worry about redoing a bunch of notifications after a crash (and so can delay claims significantly), concurrency will kill this.
vrajat 1 days ago [-]
I wrote a simple queue implementation after reading the Turbopuffer blog on queues on S3. In my implementation, I wrote complete SQLite files to S3 on every enqueue/dequeue/ack, using the previous ETag for compare-and-set.
The experiment and back-of-the-envelope calculations show that it can only support ~5 jobs/sec. The only major lever to increase throughput is to increase the size of group commits.
I don't think shipping CDC instead of whole SQLite files will change the calculations, as the number of writes was what mattered in this experiment.
So yes, with a minimum of three writes per job, this approach only supports very low throughput.
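The arithmetic checks out if you assume something like 60 ms per conditional PUT (an illustrative number, not one from the original experiment):

```python
# Serialized CAS queue on S3: every job needs three conditional PUTs
# (enqueue, dequeue, ack), and each PUT is assumed to take ~60 ms.
put_latency_s = 0.060
writes_per_job = 3
jobs_per_sec = 1 / (writes_per_job * put_latency_s)  # about 5.6 jobs/sec

# Group commits are the lever: batching amortizes the PUTs across jobs.
batch_size = 10
batched_jobs_per_sec = batch_size / (writes_per_job * put_latency_s)
```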
russellthehippo 2 days ago [-]
exactly. then you're building distributed locking and it's probably time for a different tool
arowthway 2 days ago [-]
Nice, I had no idea that stat() every 1 ms is so affordable. Apparently it takes less than 1 μs per call on my hardware, so that's less than 0.1% CPU time for polling.
WJW 2 days ago [-]
"Syscalls are slow" is only mostly true. They are slower than not having to cross the userspace <-> OS barrier at all, but they're not "slow" like cross-ocean network calls can be. For example, non-VDSO syscalls in linux are about 250 nanoseconds (see for example https://arkanis.de/weblog/2017-01-05-measurements-of-system-...), VDSO syscalls are roughly 10x faster. Slower than userspace function calls for sure, but more than affordable outside the hottest of loops.
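For anyone who wants to reproduce the numbers locally, a quick microbenchmark (Python interpreter overhead is included, so the raw syscall is cheaper than what this reports):

```python
import os
import time

def mean_stat_cost(path, n=10_000):
    """Average wall-clock cost of n os.stat() calls, in seconds per call."""
    t0 = time.perf_counter()
    for _ in range(n):
        os.stat(path)
    return (time.perf_counter() - t0) / n

per_call = mean_stat_cost(".")
# Fraction of one core consumed if you issue one stat() per millisecond:
busy_fraction = per_call / 0.001
```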
vlovich123 2 days ago [-]
Filesystem stuff tends to be slower than average syscalls because of all the locks and complicated traversals needed. If this is using stat instead of fstat, then it's also going through the VFS layer: repeated calls likely hit the cache fast path for path resolution, but you still pay to fill in the stat structure. There are also hidden costs in that number, like atomic accesses that need to acquire cache-line locks, which cause hidden contention for other processes on the CPU, plus the cache dirtying from running kernel code and then having to repopulate it when leaving, all of which adds contended L3/RAM pressure.
In other words, there’s a lot of unmeasured performance degradation that’s a side effect of doing many syscalls above and beyond the CPU time to enter/leave the kernel which itself has shrunk to be negligible. But there’s a reason high performance code is switching to io_uring to avoid that.
russellthehippo 2 days ago [-]
Oh cool, so using io_uring plus PRAGMA data_version would actually beat stat on Linux, holistically speaking? The stat choice was all about cross-platform consistency over inotify speed. But syscall overhead can be real.
vlovich123 1 days ago [-]
“Beat” is all relative. It depends on load and how frequently you're doing it, but generally yes. But if you're doing io_uring, you may as well use inotify, because you're in platform-specific API territory anyway, and that's the biggest win: you're moving from polling to change detection, which is less overhead and lower latency. Inotify can be accessed via io_uring, and there may even be cross-platform libraries for your language that give you a consistent file-watcher interface (although probably not optimally over io_uring). Whether it's actually worth it is hard to say, as I don't know what problem you're trying to solve, but the lowest overhead looks like inotify+io_uring (it also has the lowest latency).
xenadu02 1 days ago [-]
If you're interested you can use kqueue on FreeBSD and Darwin to watch the inode for changes. Faster than a syscall, especially if all you need is a wakeup when it changes.
slashdev 2 days ago [-]
That’s ignoring the other costs of syscalls like evicting your stuff from the CPU caches.
But I agree with the conclusion, system calls are still pretty fast compared to a lot of other things.
vlovich123 2 days ago [-]
Small correction on ambiguous wording - syscalls do not evict all your stuff from CPU caches. It just has to page in whatever is needed for kernel code/data accessed by the call, but that’s no different from if it was done in process as a normal function call.
Polizeiposaune 2 days ago [-]
Depending on implementation details of your CPU and OS, the syscall path may need to flush various auxiliary caches (like one or more TLBs) to prevent speculation attacks, which may put additional "drag" on your program after syscall return.
vlovich123 1 days ago [-]
Correct but you’d also still have that drag just from the kernel dirtying those caches in the first place.
But I was clarifying because the wording could be taken as data/instruction cache and there generally isn’t a full flush of that just to enter/leave kernel.
ncruces 2 days ago [-]
Probably missing something, why is `stat(2)` better than `PRAGMA data_version`?
https://sqlite.org/pragma.html#pragma_data_version
Or for a C API that's even better, `SQLITE_FCNTL_DATA_VERSION`:
https://sqlite.org/c3ref/c_fcntl_begin_atomic_write.html#sql...
> [SQLITE_FCNTL_DATA_VERSION] is the only mechanism to detect changes that happen either internally or externally and that are associated with a particular attached database.
Another user itt says the stat(2) approach takes less than 1 μs per call on their hardware.
I wonder how these approaches compare across compatibility & performance metrics.
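The cross-connection behavior of the PRAGMA is easy to check from Python's sqlite3 module (a quick sketch, not anything from Honker):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "app.db")
watcher = sqlite3.connect(db_path)
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE events (body TEXT)")
writer.commit()

def data_version(conn):
    # Changes when a *different* connection commits to the database;
    # stays the same for this connection's own commits.
    return conn.execute("PRAGMA data_version").fetchone()[0]

before = data_version(watcher)
writer.execute("INSERT INTO events VALUES ('honk')")
writer.commit()
after = data_version(watcher)
assert after != before  # the watcher observed the writer's commit
```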
russellthehippo 1 days ago [-]
I just tested this out. PRAGMA data_version uses a shared counter that any connection can use while the C API appears to use a per-connection counter that does not see other connections' commits.
infogulch 16 hours ago [-]
Really? That's the opposite of what I understand the docs say.
Care to share your code? This may become a bug report.
russellthehippo 4 minutes ago [-]
Reporting back. This appears to have been a bug in my original test, which sadly I didn't commit anywhere. I went back to regenerate those tests and proved the opposite: the C API works across connections and is better than the PRAGMA. I'm going to make that update, as I've now verified across dozens of versions of SQLite that my original claim was wrong.
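For anyone wanting to rerun that cross-connection check against the C API, here's a rough ctypes harness (a sketch: the opcode value is copied from sqlite3.h and error handling is minimal):

```python
import ctypes
import ctypes.util
import os
import tempfile

SQLITE_FCNTL_DATA_VERSION = 35  # opcode value from sqlite3.h
SQLITE_OK = 0

def cross_connection_versions(libname, path):
    """Commit on one connection; read SQLITE_FCNTL_DATA_VERSION on another."""
    lib = ctypes.CDLL(libname)
    lib.sqlite3_open.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_void_p)]
    lib.sqlite3_exec.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                 ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p]
    lib.sqlite3_file_control.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                         ctypes.c_int, ctypes.c_void_p]

    def open_db():
        db = ctypes.c_void_p()
        assert lib.sqlite3_open(path.encode(), ctypes.byref(db)) == SQLITE_OK
        return db

    def version(db):
        v = ctypes.c_uint32()
        rc = lib.sqlite3_file_control(db, b"main",
                                      SQLITE_FCNTL_DATA_VERSION, ctypes.byref(v))
        assert rc == SQLITE_OK
        return v.value

    writer, watcher = open_db(), open_db()
    assert lib.sqlite3_exec(writer, b"CREATE TABLE t(x)",
                            None, None, None) == SQLITE_OK
    before = version(watcher)
    assert lib.sqlite3_exec(writer, b"INSERT INTO t VALUES (1)",
                            None, None, None) == SQLITE_OK
    after = version(watcher)
    # Whether after != before is exactly the cross-connection question above.
    return before, after

libname = ctypes.util.find_library("sqlite3")
if libname:  # skip quietly if the shared library can't be located
    before, after = cross_connection_versions(
        libname, os.path.join(tempfile.mkdtemp(), "t.db"))
```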
russellthehippo 15 hours ago [-]
Will do
psadri 2 days ago [-]
For one it seems to be deprecated.
ncruces 2 days ago [-]
It's not.
psadri 1 days ago [-]
You are correct. I apologize. I seem to have read the next pragma's deprecation notice!
Aside from this - SQLite has tons of cool features, like the session extension.
russellthehippo 2 days ago [-]
Yep, definitely still in use. Do y'all above have an opinion on whether the pragma is better than the syscall? What are the trade-offs there? Another comment thread mentioned this as well and pointed to io_uring. I was thinking that DBMS spam is worse than syscall spam.
ncruces 2 days ago [-]
Depends on what you mean by better.
I may be wrong, but I think you wrote somewhere that you're looking at the WAL size increasing to know if something was committed. Well, the WAL can be truncated, what then? Or even, however unlikely, it could be truncated, then a transaction comes and appends just enough to it to make it the same size.
If SQLite has an API it guarantees can notify you of changes, that seems better, in the sense that you're passing responsibility along to the experts. It should also work with rollback mode, another advantage. And I don't think it wakes you up if a large transaction rolls back (a transaction can hit the WAL and never commit).
That said, I'm not sure what's lighter on average. For a WAL mode database, I will say that something that has knowledge of the WAL index could potentially be cheaper? That file is mmapped. The syscalls involved are file locks, if any.
russellthehippo 2 days ago [-]
Interesting, thank you for the response and explanation. Honker workers/listeners are holding an open connection anyway. I do trust SQLite guarantees more than cross-platform sys behavior. I will explore the C API angle.
rich_sasha 2 days ago [-]
Pretty cool! I have a half baked version of something similar :)
Can you use it also as a lightweight Kafka - persistent message stream? With semantics like, replay all messages (historical+real time) from some timestamp for some topics?
As with pub/sub, you can reproduce this with some polling etc but as you say, that's not optimal.
russellthehippo 2 days ago [-]
Absolutely! That’s the durable pubsub angle for sure.
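As a sketch of what that could look like as plain rows (hypothetical table and column names, not Honker's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE stream (
    id    INTEGER PRIMARY KEY,   -- monotonic offset
    topic TEXT NOT NULL,
    ts    INTEGER NOT NULL,      -- unix timestamp
    body  TEXT NOT NULL
  );
  CREATE TABLE consumer_offsets (
    consumer TEXT PRIMARY KEY,
    last_id  INTEGER NOT NULL
  );
""")
conn.executemany("INSERT INTO stream (topic, ts, body) VALUES (?, ?, ?)",
                 [("orders", 100, "a"), ("orders", 200, "b"), ("billing", 150, "c")])

def replay(conn, consumer, topics, since_ts):
    """Historical + real-time replay: read from a timestamp, remember the offset."""
    marks = ",".join("?" * len(topics))
    rows = conn.execute(
        f"SELECT id, topic, body FROM stream "
        f"WHERE ts >= ? AND topic IN ({marks}) ORDER BY id",
        (since_ts, *topics)).fetchall()
    if rows:  # persist the per-consumer high-water mark
        conn.execute(
            "INSERT INTO consumer_offsets VALUES (?, ?) "
            "ON CONFLICT(consumer) DO UPDATE SET last_id = excluded.last_id",
            (consumer, rows[-1][0]))
    return rows

msgs = replay(conn, "analytics", ["orders"], since_ts=150)
```

From there, tailing new messages is just `WHERE id > last_id` after each wakeup.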
infogulch 2 days ago [-]
Neat idea!
Would it help if subscriber states were also stored? (read position, queue name, filters, etc) Then instead of waking all subscription threads to do their own N=1 SELECT when stat(2) changes, the polling thread could do Events INNER JOIN Subscribers and only wake the subscribers that match.
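Sketching the idea with made-up table names: the poller runs one join and wakes only matching subscribers, instead of every subscriber running its own SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE events (id INTEGER PRIMARY KEY, queue TEXT, body TEXT);
  CREATE TABLE subscribers (name TEXT PRIMARY KEY, queue TEXT, last_id INTEGER);
""")
conn.executemany("INSERT INTO subscribers VALUES (?, ?, ?)",
                 [("worker-a", "email", 0), ("worker-b", "billing", 0)])
conn.executemany("INSERT INTO events (queue, body) VALUES (?, ?)",
                 [("email", "welcome"), ("email", "receipt")])

def subscribers_to_wake(conn):
    # One query per poll tick: only subscribers with unseen events wake up.
    return [name for (name,) in conn.execute("""
        SELECT DISTINCT s.name
        FROM subscribers s
        JOIN events e ON e.queue = s.queue AND e.id > s.last_id
    """)]
```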
noveltyaccount 2 days ago [-]
This is really interesting. I'm building something on Postgresql with LISTEN/NOTIFY and Postgraphile. I'd love to (in theory) be able to have a swappable backend and not be so tightly coupled to the database server.
grumbelbart2 23 hours ago [-]
Very cool!
Another maybe stupid question, would something like inotify(7) help to get rid of any active polling?
hk1337 2 days ago [-]
I love the name!
russellthehippo 2 days ago [-]
honk
russellthehippo 1 days ago [-]
[Response to feedback]
Thanks all for your feedback, responses, and discussion. I've done a PR here taking your suggestions into account: https://github.com/russellromney/honker/pulls/1
The PR implements a three-layer polling architecture:
- PRAGMA data_version every 1ms
- stat every 100ms
- retry connection to handle blips
1. PRAGMA data_version every 1ms replaces stat-based (size, mtime) change detection. This is SQLite's own commit counter: monotonic, immune to clock skew, and it correctly handles WAL truncation and rolled-back transactions. It's a ~3µs nonblocking query. Credit to ncruces for pointing to this. This is done not for performance but for correctness, as it is slightly slower. tuo-lei also pointed out the truncation risk, which turned out to be more real than I thought.
Interesting note: I found in testing that the C API's SQLITE_FCNTL_DATA_VERSION does not work cross-connection. So for now honker continues paying the cost of going through the VFS layer, which vlovich123 pointed out, and now we make that tradeoff explicitly.
2. Reconnect-on-error: if the data_version query fails (disk blip, NFS hiccup, corrupted connection), honker tries to reconnect and wakes subscribers as a precaution. zbentley pointed me in this direction.
3. stat identity check every 100ms: compares (dev, ino) against startup values to detect file replacement (atomic rename, litestream restore, volume remount). data_version can't catch this because it polls through the open fd, which follows the original inode even after replacement. Credit to zbentley for the file-replacement scenarios.
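Roughly, the three layers compose like this (an illustrative sketch, not the PR's actual code):

```python
import os
import sqlite3
import time

def poll_loop(db_path, on_change, stop, fast=0.001, slow=0.100):
    """Sketch of the three-layer design described above."""
    conn = sqlite3.connect(db_path)
    baseline = os.stat(db_path)
    ident = (baseline.st_dev, baseline.st_ino)
    last_version = None
    last_identity_check = 0.0
    while not stop.is_set():
        # Layer 1: PRAGMA data_version at the fast interval.
        try:
            v = conn.execute("PRAGMA data_version").fetchone()[0]
        except sqlite3.Error:
            # Layer 2: reconnect on error, wake subscribers as a precaution.
            conn.close()
            conn = sqlite3.connect(db_path)
            on_change()
            continue
        if last_version is not None and v != last_version:
            on_change()
        last_version = v
        # Layer 3: (dev, ino) identity check at the slow interval.
        now = time.monotonic()
        if now - last_identity_check >= slow:
            st = os.stat(db_path)
            if (st.st_dev, st.st_ino) != ident:
                raise RuntimeError("database file was replaced")
            last_identity_check = now
        time.sleep(fast)
```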
Again - thanks for the discussion, honker got better because of it and I learned some stuff. See you round
JoelJacobson 2 days ago [-]
Shameless plug: In the upcoming release of PostgreSQL 19, LISTEN/NOTIFY has been optimized to scale much better with selective signaling, i.e. when lots of backends are listening on different channels, patch: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
Couldn't you use inotify (and/or some cross-platform wrapper) to watch for WAL changes without polling?
russellthehippo 2 days ago [-]
Breaks cross-platform - specifically, macOS swallows notifications silently. stat just works.
Retr0id 2 days ago [-]
I don't believe this to be true.
russellthehippo 2 days ago [-]
See comment below - Darwin silently drops same-process notifs. I could change the behavior depending on same- vs cross-process and platform, but I wanted “just one thing to worry about”. Potentially a good optimization later. Would help reduce syscalls.
Retr0id 2 days ago [-]
I believe you are mistaken. If you are referring to the comment from ArielTM, that's an LLM bot regurgitating your readme.
russellthehippo 2 days ago [-]
Apologies for not being specific.
The specific thing I'm talking about is this: write events don't fire until the file handle is closed. [1] I didn't validate this myself btw, but my original design was certainly trying to use notify events rather than stat polling. My research (heavily AI assisted of course) led me away from that path as platforms differ in behavior and I wanted to avoid that.
If this has been fixed somewhere or there is a better alternative I'd love to use that over polling. Current plan is to move to polling data_version for speed + occasional stat for safety. Getting rid of polling was my original goal but I compromised with syscalls.
xenadu02 1 days ago [-]
I have no idea why they aren't using kqueue but that works on macOS and FreeBSD. It has for years.
You want EVFILT_VNODE with NOTE_WRITE. That's hooked up to VNOP_WRITE in the kernel, the call made to the relevant filesystem to actually perform the write.
tuo-lei 2 days ago [-]
atomic commit with the business data is the selling point over separate IPC. external message passing always has the 'notification sent but transaction rolled back' problem and that gets messy.
one thing i'm curious about: WAL checkpoint. when SQLite truncates WAL back to zero, does the stat() polling handle that correctly? feels like there's a window where events could get lost.
kippinsula 21 hours ago [-]
the atomicity is the whole game. we burned time on a Postgres+SQS setup where the enqueue happened in a trigger that fired before the commit was visible to other connections. added retry logic, then polling on the worker side, then eventually moved the enqueue inside the transaction. at that point you're basically reinventing what Honker does, just with more moving parts. the 'notification sent, row not committed' class of bug is usually silent and timing-dependent, which makes it brutal to track down.
russellthehippo 2 days ago [-]
The WAL file sticks around but gets truncated, so that counts as an update. Though I don't have tests for this. Good input, thanks - I'll make sure to cover it.
nzoschke 2 days ago [-]
Thanks for this!
I have a proliferation of small apps backed by SQLite. And most of these need a queue and scheduler.
I home rolled some stuff for it but was always pining for the elegance of the Postgres solutions.
Will give this a spin very soon
russellthehippo 1 days ago [-]
"a small proliferation" is a nice way to describe the cluster that is my side-project habit. if you bump into any issues please open a PR or drop an issue on the repo!
davisphan04 4 hours ago [-]
Interesting approach—using WAL file stat instead of polling queries is clever. Curious how it behaves under high write contention?
ArielTM 2 days ago [-]
kqueue/FSEvents is tempting here, but Darwin drops same-process notifications. If you've got a publisher and listener in the same process the listener just never fires. Nasty thing to chase. stat polling looks gross but it's the only thing that actually works everywhere.
What happens on WAL checkpoint? When the file shrinks back, does that trigger a wakeup, or does the poller filter size drops?
xenadu02 16 hours ago [-]
This comment is completely incorrect.
kqueue VNODE events are delivered so long as your process has access to the file. There is no "same-process" notification filter.
russellthehippo 2 days ago [-]
Actually need to test this. Will report back
agentbonnybb 2 days ago [-]
This is the kind of skill-level tool I wish existed earlier — I hit the exact pain point running a daily-chronicle site off SQLite + a static deploy a week ago. Ended up with a crude polling loop because the alternatives all wanted me to install Postgres for a single notification semantic.
Question: any thoughts on what breaks first when a single process has 10k+ concurrent listeners? I'm curious whether the SQLite side can sustain what Postgres does cheaply.
russellthehippo 2 days ago [-]
10k listeners is a lot. Thundering herd issue at stat(). SQLite may not be your best choice at this scale.
russellthehippo 2 days ago [-]
also this is designed for a single machine. 10k listeners on one machine seems like a lot!
robertlagrant 2 days ago [-]
If I'm using SQLAlchemy, can this integrate? It seems to want to make the db connection itself.
russellthehippo 3 minutes ago [-]
I've added examples for many ORMs and web frameworks to the docs. See here:
Wouldn't processes on the same machine be able to use different IPCs that don't even touch the file? It's neat, but I have a feeling that in the vast majority of cases just passing the address over one of the IPC methods would be faster, and then SQLite itself would only be needed for the durable parts.
blacklion 2 days ago [-]
This extension piggybacks on SQLite's native transactions. For example, queued data will be rolled back if the transaction is rolled back due to some constraint violation.
It is possible to achieve this with external IPC, but it requires a lot of very careful programming.
PunchyHamster 18 hours ago [-]
honestly I've used those types of extensions so little I didn't even consider notifications being transactional and firing only when the transaction completes
aldielshala 1 days ago [-]
Nice project. I'm also working on something that pushes SQLite well beyond its typical use case. It's encouraging to see more people exploring what SQLite can really do.
Oxlamarr 1 days ago [-]
Very cool. Is the bottleneck under load mostly SQLite write throughput, or the WAL notification layer?
russellthehippo 24 hours ago [-]
writes and the claim/ack flow. really depends on your journal mode and synchronous mode as well.
notifs are extremely cheap, either in the old stat(2) mode or the new PRAGMA data_version (see my update in the feedback comment). Some other comments mentioned that stat(2) is about 1µs.
zbentley 1 days ago [-]
Very neat! I like this a lot, nice work.
After peeking at the source, a few possible areas of improvement:
- You can use `fstat` and keep a file handle around, likely further improving performance (well, reducing the performance hit to other users of the filesystem by not resolving vfs nodes). If you do this, you'll have to check for file deletions.
- If you do stick with stat(2), it might be a good idea to track the inode number from the stat result in addition to the time,size tuple. That handles the "t,s = 1,2; honker gets SIGSTOPped/CRIU'd; database file replaced; honker started again", as well as renameat/symlink-swap fiddling. Changing inode probably should just trigger a crash.
- Also check the device number from the stat call. It sounds fringe, but the number of weird hellbugs I've dealt with in my career caused by code continually interacting with a file at the same time as something else mounted an equivalent path "over" the directory the file was originally in is nonzero.
- It's been a few years since I fought with this, but aren't there edge cases here if the system clock goes backwards? IIRC the inode timestamp isn't monotonic--right? There are various strategies for detecting clock adjustment, of various reliability, that you could use here, if so. Just checking if the mtime-vs-system-clock diff is negative is a start.
That covers the more common of the "vanishingly uncommon but I've still seen 'em" cases related to file modification detection. Whether you choose to cope with people messing with the file via utime(2) is up to you (past a point, it feels like coping with malicious misuse rather than edge cases). But since your code runs in a loop, you're well-positioned to do that (and detect drift/manipulations of the system clock): track a monotonic clock and use it to approximate the elapsed wall time between honker poller ticks (say it fast with an accent, and you get https://www.bbc.com/news/world-latin-america-11465127); if the timestamp reported by (f)stat(2) ever doesn't advance at the same rate, fall back to checksumming the file, or crashing or something. But this is well into the realm of abject paranoia by now.
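The inode/device and clock checks can be sketched like so (hypothetical code, not from Honker):

```python
import os

class WalIdentity:
    """Track (device, inode) and mtime monotonicity across poll ticks.

    A sketch of the checks suggested above, not Honker's actual code.
    """
    def __init__(self, path):
        st = os.stat(path)
        self.path = path
        self.dev_ino = (st.st_dev, st.st_ino)
        self.last_mtime_ns = st.st_mtime_ns

    def check(self):
        st = os.stat(self.path)
        if (st.st_dev, st.st_ino) != self.dev_ino:
            # rename/replace, symlink swap, or an over-mount: bail loudly
            raise RuntimeError(f"{self.path} was replaced out from under us")
        if st.st_mtime_ns < self.last_mtime_ns:
            # mtime went backwards (clock adjustment, or someone ran utime):
            # stop trusting timestamps and fall back to content checks
            return "clock-suspect"
        self.last_mtime_ns = st.st_mtime_ns
        return "ok"
```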
It's been a decade or so since I worked in this area, so some of that knowledge is likely stale; you probably know a lot more than I do after developing this library even before considering how out-of-date my knowledge might be. When I worked on this stuff, I remember that statx(2) was going to solve all the problems any day now, and then didn't. More relevant, I also remember that the lsyncd (https://github.com/lsyncd/lsyncd) and watchman (https://github.com/facebook/watchman) codebases were really good sources of "what didn't I think of" information in this area.
But seriously, again, nice work! Those are nitpicks; this is awesome as-is!
russellthehippo 1 days ago [-]
Wow, thanks for the great feedback.
I actually looked at fstat, but the "check for deletions" piece, given I'm polling at 1 kHz, was the reason I decided not to use it. Older hardware actually made this a big issue but it's fast enough now that I decided it wasn't a problem.
I'll ignore the malicious ones bc [out of scope declaration]. Abject paranoia is an artifact of build trauma and I respect that lmao.
I've just looked into the device number and system clock issues. I think what I'll end up doing is actually a combo of ncruces's above comment and your feedback: a 1 kHz data_version and a 10 Hz stat() with version check. This gets around syscall load, avoids clock issues, avoids the WAL truncation issues that others have mentioned, and is both lighter weight and less bugabooable than my previous design.
Thanks again.
zbentley 15 hours ago [-]
Hope it helps!
One clarification: by "check for deletions" I didn't mean that you need to read back through the filesystem; you can check for deletions for free using fstat(2)'s result. The number of hard links to a file descriptor's underlying description returned by fstat includes the "existential" hard link of the file itself, and drops to zero when the file's deleted and the open handle is an orphan:
    import os
    import time
    from threading import Thread, Event

    f = '/tmp/foo.test'
    ev = Event()
    Thread(target=lambda: ev.wait() and os.unlink(f), daemon=True).start()
    with open(f, 'w+') as fh:
        print("before delete:", os.fstat(fh.fileno()).st_nlink)
        ev.set()
        time.sleep(1)
        print("after delete:", os.fstat(fh.fileno()).st_nlink)
russellthehippo 15 hours ago [-]
Ha. Great callout. Will inspect further
nodesocket 2 days ago [-]
Awesome. I’m currently using AWS SQS which invokes lambda functions for asynchronous tasks like email sends, but Honker seems like a great local replacement.
Any conflicts or issues when running Litestream as well?
russellthehippo 2 days ago [-]
Nope! The extension just functions as a shortcut for raw SQL. Litestream edits the WAL file, but only like a normal checkpoint would, so it's not too bad. Although I haven't tested it directly. Probably need to.
Another user itt says the stat(2) approach takes less than 1 μs per call on their hardware.
I wonder how these approaches compare across compatibility & performance metrics.
Care to share your code? This may become a bug report.
Aside from this - SQLite has tons of cool features, like the session extension.
I may be wrong, but I think you wrote somewhere that you're looking at the WAL size increasing to know if something was committed. Well, the WAL can be truncated, what then? Or even, however unlikely, it could be truncated, then a transaction comes and appends just enough to it to make it the same size.
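To make the hazard concrete, a small sketch (paths and table names arbitrary) showing that a TRUNCATE checkpoint rewinds the -wal file to zero bytes, so "the WAL grew" is not a reliable commit signal on its own:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (x)")
conn.commit()

wal = path + "-wal"
grown = os.stat(wal).st_size
assert grown > 0  # the commit appended frames to the WAL

# A TRUNCATE checkpoint rewinds the WAL to zero bytes...
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
assert os.stat(wal).st_size == 0
# ...so a later commit can land the WAL back at a previously seen size.
```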
If SQLite has an API it guarantees can notify you of changes, that seems better, in the sense that you're passing responsibility along to the experts. It should also work with rollback mode, another advantage. And I don't think it wakes you up if a large transaction rolls back (a transaction can hit the WAL and never commit).
That said, I'm not sure what's lighter on average. For a WAL mode database, I will say that something that has knowledge of the WAL index could potentially be cheaper? That file is mmapped. The syscalls involved are file locks, if any.
Can you use it also as a lightweight Kafka - persistent message stream? With semantics like, replay all messages (historical+real time) from some timestamp for some topics?
As with pub/sub, you can reproduce this with some polling etc but as you say, that's not optimal.
Would it help if subscriber states were also stored? (read position, queue name, filters, etc) Then instead of waking all subscription threads to do their own N=1 SELECT when stat(2) changes, the polling thread could do Events INNER JOIN Subscribers and only wake the subscribers that match.
Another maybe stupid question, would something like inotify(7) help to get rid of any active polling?
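One way the suggested subscriber-state table could look, as a rough sketch (the events/subscribers schema here is hypothetical, not honker's actual schema): one poller query finds which subscribers actually have unread events, instead of waking every subscriber to run its own SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT);
    CREATE TABLE subscribers (name TEXT PRIMARY KEY, topic TEXT, last_seen INTEGER);
    INSERT INTO subscribers VALUES ('billing', 'invoices', 0), ('audit', 'logins', 0);
    INSERT INTO events (topic, payload) VALUES ('invoices', '{"id": 1}');
""")

# Only subscribers whose topic has events past their stored offset are returned,
# so only those threads need to be woken.
to_wake = conn.execute("""
    SELECT DISTINCT s.name
    FROM subscribers s
    JOIN events e ON e.topic = s.topic AND e.id > s.last_seen
""").fetchall()
print(to_wake)  # [('billing',)]
```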
Thanks all for your feedback, responses, and discussion. I've done a PR here taking your suggestions into account:
https://github.com/russellromney/honker/pulls/1
The PR implements a three-layer polling architecture:
- PRAGMA data_version every 1ms
- stat every 100ms
- retry connection to handle blips
1. PRAGMA data_version every 1ms replaces stat-based (size, mtime) change detection. This is SQLite's own commit counter: monotonic, immune to clock skew, and it correctly handles WAL truncation and rolled-back transactions. It's a ~3µs nonblocking query. Credit to ncruces for pointing to this. The switch is made for correctness, not performance, as it is actually slightly slower. tuo-lei also pointed out the truncation risk, which turned out to be more real than I thought.
Interesting note: I found in testing that the C API's SQLITE_FCNTL_DATA_VERSION does not work cross-connection. So for now honker continues paying the cost of going through the VFS layer, which vlovich123 pointed out; now we make that tradeoff explicitly.
2. Reconnect-on-error: if the data_version query fails (disk blip, NFS hiccup, corrupted connection), honker tries to reconnect and wakes subscribers as a precaution. zbentley pointed me in this direction.
3. stat identity check every 100ms: compares (dev, ino) against startup values to detect file replacement (atomic rename, litestream restore, volume remount). data_version can't catch this because it polls through the open fd, which follows the original inode even after replacement. Credit to zbentley for the file-replacement scenarios.
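A rough sketch of how those three layers might compose in a single loop (the function name, intervals, and the max_ticks escape hatch are all illustrative, not honker's actual code):

```python
import os
import sqlite3
import time

def watch(path, on_change, poll=0.001, stat_every=0.1, max_ticks=None):
    """Poll a SQLite file for external commits; call on_change when one lands."""
    conn = sqlite3.connect(path)
    st = os.stat(path)
    ident = (st.st_dev, st.st_ino)  # identity captured at startup
    last_version = conn.execute("PRAGMA data_version").fetchone()[0]
    last_stat = time.monotonic()
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        ticks += 1
        try:
            version = conn.execute("PRAGMA data_version").fetchone()[0]
        except sqlite3.Error:
            conn = sqlite3.connect(path)  # layer 2: reconnect on blips
            on_change()                   # wake subscribers as a precaution
        else:
            if version != last_version:   # layer 1: external commit detected
                last_version = version
                on_change()
        if time.monotonic() - last_stat >= stat_every:
            last_stat = time.monotonic()
            now = os.stat(path)
            if (now.st_dev, now.st_ino) != ident:  # layer 3: file replaced
                raise RuntimeError("database file was replaced")
        time.sleep(poll)
```

The data_version check runs on the hot path; the stat identity check only fires every stat_every seconds, since file replacement is rare and a 100ms detection window is tolerable.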
Again - thanks for the discussion, honker got better because of it and I learned some stuff. See you round
The specific thing I'm talking about is this: write events don't fire until the file handle is closed. [1] I didn't validate this myself btw, but my original design was certainly trying to use notify events rather than stat polling. My research (heavily AI assisted of course) led me away from that path as platforms differ in behavior and I wanted to avoid that.
[1] https://github.com/notify-rs/notify/issues/240
You want EVFILT_VNODE with NOTE_WRITE. That's hooked up to VNOP_WRITE in the kernel, the call made to the relevant filesystem to actually perform the write.
one thing i'm curious about: WAL checkpoint. when SQLite truncates WAL back to zero, does the stat() polling handle that correctly? feels like there's a window where events could get lost.
I have a proliferation of small apps backed by SQLite. And most of these need a queue and scheduler.
I home rolled some stuff for it but was always pining for the elegance of the Postgres solutions.
Will give this a spin very soon
What happens on WAL checkpoint? When the file shrinks back, does that trigger a wakeup, or does the poller filter size drops?
kqueue VNODE events are delivered so long as your process has access to the file. There is no "same-process" notification filter.
Question: any thoughts on what breaks first when a single process has 10k+ concurrent listeners? I'm curious whether the SQLite side can sustain what Postgres does cheaply.
https://honker.dev/guides/orm/#sqlalchemy--sqlmodel
It is possible to achieve with external IPC, but it requires a lot of very careful programming.
notifs are extremely cheap, either in the old stat(2) mode or the new PRAGMA data_version (see my update in the feedback comment). Some other comments mentioned that stat(2) is about 1µs.
After peeking at the source, a few possible areas of improvement:
- You can use `fstat` and keep a file handle around, likely further improving performance (well, reducing the performance hit to other users of the filesystem by not resolving vfs nodes). If you do this, you'll have to check for file deletions.
- If you do stick with stat(2), it might be a good idea to track the inode number from the stat result in addition to the time,size tuple. That handles the "t,s = 1,2; honker gets SIGSTOPped/CRIU'd; database file replaced; honker started again", as well as renameat/symlink-swap fiddling. Changing inode probably should just trigger a crash.
- Also check the device number from the stat call. It sounds fringe, but the number of weird hellbugs I've dealt with in my career caused by code continually interacting with a file at the same time as something else mounted an equivalent path "over" the directory the file was originally in is nonzero.
- It's been a few years since I fought with this, but aren't there edge cases here if the system clock goes backwards? IIRC the inode timestamp isn't monotonic--right? There are various strategies for detecting clock adjustment, of various reliability, that you could use here, if so. Just checking if the mtime-vs-system-clock diff is negative is a start.
That covers the more common of the "vanishingly uncommon but I've still seen 'em" cases related to file modification detection. Whether you choose to cope with people messing with the file via utime(2) is up to you (past a point, it feels like coping with malicious misuse rather than edge cases). But since your code runs in a loop, you're well-positioned to do that (and detect drift/manipulations of the system clock): track a monotonic clock and use it to approximate the elapsed wall time between honker poller ticks (say it fast with an accent, and you get https://www.bbc.com/news/world-latin-america-11465127); if the timestamp reported by (f)stat(2) ever doesn't advance at the same rate, fall back to checksumming the file, or crashing or something. But this is well into the realm of abject paranoia by now.
It's been a decade or so since I worked in this area, so some of that knowledge is likely stale; you probably know a lot more than I do after developing this library even before considering how out-of-date my knowledge might be. When I worked on this stuff, I remember that statx(2) was going to solve all the problems any day now, and then didn't. More relevant, I also remember that the lsyncd (https://github.com/lsyncd/lsyncd) and watchman (https://github.com/facebook/watchman) codebases were really good sources of "what didn't I think of" information in this area.
But seriously, again, nice work! Those are nitpicks; this is awesome as-is!
I actually looked at fstat, but the "check for deletions" piece, given I'm polling at 1 kHz, was the reason I decided not to use it. Older hardware would actually have made this a big issue, but it's fast enough now that I decided it wasn't a problem.
I'll ignore the malicious ones bc [out of scope declaration]. Abject paranoia is an artifact of build trauma and I respect that lmao.
I've just looked into the device number and system clock issues. I think what I'll end up doing is actually a combo of ncruces's comment above and your feedback: a 1 kHz data_version poll and a 10 Hz stat() with identity check. This gets around syscall load, avoids clock issues, avoids the WAL truncation issues that others have mentioned, and is both lighter weight and less bugabooable than my previous design.
Thanks again.
One clarification: by "check for deletions" I didn't mean that you need to read back through the filesystem; you can check for deletions for free using fstat(2)'s result. The number of hard links to a file descriptor's underlying description returned by fstat includes the "existential" hard link of the file itself, and drops to zero when the file's deleted and the open handle is an orphan:
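A small sketch of that free check (the demo path is arbitrary): fstat(2) on the open descriptor reports the link count, which drops to zero once the file is unlinked and the handle becomes an orphan.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
fd = os.open(path, os.O_CREAT | os.O_RDWR)

# The file's own directory entry counts as one hard link.
nlink_before = os.fstat(fd).st_nlink

os.unlink(path)  # delete the file while the handle stays open

# The descriptor still works, but the underlying file has no links left.
nlink_after = os.fstat(fd).st_nlink

os.close(fd)
assert nlink_before == 1
assert nlink_after == 0
```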
Any conflicts or issues when running Litestream as well?