I build things:
https://beepb00p.xyz https://github.com/karlicoss https://twitter.com/karlicoss
- datetime handling can absolutely be a hot spot, especially if you're parsing or formatting them. Even for relatively simple things like "parse a huge csv file with dates into dataclasses".
In particular, the default implementation of datetime in cpython is a C module (with a fallback to a pure python one): https://github.com/python/cpython/blob/main/Modules/_datetim...
Not saying it's necessarily justified in the case of this library, but if they want to compete with stdlib datetime in terms of performance, some parts will need to be compiled.
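A toy illustration of that csv scenario (file layout and names are made up) -- even within stdlib, the C-accelerated fromisoformat makes a big difference over strptime:

    import csv
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Row:
        ts: datetime
        value: float

    def load(path: str) -> list[Row]:
        with open(path, newline='') as f:
            # datetime.fromisoformat is implemented in C and is much
            # faster than datetime.strptime for ISO-formatted timestamps
            return [Row(ts=datetime.fromisoformat(ts), value=float(v))
                    for ts, v in csv.reader(f)]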
- I argued that point in my article some time ago: https://beepb00p.xyz/configs-suck.html -- there's also an HN discussion from the time: news.ycombinator.com/item?id=22787332
- I was annoyed by cron/fcron limitations and figured systemd is the way to go because of its flexibility and power, but I was also annoyed about manually managing tons of unit files. So I wrote a tool with a config that looks kind of like a crontab, but uses systemd (or launchd on mac) behind the scenes: https://github.com/karlicoss/dron#what-does-it-do
E.g. the simplest job definition looks like this:

    job(every(mins=10), 'ping https://beepb00p.xyz', unit_name='ping-beepb00p')

But it's also possible to add more properties, e.g. arbitrary systemd properties, or custom notification backends (e.g. a telegram message or a desktop notification). Since it's python, I can reuse variables, use for loops, import jobs from other files (e.g. if there are shared jobs between machines), and check that it's valid with mypy -- see the sketch below.
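For instance, a sketch of defining a bunch of similar jobs with a plain for loop (the hosts here are made up, and how dron actually picks up the job list may differ a bit -- check the readme):

    # hypothetical list of hosts to monitor
    HOSTS = ['beepb00p.xyz', 'example.org']

    JOBS = [
        job(every(mins=10), f'ping https://{host}', unit_name=f'ping-{host}')
        for host in HOSTS
    ]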
I've been using this for years now, and I'm very happy with it -- one of the most useful tools I've written.
It's a bit undocumented and messy but if you are interested in trying it out, just let me know, I'm happy to help :)
- I sync about two hundred thousand files without any problems, especially if you aren't changing them all the time. The only issue I can imagine is that the initial sync with so many files might take a while, even if the total size isn't huge. As for several terabytes: for me it's a bit less than a terabyte in total across all synced folders, but I can't see why it wouldn't scale up. Similarly, the initial hashing might take some time, but otherwise it should handle it well.
- I root my phone to get access to my own data (typically sqlite databases in the protected /data/data partition). Then I feed it into HPI (Human Programming Interface) [1], and from there it gets into my plaintext search system [2] or promnesia [3]
[1] https://github.com/karlicoss/HPI#readme
[2] https://beepb00p.xyz/pkm-search.html#personal_information
[3] https://github.com/karlicoss/promnesia#readme
- First, big respect for working on software for so many years!
My question is: what data format is it using? I found some examples here [1], but it looks like a custom binary format?
Is there functionality to auto-export (e.g. on save) to plaintext (xml/json/whatever), so I could hook TreeSheets files up to other apps? I appreciate it would be lossy, but even a tree/graph structure with text nodes would be good.
E.g. I'm a big fan of using plaintext search over all of my personal data/information, even in siloed apps [2]
[1] https://github.com/aardappel/treesheets/tree/master/TS/examp...
[2] https://beepb00p.xyz/pkm-search.html#personal_information
- Hey, it's a bit dated, I've been meaning to update the post for a while, but haven't had time.
I actually bought a reMarkable 2 since then, but I didn't really end up using it much. IIRC the main reason was that annotations are in a custom format, and they are basically drawings made with a highlighter (as opposed to plaintext). I think there were some projects to match them against books and try to extract the text, but it didn't work reliably for me. I may be wrong though; maybe things have changed since.
That said, if you install KOReader on it, you get proper annotations, and I've been meaning to try to incorporate them into my flow.
- Some time ago I wanted the best bits from both worlds:
- from cron: specifying all jobs in one file instead of scattering them across dozens of unit files. In 90% of cases I just want a regular schedule and the command, that's it
- from systemd: mainly monitoring and logging. But also flexible timers, timeouts, resource management, dependencies -- for the remaining 10% of jobs which are a little more complicated
So I implemented a DSL which basically translates a python spec into systemd units -- that way I don't have to remember systemd syntax and manually manage the unit files. At the same time I benefit from the simplicity of having everything in one place.
An extra bonus is that the 'spec' is just normal python code
- you can define variables/functions/loops to avoid copy pasting
- you can use mypy to lint it before applying the changes
- I have multiple computers that share some jobs, so I simply have a 'common.py' file which I import from `computer1.py` and `computer2.py` -- the whole thing is very flexible (see the sketch below)
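Roughly like this (the shared job is made up for illustration, and how the job list gets collected may differ a bit from this sketch):

    # common.py -- jobs shared between machines
    SHARED = [
        job(every(mins=60), 'run-backup', unit_name='backup'),
    ]

    # computer1.py
    from common import SHARED

    JOBS = SHARED + [
        job(every(mins=10), 'ping https://beepb00p.xyz', unit_name='ping-beepb00p'),
    ]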
You can read more about it here:
- https://beepb00p.xyz/scheduler.html
- https://github.com/karlicoss/dron#what-does-it-do
I've been using this tool for several years now, with hundreds of different jobs across 3 computers, and it's been working perfectly for me. One of the best quality of life improvements I've made to my personal infrastructure.
- Yes.
- you use an actual programming language (which you're likely to already know) instead of desperately figuring out how to simulate a for loop in yaml or how to replace part of a string, or whatnot
- you have all Python static analysis tools at your disposal. mypy/pylint etc can make your deploys less error prone
- you can easily implement a custom DSL in python, so deploys end up more declarative and with less copy pasting
- it's easy to implement arbitrary custom operations (since you're working in python) -- and again, you can reuse variables etc. instead of having to use some weird templating or pass them on the command line (see the sketch below)
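E.g. a tiny sketch of the kind of DSL I mean (all names here are made up): a plain list of task objects that a runner would consume, built with a normal for loop instead of yaml templating, and checkable with mypy:

    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        command: str

    SERVICES = ['api', 'worker', 'scheduler']

    # a for loop instead of copy pasting blocks of yaml
    TASKS = [
        Task(name=f'restart-{svc}', command=f'systemctl restart {svc}')
        for svc in SERVICES
    ]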
- I stopped worrying and in most cases just write configs for my tools in python. Then you can just import/exec it and you're done. I can use all the operations/primitives I already know in python: string interpolation & operations, loops, pathlib, imports, etc. I can use mypy and all the other existing linting tools to make sure my configuration is correct, without having to write a custom linter (and basically reimplement 10% of mypy).
Shameless plug if you wanna read a longer analysis: https://beepb00p.xyz/configs-suck.html
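The import/exec part is just a few lines of stdlib, something like this (file and attribute names are made up):

    import importlib.util

    def load_config(path: str):
        spec = importlib.util.spec_from_file_location('config', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the config as regular python
        return module

    cfg = load_config('myconfig.py')
    print(cfg.CACHE_DIR)  # any attribute the config defines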
A great example is pyinfra https://github.com/Fizzadar/pyinfra#readme Think Ansible but instead of YAML you write Python. It provides a set of primitives/DSL and some rules you need to adhere to, but otherwise you just write regular python code.
- If you want a TLDR/teaser, watch for a minute from https://youtu.be/oI_X2cMHNe0?t=653 . It's fascinating :)
- Oh nice, I like it!
So it basically automates detecting the useful bits for a particular URL, but it's kind of time consuming and flaky. It could be very helpful for populating the 'rules' database though, and then this database could be shared with other people so they don't have to scrape.
I guess when I said ML (or preferably some fuzzy algorithm/heuristic), I was referring to generifying the rules so they also work on sites not in the rules database. If humans can detect garbage in a URL by looking at a few examples, the computer can too :)
- It's kind of tricky to do in the general case; e.g. even hackernews keeps meaningful semantic information in the id= query parameter.
Because of that it ultimately needs to be a site-specific database/algorithm, perhaps with a fallback to the default behaviour of simply cleaning up the most common garbage (_encoding/usg/etc). I suspect it's possible to use some sort of machine learning to guess the meaningful parts of the URL path/query/fragments, but even for that we'd need some human curation for the training set. I wish we could collaborate on a shared database/library for that; I've sketched some ideas/applications/prior art here: https://beepb00p.xyz/exobrain/projects/cannon.html
I started thinking about it since I have a similar problem in Promnesia (https://github.com/karlicoss/promnesia#readme), a knowledge management tool I'm working on. Ideally I want to normalise URLs so they address the exact bit of information, and nothing more.
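To make the 'site-specific rules with a default fallback' idea concrete, a rough sketch (the rules here are illustrative, not from an actual database):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # query parameters that are garbage on (almost) every site
    DEFAULT_GARBAGE = {'utm_source', 'utm_medium', 'utm_campaign', 'gclid', 'fbclid'}

    # site-specific whitelists: parameters that carry actual meaning
    KEEP_ONLY = {
        'news.ycombinator.com': {'id'},  # the item id is the whole point of the URL
    }

    def canonify(url: str) -> str:
        parts = urlsplit(url)
        params = parse_qsl(parts.query)
        if parts.netloc in KEEP_ONLY:
            params = [(k, v) for k, v in params if k in KEEP_ONLY[parts.netloc]]
        else:
            # fallback: just strip the most common garbage
            params = [(k, v) for k, v in params if k not in DEFAULT_GARBAGE]
        # dropping the fragment here for simplicity
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ''))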