
karlicoss · 3,247 karma
Trapped in meatspace. Trying to break out.

I build things:

https://beepb00p.xyz

https://github.com/karlicoss

https://twitter.com/karlicoss


  1. Seems like the official export [0] has tags and annotations along with timestamps. However, if you'd like fuller, more structured data from the API (instead of a mess of CSV + JSON), you can use my tool [1] to export it. Here's an example of its output [2]

    [0] https://getpocket.com/export

    [1] https://github.com/karlicoss/pockexport?tab=readme-ov-file#s...

    [2] https://github.com/karlicoss/pockexport/blob/master/example-...

  2. datetime handling can absolutely be a hot spot, especially if you're parsing or formatting dates. Even relatively simple things like "parse a huge CSV file with dates into dataclasses" can end up dominated by it.

    In particular, the default implementation of datetime in CPython is a C module (with a fallback to a pure Python one): https://github.com/python/cpython/blob/main/Modules/_datetim...

    Not saying it's necessarily justified in the case of this library, but if it wants to compete with stdlib datetime on performance, some parts will need to be compiled.
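    As a rough illustration of the CSV-with-dates case above (a minimal sketch, not a benchmark of any particular library; the file layout and column names are made up):

      # Sketch: in loads like this, timestamp parsing is often the hot spot.
      import csv
      from dataclasses import dataclass
      from datetime import datetime

      @dataclass
      class Row:
          when: datetime
          value: float

      def load(path: str) -> list[Row]:
          with open(path, newline='') as f:
              return [
                  # fromisoformat is implemented in C; the equivalent
                  # strptime call is typically several times slower
                  Row(when=datetime.fromisoformat(r['when']), value=float(r['value']))
                  for r in csv.DictReader(f)
              ]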

  3. I argued that point in my article some time ago: https://beepb00p.xyz/configs-suck.html (also the HN discussion at the time: https://news.ycombinator.com/item?id=22787332)
  4. I was annoyed by cron/fcron limitations and figured systemd was the way to go because of its flexibility and power, but I was also annoyed at manually managing tons of unit files. So I wrote a tool with a config that looks kinda like a crontab, but uses systemd (or launchd on a Mac) behind the scenes: https://github.com/karlicoss/dron#what-does-it-do

    E.g. the simplest job definition looks like this:

      job(every(mins=10), 'ping https://beepb00p.xyz', unit_name='ping-beepb00p')
    
    But it's also possible to add more properties, e.g. arbitrary systemd properties, or custom notification backends (e.g. a Telegram message or a desktop notification).

    Since it's python, I can reuse variables, use for loops, import jobs from other files (e.g. if there are shared jobs between machines), and check that it's valid with mypy. For example:
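    (A hypothetical snippet: the hosts are made up, and job/every are the same helpers as in the example above -- the exact import may differ.)

      # Sketch: generate one ping job per host with a plain python loop.
      from dron import job, every  # assumption: the real import path may differ

      HOSTS = ['beepb00p.xyz', 'example.com']

      JOBS = [
          job(
              every(mins=10),
              f'ping https://{host}',
              unit_name=f'ping-{host.replace(".", "-")}',
          )
          for host in HOSTS
      ]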

    Been using this for years now and I'm very happy with it; one of the most useful tools I've written.

    It's a bit undocumented and messy, but if you're interested in trying it out, just let me know -- I'm happy to help :)

  5. Another thing I noticed is that homebrew python was noticeably slower on M2 compared to the pyenv one. I imagine homebrew compiles it with overly generic flags to support a wide range of Macs.
  6. It sparks a discussion around knowledge management, that's kinda nice :)
  7. Or you can use pipx; it deals with all the virtualenv business behind the scenes.
  8. FWIW, this never happened to me with Syncthing, but either way it's best to separate sync software from backup software. Sync might work flawlessly, but the user might make a mistake which would quickly wipe the files across all devices. I use borg for backups, highly recommend it.
  9. I sync about two hundred thousand files without any problems, especially since I'm not changing them all the time. The only issue I can imagine is that the initial sync with so many files might take a while, even if the total size isn't huge. As for several terabytes: for me it's a bit less than a terabyte in total across all synced folders, but I can't see why it wouldn't scale up. Similarly, the initial hashing might take some time, but otherwise it should handle it well.
  10. Not sure what happens if you commit on two devices before syncing, but the "worst" that has happened to me is an index conflict, which is easily fixed with 'git reset'.
  11. I root my phone to get access to my own data (typically sqlite databases in the protected /data/data partition). Then I feed it into HPI (Human Programming Interface) [1], and from there it gets into my plaintext search system [2] or Promnesia [3]

    [1] https://github.com/karlicoss/HPI#readme

    [2] https://beepb00p.xyz/pkm-search.html#personal_information

    [3] https://beepb00p.xyz/promnesia.html

  12. First, big respect for working on software for so many years!

    My question is: what data format is it using? I found some examples here [1], but it looks like a custom binary format?

    Is there functionality to auto-export (e.g. on save) to plaintext (XML/JSON/whatever), so I could hook TreeSheets files into other apps? I appreciate it would be lossy, but even a tree/graph structure with text nodes would be good.

    E.g. I'm a big fan of using plaintext search over all of my personal data/information, even in siloed apps [2]

    [1] https://github.com/aardappel/treesheets/tree/master/TS/examp...

    [2] https://beepb00p.xyz/pkm-search.html#personal_information

  13. Hey, it's a bit dated; I've been meaning to update the post for a while but haven't had time.

    I actually bought a reMarkable 2 since then, but I didn't really end up using it much. IIRC the main reason was that annotations use a custom format, and they are basically highlighter drawings (as opposed to plaintext). I think there were some projects to match them against the books and try to extract text, but that didn't work reliably for me. I may be wrong though; maybe things have changed since.

    That said, if you install KOReader on it, you get proper annotations, and I've been meaning to try to incorporate them into my flow.

  14. Yep, thanks! I really need to update the post with Zotero.
  15. Some time ago I wanted the best bits from both worlds:

    - from cron: specifying all jobs in one file instead of scattering them across dozens of unit files. In 90% of cases I just want a regular schedule and the command, that's it

    - from systemd: mainly monitoring and logging, but also flexible timers, timeouts, resource management, dependencies -- for the remaining 10% of jobs which are a little more complicated

    So I implemented a DSL which basically translates a python spec into systemd units -- that way I don't have to remember systemd syntax or manually manage the unit files. At the same time I benefit from the simplicity of having everything in one place.

    An extra bonus is that the 'spec' is just normal python code:

    - you can define variables/functions/loops to avoid copy pasting

    - you can use mypy to lint it before applying the changes

    - I have multiple computers that share some jobs, so I simply have a 'common.py' file which I import from `computer1.py` and `computer2.py` -- the whole thing is very flexible. For example:
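    (A hypothetical layout: the file names, commands, and job names are made up; job/every stand for dron's scheduling helpers.)

      # common.py -- jobs shared between machines (hypothetical)
      from dron import job, every  # assumption: the real import path may differ

      backup = job(every(mins=60), 'rsync -a ~/data/ backuphost:/srv/data/', unit_name='backup')

      # computer1.py -- machine-specific spec reusing the shared job
      from common import backup

      JOBS = [
          backup,
          job(every(mins=10), 'ping https://beepb00p.xyz', unit_name='ping-beepb00p'),
      ]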

    You can read more about it here:

    - https://beepb00p.xyz/scheduler.html

    - https://github.com/karlicoss/dron#what-does-it-do

    I've been using this tool for several years now, with hundreds of different jobs across 3 computers, and it's been working perfectly for me. One of the best quality-of-life improvements I've made to my personal infrastructure.

  16. Yes.

    - you use an actual programming language (which you likely already know) instead of desperately figuring out how to simulate a for loop in YAML, or how to replace part of a string, or whatnot

    - you have all the Python static analysis tools at your disposal; mypy/pylint etc. can make your deploys less error-prone

    - you can easily implement a custom DSL in python, so deploys end up more declarative and with less copy-pasting

    - it's easy to implement arbitrary custom operations (since you're working in python) -- and again, you can reuse variables etc. instead of having to use some weird templating or pass them on the command line
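    A minimal sketch of what such a DSL can look like (all names here are made up; this isn't any particular tool's API):

      # Sketch: deploys as plain python data, lintable with mypy.
      from dataclasses import dataclass

      @dataclass
      class Task:
          name: str
          cmd: list[str]

      def apt(*packages: str) -> Task:
          # a tiny "primitive" of the DSL
          return Task('apt ' + ' '.join(packages), ['apt-get', 'install', '-y', *packages])

      HOSTS = ['web1', 'web2']
      TASKS = [apt('nginx', 'git')] + [
          Task(f'deploy {h}', ['rsync', '-a', 'site/', f'{h}:/srv/site/'])
          for h in HOSTS
      ]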

  17. https://grep.app often helps with obscure code searches on GitHub
  18. As a workaround I'm using a keyword search bookmark in Firefox, mapped to 'g':

         https://www.google.com/search?q=%s&tbs=li:1
  19. Yep, can confirm: it basically filters out any notebook output from version control (while keeping it intact in the notebook file itself). This works seamlessly with diffing, committing, staging, etc.
  20. I stopped worrying and in most cases just write configs for my tools in python. Then you can just import/exec them and you're done. I can use all the operations/primitives I already know in python: string interpolation and operations, loops, pathlib, imports, etc. I can use mypy and all the other existing linting tools to make sure my configuration is correct, without having to write a custom linter (and basically reimplement 10% of mypy).
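    A minimal sketch of the idea (the file name and settings are made up):

      # config.py -- the config is just a python module
      from pathlib import Path

      CACHE_DIR = Path('~/.cache/mytool').expanduser()
      SOURCES = [CACHE_DIR / name for name in ('a.db', 'b.db')]  # loops instead of templating
      TIMEOUT_SECS = 30

      # the tool then simply does:
      #   import config
      #   print(config.SOURCES)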

    Shameless plug if you wanna read a longer analysis: https://beepb00p.xyz/configs-suck.html

    A great example is pyinfra (https://github.com/Fizzadar/pyinfra#readme). Think Ansible, but instead of YAML you write Python. It provides a set of primitives/DSL and some rules you need to adhere to, but otherwise you just write regular python code.

  21. If you want a TLDR/teaser, watch for a minute from https://youtu.be/oI_X2cMHNe0?t=653 . It's fascinating :)
  22. Oh nice, I like it!

    So it basically automates detecting the useful bits for a particular URL, but it's somewhat time-consuming and flaky. It could be very helpful for populating the 'rules' database though, and then this database could be shared with other people so they don't have to scrape.

    I guess when I said ML (or preferably some fuzzy algorithm/heuristic), I was referring to generalizing the rules so they also work on sites that aren't in the rules database. If humans can detect garbage in a URL by looking at a few examples, a computer can too :)

  23. Yeah, sadly, to get the canonical attribute you need to fetch the URL first (which is slow and wasteful). Also, sometimes the canonical URL still differs between the desktop and mobile versions of a site, so it has to be normalised after that anyway.
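    For reference, extracting the canonical link looks roughly like this (a stdlib-only sketch; error handling omitted):

      # Sketch: fetch a page and pull out <link rel="canonical">.
      from html.parser import HTMLParser
      from urllib.request import urlopen

      class CanonicalParser(HTMLParser):
          canonical = None

          def handle_starttag(self, tag, attrs):
              a = dict(attrs)
              if tag == 'link' and a.get('rel') == 'canonical':
                  self.canonical = a.get('href')

      def get_canonical(url: str) -> str | None:
          parser = CanonicalParser()
          # one network round trip per URL -- the slow, wasteful part
          parser.feed(urlopen(url).read().decode('utf-8', errors='replace'))
          return parser.canonical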
  24. your comments are always such an inspiration to read :) thank you!
  25. It's kind of tricky to do in the general case; e.g. even Hacker News keeps meaningful semantic information in the id= query parameter.

    Because of that, it ultimately needs to be a site-specific database/algorithm, perhaps with a fallback to default behaviour like simply cleaning up the most common garbage (_encoding/usg/etc.). I suspect it's possible to use some sort of machine learning to guess the meaningful parts of the URL path/query/fragments, but even for that we'd need some human curation for the training set. I wish we could collaborate on a shared database/library for that; I have sketched some ideas/applications/prior art here: https://beepb00p.xyz/exobrain/projects/cannon.html
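    The fallback part is simple enough to sketch with the stdlib (the blocklist below is just an illustration, not a curated database):

      # Sketch: strip common tracking junk from query strings, keep the rest.
      from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

      GARBAGE = {'_encoding', 'usg', 'utm_source', 'utm_medium', 'utm_campaign'}

      def clean(url: str) -> str:
          parts = urlsplit(url)
          qs = [(k, v) for k, v in parse_qsl(parts.query) if k not in GARBAGE]
          return urlunsplit(parts._replace(query=urlencode(qs)))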

    I started thinking about this since I have a similar problem in Promnesia (https://github.com/karlicoss/promnesia#readme), a knowledge management tool I'm working on. Ideally I want to normalise URLs so they address the exact bit of information, and nothing more.
