If you disagree or otherwise think I'm wrong, please reply. I'm always willing to be educated, or to explain myself further if I was unclear before.
If I'm right and this annoys you, downvote without reply and I'll understand.
- That wasn't the first time they had similar products out-speeding Intel. I have the CPU from the first PC I owned tacked to the front of my current main PC with a Ryzen. That was clocked at 20MHz IIRC (I'm at my parental home ATM so can't confirm) where the Intel units topped out at 12MHz (unless overclocked, of course).
- > Per the spec [0], a URL can hold at least 8,000 characters.
> It is RECOMMENDED that all senders and recipients support, at a minimum, URIs with lengths of 8000 octets in protocol elements.
It is always worth remembering that, unless you have already ensured that the content has been rendered into a URI-safe subset of ASCII, a character and an octet are not the same thing.
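To illustrate the character/octet distinction, a minimal sketch using Python's standard library (the snowman character is just a convenient example of a non-ASCII character):

```python
from urllib.parse import quote

# One character is not necessarily one octet once percent-encoded
# into the URI-safe ASCII subset: a non-ASCII character expands to
# several octets (here, three UTF-8 bytes become nine octets).
snowman = "☃"            # one character
encoded = quote(snowman)  # "%E2%98%83"
print(len(snowman), len(encoded))
```

So an 8,000-octet budget can hold far fewer than 8,000 characters once the content is percent-encoded.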
- That is a very low bar. If methane's chemical representation were present in a DVD key, you could be done for a DMCA violation every time you fart.
- > It sets a bad precedent to call things like this hacks.
That ship sailed a long time ago. The “phone hacking scandal” in ~2010¹ was mostly calling answering services that didn't have PINs or other authorisation checks set.
These days any old trick gets called a hack; heck, tying your shoelaces might get called a miraculous footwear-securing hack.
--------
[1] https://en.wikipedia.org/wiki/News_International_phone_hacki...
- A fair few, I expect, amongst actively developed apps/utils/libs. Away from sid (unstable), Debian packages are often a bit behind upstream but still supported, so security fixes are often back-ported if the upstream project isn't also maintaining older releases that happen to match the version(s) in testing/stable/oldstable.
- That would only affect those calling out directly. Many scrapers operate through a battery of proxies so will be hidden by such a simple test.
If your goal is to be blocked by China's Great Firewall, then including mention of Tank Man (and the Tiananmen Square massacre more generally), plus certain Pooh-bear-related imagery, might help.
- > Kagi
I've been toying with that on and off for ages. I'm finally a paid-up user, because their guesswork engine (or makey-upy machine, or your preferred name) can be easily turned off, and stays off until requested otherwise.
- I think it is him. Chrome making blocking harder is one of the issues that has been pushing some users away (and a good portion of those in the direction of FF). If FF is not better in that regard then those moving away for that reason will go elsewhere, and those who are there already, at least in part for that reason, will move away.
If this happened it would be the final straw for me, if I wasn't already looking to change because of them confirming the plan to further descend into the great “AI” cult.
- I wouldn't say it is incredibly database specific; it is more database-type specific. For most general, non-sharded databases, random key values can be a problem as they lead to excess fragmentation in b-trees and similar structures.
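A toy sketch of the fragmentation point, using a sorted Python list as a crude stand-in for a b-tree's key ordering (not a real index, just an illustration of where inserts land):

```python
import bisect
import uuid

# Sequential keys always insert at the end of the key order, which a
# b-tree handles with good locality (pages fill and split at one edge).
sequential = []
for i in range(1000):
    pos = bisect.bisect(sequential, i)
    assert pos == len(sequential)  # every insert is an append
    sequential.insert(pos, i)

# Random UUIDv4 keys land all over the key order, so a real b-tree
# would be splitting pages throughout the structure.
random_keys = []
middle_inserts = 0
for _ in range(1000):
    k = str(uuid.uuid4())
    pos = bisect.bisect(random_keys, k)
    if pos < len(random_keys):
        middle_inserts += 1  # insert somewhere mid-structure
    random_keys.insert(pos, k)

print(middle_inserts)  # the overwhelming majority land mid-structure
```

Time-ordered identifiers (e.g. UUIDv7) aim to keep the random look while restoring the append-mostly insert pattern.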
- > Do not assume that UUIDs are hard to guess; they should not be used as security capabilities
It is not just about it being hard to guess a valid individual identifier in a vacuum. Random (or at least random-ish) values, be they UUIDs or undecorated integers, in this context are also about it being hard to guess one from another, or from a selection of others.
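A quick sketch of the difference (the helper name and example id are hypothetical, purely for illustration):

```python
import uuid

# With sequential identifiers, one leaked id lets an attacker walk the
# keyspace: the neighbours are trivially enumerable.
def neighbours_of_sequential(known_id, width=2):
    return [known_id + d for d in range(-width, width + 1) if d]

print(neighbours_of_sequential(41752))  # [41750, 41751, 41753, 41754]

# With random UUIDv4s (~122 bits of randomness each) there is no
# arithmetic relationship between two valid ids to exploit.
a, b = uuid.uuid4(), uuid.uuid4()
print(a, b)
```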
Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!
- It is, but partly because it is a common form in the training data. LLM output seems to use the form more than people, presumably either due to some bias in the training data (or the way it is tokenised) or due to other common token sequences leading into it (remember: it isn't an official acronym but Glorified Predictive Text is an accurate description). While it is a smell, it certainly isn't a reliable marker, there needs to be more evidence than that.
- I'm obviously in the wrong groups on facebook.
Oh, there is some passion the other way.
I'm happy that the down-vote-y anger here is on the correct side! (unless you are the only one who agrees and the other downs are from the “how dare you suggest I might do something wrong” mob)
- Interesting, I'll have to give that a detailed read later. It might be applicable to 3D prints.
To head off the people who will jump up-and-down calling me paranoid for not considering untreated printed works food safe, and accusing me of accusing them of poisoning family & friends (in some circles the discussion can get more cantankerous than the vi/emacs thing!): you keep using printed things for food without treatment if you like, and I won't judge, but I prefer to remain paranoid because if printed items were food safe it would be a selling point and I don't see any manufacturers using food based examples in their advertising.
- > your HN data […] is shared and licensed with all
TBH, if a service doesn't explicitly say what data I expose to it _won't_ be shared, I assume it will be immediately and repeatedly.
Though also if a service does explicitly say the data won't be shared, I still assume that it will eventually be given to the highest bidder, then the next highest, and the next, and so on. If not deliberately, it will at some point be hacked from without or unofficially exfiltrated from within.
And on a public site like HN all bets are off as the information is probably being scraped by everyone, their dogs, and their dogs' fleas, even more so now LLMs are such a big thing.
- I think mine is stated as 6.7" and pretty thin. I wouldn't say it fits comfortably in all conditions. In some trousers/shorts it either sticks out a bit or digs into my side when I'm sat down and bend (to tie a shoe, etc).
- > UNIX was UNICS which was a pun on MULTICS.
I doubt it is official, but I was told the name Unix was picked as it was "Multics with bits taken off".
> I couldn't for the life of me tell you what dd stands for.
I always assumed “data dump” or something like.
- > VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
- > I do not understand why the scrappers do not do it in a smarter way
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without any specific optimisations using other protocols. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, web scraping generally proves to be sufficient, so any push to write specific optimisations for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that just web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrappers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience that little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are already doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.
- There won't be one single reason. For some it is a dark sense of humour perhaps twisted a little too far off track, that perhaps they should keep in their own head or at least just between very close friends. For some it is simply money without caring that it might upset people: get enough engagement and ad impressions and it is worthwhile if you can ignore the moral aspect. Money might not be the objective at all, there are people who just want the attention, or the appearance of attention, and fake internet points (youtube views and such) sate their need at least temporarily. For some it is simply deliberate griefing, for all the reasons that is a thing generally. Or some mix of the above. None of it healthy IMO, but explainable.
In a few cases it is a dark in-joke between a small set of people that just happened to have used a public host for distribution, that unexpectedly went more viral.
- > The response.getheader method in urllib has been deprecated since 2023 … When the method was eventually removed, lots of code broke.
Two years doesn't seem long to me for a widely used project, unless you have an LTS version that people needing more stability can use, or you are upfront that your API support is two years or less. Of course API support of less than two years is fine, especially for a project that people aren't paying for, but personally I would be quite explicit from the outset (in fact I am with some bits I have out there: “this is a personal project and may change or vanish on a whim, use it in any workflow you depend on being stable at your own risk”). Or am I expecting a bit much there?
If using semver or similar you are fine to break the API at a major release point, that is what a major release means, though it would be preferable for you to not immediately stop all support for the previous major version.
> What if we intentionally made deprecated functions return the wrong result … sometimes?
Hell no. A complete break is far preferable. Making your entire API essentially a collection of undefined (or vaguely defined) behaviours is basically evil. You effectively render all other projects that have yours as a dependency into collections of vaguely defined behaviours too. If your API isn't set in stone, say so; then people have nothing to complain about unless they specifically ask you to keep something stable (and by “ask you to keep something stable” I mean “offer to pay for your support in that matter”).
> Users that are very sensitive to the correctness of the results…
That is: any user with any sense.
> might want to swap the wrong result for an artificial delay instead.
That is definitely more palatable. A very short delay to start with, getting longer until final deprecation. How short/long is going to be very dependent on use case and might be very difficult to judge: the shortest needs to be long enough to be noticeable to someone paying attention and testing between updating dependencies and releasing, but short enough that it doesn't completely break anything if someone updates and releases quickly, perhaps to bring in a bugfix that has begun to affect their project.
This is still evil, IMO, but a much lesser evil. I'd still prefer the complete break, but then again I'm the sort of person who would pay attention to deprecation notices, so unless you are a hidden nested dependency I'd not be affected.
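As a sketch of the delaying approach (all names here are hypothetical, not taken from urllib or any real library), a decorator that warns and stalls rather than ever returning a wrong result:

```python
import functools
import time
import warnings

def soft_deprecate(removal_version, delay_seconds):
    """Hypothetical sketch: warn on every call to a deprecated function
    and impose a small artificial delay, instead of breaking outright
    or (worse) sometimes returning wrong results."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed "
                f"in {removal_version}",
                DeprecationWarning,
                stacklevel=2,
            )
            # The maintainer would pass a longer delay with each
            # successive release, up to final removal.
            time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@soft_deprecate("v3.0", delay_seconds=0.05)
def getheader(name):  # stand-in for a deprecated API
    return {"content-type": "text/html"}.get(name.lower())
```

The result stays correct throughout; only the latency grows, which is exactly what someone paying attention to deprecation warnings and benchmarks would notice before the final removal.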