- > .. women reaching out against their abusive male partners. Which IS an issue and IS statistically more likely.
Be careful about your phrasing there. I hope the implied subject on both sides of the "and" is different. Women being victims is an issue, and women reaching out is significantly more likely.
Women reaching out is (obviously) not an issue, but is statistically more likely. Alternately, women being victims is an issue, but the statistical likelihood of women being victims is unknown, and we have good reason to believe there is significant reporting bias.
- Does your parent need to graduate to be considered a legacy?
My dad went to 3 different undergraduate colleges, one for each of his 3 years of undergrad, kicked the MCAT's teeth in, got into med school without having graduated, and then went to two different med schools. (A long, long time ago; probably not possible now.) Apparently the Mayo Clinic didn't mind his crazy academic record, and once he finished his residency at the Mayo, nobody else cared.
Mom went to one college, so maybe I would have been a legacy at 6 different institutions.
- Vastly under-estimating the magnitude of the task is how the crazy things get done.
Christopher Columbus wasn't unique in believing the world was round, he was rather unique in his vast under-estimation of the distance to Asia. The only reason he survived is dumb luck that the Americas were about where he thought Asia was. All of his doubters were correct that he would die before reaching Asia.
Of course, this is way, way down the list of reasons not to take Christopher Columbus as a role model.
- Thanks for this. I initially went with xfs back when there were license and quality concerns with zfs on Linux before btrfs was a thing, and moved to btrfs after btrfs was created and matured a bit.
These days, I think I would be happier with zfs and one RAID-Z pool across all of the disks instead of individual btrfs partitions or btrfs on RAID 5.
- For the completely randomized algorithm, my initial prototype was to always download the first block if available. After that, if fewer than 4 extents (continuous ranges of available bytes) had been downloaded locally, randomly choose any available block. (So, we first get the initial block and 3 random blocks.) If 4 or more extents were available locally, then always try the block after the last downloaded block, if available. (This is to minimize disk seeks.) If the next block wasn't available, then the first fallback was to check the list of available blocks against the list of blocks immediately following each locally available extent, and randomly choose one of those. (This is to choose a block that hopefully can be the start of a run of sequential downloads, again minimizing disk seeks.) If the first fallback found nothing, then the second fallback was to compute the same thing, except for the blocks before the locally available extents rather than the blocks after. (This is to avoid increasing the number of locally available extents if possible.) If the second fallback also found nothing, then the final fallback was to uniformly randomly pick one of the available blocks.
Trying to extend locally available extents if possible was desirable because peers advertised block availability as pairs of <offset, length>, so minimizing the number of extents minimized network message sizes.
This initial prototype algorithm (1) minimized disk seeks (after the initial phase of getting the first block and 3 other random blocks) by always downloading the block after the previous download, if possible, and (2) minimized network message size for advertising available extents by extending existing extents if possible.
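Roughly, in Python (a sketch only; block indices and half-open extents are my own framing, not the actual implementation):

```python
import random

def pick_block_prototype(available, have_extents, last_downloaded):
    """available: set of block indices peers are offering and we still need.
    have_extents: list of (start, end) block ranges we already have, end exclusive.
    last_downloaded: index of the last block we finished, or None."""
    if not available:
        return None
    # Always grab the first block first.
    if 0 in available and not any(s <= 0 < e for s, e in have_extents):
        return 0
    # Initial phase: fewer than 4 extents -> any available block at random.
    if len(have_extents) < 4:
        return random.choice(sorted(available))
    # Prefer continuing right after the last downloaded block (no seek).
    nxt = None if last_downloaded is None else last_downloaded + 1
    if nxt in available:
        return nxt
    # First fallback: a block just after one of our extents (extends it).
    after = [end for _, end in have_extents if end in available]
    if after:
        return random.choice(after)
    # Second fallback: a block just before one of our extents.
    before = [start - 1 for start, _ in have_extents if start - 1 in available]
    if before:
        return random.choice(before)
    # Final fallback: uniform random over whatever is available.
    return random.choice(sorted(available))
```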
Unfortunately, in simulation this initial prototype algorithm biased the availability of blocks in rare files in favor of blocks toward the end of the file. Any bias is bad for rapidly spreading rare content, and bias in favor of the end of the file is particularly bad for audio and video file types where people like to start listening/watching while the file is still being downloaded.
Instead, the algorithm in the initial production implementation was to first check the file extension against a list of extensions likely to be accessed by the user while still downloading (mp3, ogg, mpeg, avi, wma, asf, etc.).
For the case where the file extension indicates the user is unlikely to access the content until the download is finished (the general case algorithm), look at the number of extents (continuous ranges of bytes the user already has). If the number of extents is less than 4, pick any block randomly from the list of blocks that peers are offering for download. If there are 4 or more extents available locally, then for each locally available extent, check the block before it and the block after it to see if they're available for download from peers. If this list of available adjacent blocks is non-empty, then randomly choose one of those adjacent blocks for download. If the list of available adjacent blocks is empty, then uniformly randomly choose one of the blocks available from peers.
In the case of file types likely to be viewed while being downloaded, it would download from the front of the file until the download was 50% complete, and then randomly either download the first needed block, or else use the previously described algorithm, with the probability of using the previous (randomized) algorithm increasing as the percentage of the download completed increased. There was also some logic to get the last few chunks of files very early in the download for file formats that required information from a file footer in order to start using them (IIRC, ASF and/or WMA relied on footer information to start playing).
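A rough Python sketch of the production selection logic described in the last two paragraphs (the extension list is from above; the exact probability ramp and helper names are assumptions, and the footer-first logic for ASF/WMA is omitted):

```python
import random

STREAMED_EXTS = {"mp3", "ogg", "mpeg", "avi", "wma", "asf"}

def pick_block_general(available, have_extents):
    """General case: fewer than 4 extents -> pure random; otherwise prefer
    blocks adjacent to either end of an extent we already have."""
    if not available:
        return None
    if len(have_extents) < 4:
        return random.choice(sorted(available))
    adjacent = []
    for start, end in have_extents:          # end is exclusive
        if start - 1 in available:
            adjacent.append(start - 1)
        if end in available:
            adjacent.append(end)
    return random.choice(adjacent) if adjacent else random.choice(sorted(available))

def pick_block(extension, available, have_extents, fraction_done):
    """Streamed formats: front-of-file until 50% done, then an increasing
    probability of falling back to the randomized picker."""
    sequential = fraction_done < 0.5 or random.random() >= (fraction_done - 0.5) * 2
    if extension in STREAMED_EXTS and sequential and available:
        return min(available)                # earliest block we still need
    return pick_block_general(available, have_extents)
```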
Internally, there was also logic to check if a chunk was corrupted (using a Merkle tree using the Tiger hash algorithm). We would ignore the corrupted chunks when calculating the percentage completed, but would remove corrupted chunks from the list of blocks we needed to download, unless such removal resulted in an empty list of blocks needed for download. In this way, we would avoid re-downloading corrupted blocks unless we had nothing else to do. This would avoid the case where one peer had a corrupted block and we just kept re-requesting the same corrupted block from the peer as soon as we detected corruption. There was some logic to alert the user if too many corrupted blocks were detected and give the user options to stop the download early and delete it, or else to keep downloading it and just live with a corrupted file. I felt there should have been a third option to keep downloading until a full-but-corrupt download was had, retry downloading every corrupt block once, and then re-prompt the user if the file was still corrupt. However, this option would have resulted in more wasted bandwidth and likely resulted in more user frustration due to some of them hitting "keep trying" repeatedly instead of just giving up as soon as it was statistically unlikely they were going to get a non-corrupted download. Indefinite retries without prompting the user were a non-starter due to the amount of bandwidth they would waste.
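The bookkeeping for the "don't immediately re-request corrupt blocks" behavior is small; a minimal sketch with made-up names:

```python
def downloadable_blocks(needed, corrupted):
    """Blocks we are willing to request right now.  Corrupted blocks are
    retried only when there is nothing else left to ask for."""
    fresh = needed - corrupted
    return fresh if fresh else needed & corrupted

def fraction_complete(total_blocks, have, corrupted):
    # Corrupted blocks don't count as progress.
    return len(have - corrupted) / total_blocks
```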
- But if you have a single board computer with 1 GB of RAM and several TB of ZFS, will it just be slow, or actually not run? Granted, my use case was abnormal, and I was evaluating in the early days when there were both license and quality concerns with ZFS on Linux. However, my understanding at the time was that it wouldn't actually work to have several TB in a ZFS pool with 1 GB of RAM.
My understanding is that ZFS has its own cache apart from the page cache, and the minimum cache size scales with the storage size. Did I misunderstand, or is my information outdated?
- > No serious person designing a filesystem today would say it's okay to misplace your data.
Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.
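The stub itself was Java, but the idea is just "hash everything against a known-good manifest before launching"; a rough Python sketch with a hypothetical manifest format:

```python
import hashlib, json, subprocess, sys

def sha1_of(path, bufsize=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify_and_launch(manifest_path, launch_cmd):
    # manifest: {"relative/path/to/app.jar": "<expected sha1 hex>", ...}
    with open(manifest_path) as f:
        manifest = json.load(f)
    bad = [p for p, expected in manifest.items() if sha1_of(p) != expected]
    if bad:
        sys.exit(f"Corrupted files detected: {bad}")   # tell the user, don't launch
    subprocess.run(launch_cmd)
```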
When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to get my ISO illegally, so I found a torrent of the ISO, and wrote a Python script to download the ISO from Microsoft, but use the torrent file to verify chunks and re-download any corrupted chunks. Something was very wrong on some device between my desktop and Microsoft's servers, but it eventually got a non-corrupted ISO.
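Roughly, the idea was this (a sketch, assuming the piece length and per-piece SHA-1 digests have already been parsed out of the .torrent file; names are hypothetical):

```python
import hashlib
import urllib.request

def fetch_range(url, start, length):
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def download_verified(url, total_size, piece_length, piece_hashes, max_retries=10):
    """piece_hashes: list of 20-byte SHA-1 digests from the .torrent 'pieces' field."""
    pieces = []
    for i, expected in enumerate(piece_hashes):
        start = i * piece_length
        length = min(piece_length, total_size - start)
        for _ in range(max_retries):
            data = fetch_range(url, start, length)
            if hashlib.sha1(data).digest() == expected:
                pieces.append(data)           # piece verified, move on
                break
        else:
            raise IOError(f"piece {i} still corrupt after {max_retries} tries")
    return b"".join(pieces)
```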
It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.
Edit: side note: it's interesting how many complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers.
Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet. I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.
(Though, really, all read-write filesystems should be copy-on-write and log-based, at least for recent writes, maybe with some background process using a count-min sketch to estimate read frequency and optimize read locality for rarely changing data that's also frequently read.)
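(For reference, a count-min sketch is just a small, fixed-memory frequency estimator; a minimal version of the data structure, nothing filesystem-specific:)

```python
import random

class CountMinSketch:
    """Fixed-memory approximate counter: estimates never under-count, and
    over-count by a bounded amount with high probability."""
    def __init__(self, width=1024, depth=4, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row, salt in zip(self.rows, self.salts):
            yield row, hash((salt, key)) % self.width

    def add(self, key, count=1):
        for row, i in self._cells(key):
            row[i] += count

    def estimate(self, key):
        return min(row[i] for row, i in self._cells(key))
```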
Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (to intentionally use a different CRC polynomial than Ethernet, to catch more error patterns) to end-to-end checksum data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to raise flow control, corruption/loss detection, retransmission (and add forward error correction) in libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP using rateless forward error codes. Provided that everyone is implementing responsible (read TCP-compatible) flow control, it's really good to have the rapid evolution possible by just using UDP and raising other concerns to libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so for long-fat networks, you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.)
- > AFAIK the ObjC compiler can do this step even during compilation so that no method string names are included in the binary, but I'm not sure.
That would be possible for static binaries, but I don't see how that would work for dynamic libraries. Two libraries or a library and the executable would need the strings around in order to ensure they both got the same global address. You could mangle the strings to dynamic symbols so that it's just regular dynamic symbol resolution to get multiple loaded entities to agree on an address, but in that case, the selector string is still present in the binary in mangled form.
- I believe one rewrite of Python's dict was the first mainstream use of this sort of hash map as a default implementation.
I wish they provided a sort method to re-sort and re-index the vector to change the iteration order without the space overhead of creating and sorting a separate vector/list of keys (or key-value pairs, depending on use case; you might want to change iteration order based on the currently held values).
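For example, in Python today the usual workaround is to rebuild the dict, which materializes exactly the extra copy in question:

```python
scores = {"carol": 7, "alice": 3, "bob": 5}

# No in-place re-sort exists, so you materialize the items and rebuild,
# temporarily holding a second copy of all the key-value pairs:
scores = dict(sorted(scores.items(), key=lambda kv: kv[1]))

print(list(scores))   # ['alice', 'bob', 'carol'] -- new iteration order
```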
- I agree with you, but would phrase it differently.
You want some indication that any leak of your current password actually hasn't been mitigated. A failure message that your password hasn't actually changed (due to being identical) is functionally the same as allowing the password change and giving a warning that the passwords were identical (modulo some back-end details like if the password salt has changed and if the password change date has been updated).
- I lost a few GMail accounts because I changed countries and computers since I created them. I tried logging in, Google said my password was correct, but both the device and the IP were unfamiliar. I don't recall exactly what was wrong with using the recovery address to recover from the problem, but that didn't work, despite my still having access to my recovery email address. I think I might need to be able to tell Google what my recovery email address is, and I may have used one of those randomized + suffixes to my recovery address.
I used to use Google Authenticator with my GMail accounts, but disabled that out of fears it's just one more thing to go wrong, with Google providing little recourse.
My passwords have a bit over 96 bits of entropy, generated by extracting 256 bits from /dev/urandom as a multi-precision integer, divmod'ing to extract one instance from each of the character classes (digit, lower, upper, symbol) and then the rest from the combined alphabet (digit + lower + upper + symbol), with the leftover entropy finally used for a Fisher-Yates shuffle of the password so the first character isn't always a digit, etc. Passwords are per-site, stored using a gpg-based password manager I wrote in the early 2000s.
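A minimal sketch of that generation scheme in Python (the exact symbol set and the 16-character length are assumptions, not what my manager uses):

```python
import os
import string

DIGITS = string.digits
LOWER  = string.ascii_lowercase
UPPER  = string.ascii_uppercase
SYMBOL = string.punctuation          # assumed symbol set
ALL    = DIGITS + LOWER + UPPER + SYMBOL

def gen_password(length=16):         # with these alphabets, 16 chars is ~96 bits
    # 256 bits from the OS CSPRNG, treated as one big integer.
    n = int.from_bytes(os.urandom(32), "big")
    chars = []
    # One character from each class, extracted by repeated divmod.
    for alphabet in (DIGITS, LOWER, UPPER, SYMBOL):
        n, i = divmod(n, len(alphabet))
        chars.append(alphabet[i])
    # The rest from the combined alphabet (~6.5 bits per character).
    while len(chars) < length:
        n, i = divmod(n, len(ALL))
        chars.append(ALL[i])
    # Fisher-Yates shuffle driven by the leftover entropy, so the first
    # character isn't always a digit.
    for i in range(len(chars) - 1, 0, -1):
        n, j = divmod(n, i + 1)
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```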
MFA would still help for some types of ongoing active compromise, but not for dumps of password hashes from a DB compromise. It really kills me that recovery from my recovery email address doesn't work, even though I know my password.
Honestly, if you haven't logged in from anywhere in a few months and you have the correct password, they should at least just send some verification link/code to your recovery address without requiring you to tell them your recovery address. Sure, maybe don't say where you're sending the recovery link, but turning the recovery address into another password you need to memorize, without ever telling you that recovery effectively requires this weird combination of recovery email address and recovery password, is just highly annoying.
- Right. Government-mandated access to proprietary data isn't how the US breaks up monopolies, but somewhat along those lines, it might make sense for some government to provide some similar data. This seems much closer to a European style government approach, and I wouldn't expect such a thing in the U.S.
The infra for a decent crawl is prohibitive. There's a bit of black magic in crawl scheduling, and a bit in de-duplication, but most of the challenge is in scale.
I used to work on Google's indexing system, and sat with the guys who wrote the Percolator system that basically used BigTable triggers to drive indexing and make it less batch-oriented.
I know France has made at least a couple of attempts at a government-funded "Google killer" search engine. I think it would be a better use of government money to make something like a government-run event-driven first-level indexing system where search engine companies could pay basically cloud computing costs to have their proprietary triggers populate their proprietary databases based on the government-run crawling and first-level analysis. When one page updates, you'd want all of the search engine startups running their triggers on the same copy of the data, rather than having to stream the data out to each of the search engine startups.
Basically, you want to take some importance metric and some estimate of the probability the content has changed since the last time you crawled it, and combine the product of the two plus some additional constraints (crawl every known page at least once per some maximum period, don't hit any domain too hard, etc.) into a crawl priority. You then crawl the content, convert HTML, PDF, etc. to some marked-up text format (UTF-8 HTML isn't bad, but I think UTF-8 plain text plus some separate annotations in a binary format would be better). You strip out text that's too small or too close to the background color. You calculate one or more locality-sensitive hash functions over the plain text, cluster similar texts, pick a canonical URL for each cluster. You calculate the directed link graph across clusters. The PageRank patent has expired, so you could calculate PageRank and several other link-graph ranking signals across canonical clusters. You'd presumably compute some uniqueness scores, age scores, etc. for each canonical URL, and then in parallel run each search engine startup's analysis over this package of analysis data each time you find a change for a particular canonical URL.
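To make just the scheduling part concrete, a toy sketch (all weights, field names, and the politeness cap are made up):

```python
def crawl_priority(importance, p_changed, seconds_since_crawl,
                   max_age=30 * 86400):
    """Higher means crawl sooner: importance times the estimated probability
    the page changed, with a floor that forces everything to be recrawled
    at least once per max_age."""
    score = importance * p_changed
    if seconds_since_crawl > max_age:
        score = max(score, 1.0)          # overdue pages jump the queue
    return score

def next_batch(pages, per_domain_limit=2):
    """pages: dicts with url, domain, importance, p_changed,
    seconds_since_crawl.  Applies a crude per-domain politeness cap."""
    ranked = sorted(
        pages,
        key=lambda p: crawl_priority(p["importance"], p["p_changed"],
                                     p["seconds_since_crawl"]),
        reverse=True,
    )
    taken, per_domain = [], {}
    for p in ranked:
        if per_domain.get(p["domain"], 0) < per_domain_limit:
            per_domain[p["domain"]] = per_domain.get(p["domain"], 0) + 1
            taken.append(p["url"])
    return taken
```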
You might have some startups providing spam scoring or other analysis and providing that (for fees, of course) to search engine startups, etc. Basically, you want to modularize the indexing and analysis to provide competition and nearly seamless transition between competing providers within your ecosystem.
I think that's the way to drive innovation in the search engine startup space and properly leverage economies of scale across search engine startups.
- It was a collaborative algorithmic optimization exercise. There wasn't a "right" answer I was looking for. If they noticed something I hadn't, that would have been great. Collaborative algorithmic optimization has been a part of my job across several industries.
Among other things, I used to work on Google's indexing system, and also high volume risk calculations and market data for a large multinational bank, and now high volume trading signals for a hedge fund.
For instance, corporate clients of banks sometimes need some insurance for some scenario, but if you can narrow down that insurance to exactly what they need, you can offer them that insurance cheaper than competitors. These structured products/exotic options can be difficult to model. For instance, say an Australian life insurance provider is selling insurance in Japan, getting paid in JPY, and doing their accounting in AUD. They might want insurance against shifts in the Japanese mortality curve (Japanese dying faster than expected) over the next 30 years, but they only need you to cover 100% of their losses over 100 million AUD. You run the numbers, you sell them this insurance at a set price in AUD for the next 30 years, and you do your accounting in USD. (The accounting currency (numeraire) is relevant.) There's basically nobody who would be willing to buy these contracts off of you, so to a first-order approximation, you're on the hook for these products for the next 30 years.
If you can offer higher fidelity modeling, you can offer cheaper insurance than competitors. If you re-calculate the risk across your entire multinational bank daily, you can safely do more business by better managing your risk exposures.
Daily re-calculations of risk for some structured products end up cutting into the profit margins by double-digit percentages. Getting the algorithms correct, minimizing the amount of data that needs to be re-fetched, and maximizing re-use of partial results can make the difference of several million dollars in compute cost per year for just a handful of clients, and determines if the return-on-investment justifies keeping an extra structurer or two on the desk.
In order to properly manage your risk, determine what other trades all of your businesses are able to make every day, etc., you need to calculate your risk exposure every day. This is basically the first partial derivatives of the value of the structured product with respect to potentially hundreds of different factors. Every day, you take current FX futures and forward contracts to estimate JPY/AUD and AUD/USD exchange rates for the next 30 years. You also make 30-year projections of the Japanese mortality curve, and use credit index prices to estimate the probability the client goes out of business (counterparty risk) over the next 30 years. Obviously, you do your best to minimize re-calculation for the 30 years, to incrementally update the simulations as the inputs change rather than re-calculating from scratch.
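In the simplest bump-and-revalue form, those first partial derivatives look something like this (the pricing function and factor names are placeholders; real systems reuse far more intermediate state than this):

```python
def sensitivities(price, factors, bump=1e-4):
    """price: function mapping a dict of market factors to a value.
    Returns d(price)/d(factor) for each factor via central differences."""
    base = dict(factors)
    greeks = {}
    for name in factors:
        up, down = dict(base), dict(base)
        up[name] += bump
        down[name] -= bump
        greeks[name] = (price(up) - price(down)) / (2 * bump)
    return greeks

# e.g. factors = {"JPYAUD_5y_fwd": 0.0105, "JP_mortality_60_64": 0.0061}
```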
For the next 30 years, shifts in the Japanese mortality curve for the structured product desk affect the amount of trading in Japanese Yen and Australian Dollars that the FX desk can do, how much the Japanese equities desk needs to hedge their FX exposure, etc. You could have fixed risk allocations for each desk, but ignoring potential offsetting exposures across desks means leaving money on the table.
You can't sell these trades to another party if you find your risk calculations are getting too expensive. You are stuck for 30 years, so your only option is to roll up your sleeves and really get optimizing. Either that, or you re-calculate your risk less often and add in a larger safety margin and do a bit less business across lots of different trading desks.
I started asking this as an interview question when I noticed a colleague had implemented several O(N^2) algorithms (and even one O(N^3)) that had O(N) alternatives. Analysis for a day of heavy equity trading went from 8 hours down to an hour once I replaced my colleague's O(N^2) algorithm with an O(N) algorithm.
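Not the actual code, of course, but the shape of the fix was usually something like replacing an inner linear scan with a dict lookup:

```python
# O(N^2): for every fill, linearly scan all orders to find its parent order.
def match_fills_quadratic(orders, fills):
    out = []
    for f in fills:
        for o in orders:                      # linear scan per fill
            if o["id"] == f["order_id"]:
                out.append((o, f))
                break
    return out

# O(N): index the orders once, then each lookup is O(1).
def match_fills_linear(orders, fills):
    by_id = {o["id"]: o for o in orders}
    return [(by_id[f["order_id"]], f) for f in fills if f["order_id"] in by_id]
```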
The point of the exercise is mostly to see if they can properly analyze the effects of algorithm changes, in an environment where the right algorithm saves tens of millions of dollars per year.
Granted, it's a bit niche, but not really that niche. I've been doing this sort of stuff in several different industries.
- The problem with Hypercard is that it used an extremely restricted subset of English constructions. Once you got used to the limitations, it was fine, but could be very frustrating for users learning what sorts of phrasing Hypercard expected.
Allowing too large a subset of English ends up allowing more ambiguous statements and more user surprise. Also, more complex grammar increases the chances of mistakes in implementation.
My current experience with LLM hallucination makes it clear that at present, we can't just throw an LLM at the parsing and semantic analysis side of programming language implementation.
- I used to ask how to find the 10th percentile value from an arbitrarily ordered list as an interview question. Most candidates suggested sorting, and then I'd ask if they could do better. If they got stuck, I'd ask them which sorting algorithm they'd suggest. If they suggested quicksort, then I could gently guide them down optimizing quicksort to quickselect. Most candidates made the mistake of believing getting rid of half the work at every division results in half the work overall. They realized it was significantly faster, but usually didn't realize it was O(N) expected time.
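For reference, the quickselect endpoint of that discussion looks roughly like this (list copies instead of in-place partitioning, to keep the sketch short):

```python
import random

def quickselect(values, k):
    """k-th smallest element (0-based), expected O(N): partition like
    quicksort, but recurse into only the side that contains index k."""
    pivot = random.choice(values)
    less    = [v for v in values if v < pivot]
    equal   = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect(greater, k - len(less) - len(equal))

def percentile_10(values):
    # e.g. percentile_10([5, 1, 9, 3, 7, 2, 8, 4, 6, 0]) == 1
    return quickselect(list(values), max(0, len(values) // 10))
```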
If we had time, I'd ask about the worst-case scenario, and see if they could optimize heapsort to heapselect. Good candidates could suggest starting out with quickselect optimistically and switching to heapselect if the number of recursions exceeded some constant times the expected number of recursions.
If they knew about median-of-medians, they could probably just suggest introselect at the start, and move on to another question.
- Disclaimer, I work for a market-neutral fund, and have close friends high up in prop shops.
Presuming all strategies have a curve of diminishing marginal returns as assets under management increase, you would not expect any fund accepting outside money to have expected returns beating the market, but you would expect many of them to have a combination of correlation to the market and expected returns that would make them an attractive component in a basket of broad index ETFs and market-neutral funds. (Assuming risk-adjusted returns are the utility function being optimized. If variance is their preferred risk metric, this results in optimizing Sharpe ratio via mean-variance optimization, MVO.)
It's fair to assume that any fund manager is optimizing the sum of returns from their own personal investments in the fund plus fees from outside investors. They pick the place on the volume/risk-adjusted-returns curve that still keeps their fund attractive enough to outside investors, and maximizes their personal profits (personal returns plus fund fees).
If that optimal point on the volume/risk-adjusted returns curve for their particular strategy is at a point where risk-adjusted returns beat the market, then they maximize their returns by either never accepting outside funds (prop shops) or by not accepting additional funds and gradually buying out their investors (such as RenTech's famous Medallion fund).
So, (assuming diminishing marginal returns) it's not rational to simultaneously accept outside investment and beat the market on a risk-adjusted basis.
I suspect that many market-neutral funds could reliably beat the market on a risk-adjusted basis, but their volume/risk-adjusted-returns curve shape and their fee structures make it optimal for them to operate at a point on that curve where their expected returns are below the market.
Note that this rational self-interest optimization below market returns isn't bad for the investors. Under most fee structures, it ends up being close to maximizing total investor returns. Increasing percentage returns would mean kicking out some investors.
RenTech's Medallion Fund, many prop shops, and funds that are currently slowly buying out their investors seem to indicate there are at least some strategies where the optimal volume/returns trade-off is above market returns. You would expect all funds that are currently open to more outside investment to either be young and lacking capital or else have an optimal point on the volume/returns curve that is below market returns.
Note that as previously mentioned, a simple mean-variance optimization on a basket would allocate funds to both index ETFs and market-neutral funds returning a bit under the market on average. It's entirely possible that both fund investors and fund managers are being perfectly rational.
Of course, there are also plenty of people out there who fool themselves into thinking they know what they're doing. The world certainly isn't perfectly rational.
I'm just saying that in a perfectly rational world, assuming (1) utility function of risk-adjusted-returns (e.g. Sharpe ratio, resulting in mean-variance-optimization) (2) declining marginal returns on investment, you would expect all funds accepting outside investors (except for young funds desperate for money) to under-perform the market in expected returns.
Now, everyone talks about Sharpe ratio on the outside, but the particular risk models actually used internally by any fund are almost certainly not just variance of returns. I presume all funds simultaneously apply a mixture of commercially available risk models and internally developed risk models. Sharpe ratio is far from perfect, but it's a good least-common-denominator for discussion, and doesn't give away any secret sauce.
Side note: it would be rational for someone to take you up on your proposal and simply use index futures to take a highly leveraged position on your benchmark index. As long as they had enough money to make you whole in the case of bad tracking error and large downturns, their expected returns would be large. However, you wouldn't be very smart to take such an agreement instead of just getting leverage yourself. This demonstrates why risk-adjusted returns are usually more important than expected returns.
- I'm not aware of a way to get byte-for-byte identical source out of a Java class file or .NET assembly.
Last I checked, Java AoT compilation precluded runtime re-optimization, though I presume they've fixed that by now.
Last I checked, they both used stack-based bytecode, which typically takes longer to JIT and results in slower native code than a compressed SSA / control flow graph (see the SafeTSA papers).
- If your hash map uses open addressing, instead of a sparse array of pair<key, value>, you can have a vector<pair<key,value>> and a sparse array holding offsets into the vector. Depending on the sizes of keys, values, and offsets, as well as the average loading factor, this might or might not save space.
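A toy Python version of that open-addressing layout (deletions and resizing omitted to keep it short); re-sorting the dense list and rebuilding the sparse index gives the in-place re-sort described earlier:

```python
class IndexedHashMap:
    """Open addressing: a sparse table of offsets into a dense list of
    (key, value) pairs.  Iteration order is insertion order."""
    EMPTY = -1

    def __init__(self, capacity=16):
        self.index = [self.EMPTY] * capacity      # sparse: offsets or EMPTY
        self.entries = []                         # dense: (key, value) pairs

    def _slot(self, key):
        cap = len(self.index)
        i = hash(key) % cap
        while self.index[i] != self.EMPTY and self.entries[self.index[i]][0] != key:
            i = (i + 1) % cap                     # linear probing
        return i

    def put(self, key, value):
        i = self._slot(key)
        if self.index[i] == self.EMPTY:           # new key
            assert len(self.entries) + 1 < len(self.index), "toy version: no resizing"
            self.index[i] = len(self.entries)
            self.entries.append((key, value))
        else:                                     # existing key: update in place
            self.entries[self.index[i]] = (key, value)

    def get(self, key):
        i = self._slot(key)
        if self.index[i] == self.EMPTY:
            raise KeyError(key)
        return self.entries[self.index[i]][1]

    def sort(self, key=None):
        """Re-sort iteration order in place; only the sparse index is rebuilt."""
        self.entries.sort(key=key)
        self.index = [self.EMPTY] * len(self.index)
        for offset, (k, _) in enumerate(self.entries):
            self.index[self._slot(k)] = offset

    def __iter__(self):                           # insertion (or sorted) order
        return iter(self.entries)
```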
If your hash map uses chaining, then you can weave an extra doubly linked list through your entries (see OpenJDK's LinkedHashMap for a pretty readable open source example).
Though, the downside is that I do have less incentive to protect myself if I'm in malaria/dengue/etc. areas.