- What do you think is the bottleneck?
- > The mobile app wasn't complex (literally only does the things outlined above) and I've done enough mobile development and graphics/computer vision development before that the stack and concepts involved weren't completely unknown, just the specifics of the various iOS APIs and how to string them together - hence why I initially thought it would be a good use case for AI.
>
> It was also an incredible coincidence that the toy app I wanted to build had an apple developer tutorial that did almost the same thing as what I was looking to build, and so yes, I clearly would have been better off using the documentation as a starting point rather than the AI.
Ok. I have done something similar, too. For example, when starting a new Django project, I would rather copy an old project as a basis than create a new one from scratch with an LLM.
If there already exists full documentation or a repo of exactly what you are trying to do, and/or it is something you have already done many times, then an LLM might not add much value, and may even be a hindrance.
- For example, I just got: "I've identified the core issue - one of the table cells is evaluating to None, causing the "Unknown flowable type" error. This requires further debugging to identify which specific cell is problematic."
- What I meant by "not treating the LLM as a senior" is that the disillusionment phase culminates in an a-ha moment which could be described as "the LLM is not a senior developer". This a-ha moment is not intellectual, but emotional. It is possible to think that the LLM is not a senior developer and at the same time not realize it emotionally. This emotional realization in turn has consequences.
> The threading issue didn't come up at all in any of this.
>
> Once it came, the AI tied itself in knots trying to sort it out, coming up with very complex dispatching logic that still got things incorrect.
Yes. These kinds of loops have happened to me as well. It sometimes requires clearing the context plus some inventive step to help the LLM out of the loop. For example, my ad pacing feature required me to recognize that it was trying to optimize the wrong variable. I consider this to be partly what I mean by "the LLM is a junior" and "I act as the project manager".
> I guess my point is that the amount of effort involved to use English to direct and correct the AI often outweighs the effort involved to just do it myself.
Could you really have done a complex mobile app alone in one day without knowing the stack well beforehand? I believe this kind of thing used to take a competent team months not long ago. I certainly could not have done one year ago what I can do today with these tools.
- > I realized the Claude version had a threading issue waiting to happen that was explicitly warned against in the docs of the api calls it was using.
I am reading between the lines here, trying genuinely to be helpful, so forgive me if I am not on the right track.
But based on what you write, it seems to me you might not have really gone through the disillusionment phase yet. You seem to be assuming the models "understand" more than they really are capable of understanding, which creates expectations and then disappointment. It seems to me you are still expecting CC to work at the level of a senior professional in various roles, instead of assuming it is a junior professional.
I would probably have approached that iOS app by first investigating various options for how the app could be implemented (especially as I don't have a deep understanding of the tech), and then exploring each option to understand for myself which one is best.
The options in your example might be the Apple documentation page. Or it might be some open source repo that contains something that could be used as a starting point, etc.
Then I would have asked Claude to create a plan to implement the best option.
During either the option selection or planning, the threading issue would either come up or not. It might come up explicitly, in which case I could learn it from the plans. It might be implicit, just included in the generated code. Or it might not be included in the plans or in the code, even if it is explicitly stated in the documentation. If the suggested plan were based on that documentation, then I would probably read it myself too, and might have seen the warning.
When reviewing the plan, I can use my prior knowledge to ask whether that issue has been taken into account. If not, then Claude would modify the plan. Of course, if I did not know about the threading issue beforehand, did not have the general experience with the tech to suspect such an issue, and did not read the documentation and see the recommendation, I could not have found the issue myself either.
If the issue is not found during planning or programming, it would arise at a later stage, hopefully while unit/system testing the application, or in pilot use. I have not written complex iOS apps personally, so I might not have caught it either -- I am not senior enough to guide it. I would ask it to plan how to comprehensively test such an app, to learn how it should be done.
What I meant by standard SWE practices is that there are various stages (requirements, specification, design, programming, testing, pilot use) where the solution is reviewed from multiple angles, so it becomes likely that this kind of issue is caught. The best practices also include iteration: start with something small that works. For example, first an iOS application that compiles, shows "Hello, world", and can be installed on your phone.
In my experience, CC cannot be expected to independently work as a senior professional in any role (architect, programmer, test manager, tester, pilot user, product manager, project manager). A junior might not take into account all instructions or guidance even if it is explicit. But CC can act as a junior professional in any of these roles, so it can help a senior professional get the 10x productivity boost in any of these areas.
By the project manager role, I mean that I explicitly take CC through the various SWE stages and make sure they have been done properly, and also that I iterate on the solution. At each stage, I take the role of the respective senior professional. If I cannot do that yet, I try to learn how. At the same time, I work as a product manager/owner as well, making decisions about the product based on my personal "taste" and requirements.
- My question was misleading. For me, Claude Code sometimes seems to stop too often, at a random point, instead of asking something or just keeping going. I guess that is the point of the linked article: that Codex works differently in this regard.
- How do you get it to sometimes stop and ask you something while it is doing its thing?
- > It was so impressed with its new performance it started adding rocket ship emojis to the output summary.
I laughed more at this than I probably should have, out of recognition.
- > This is step 3 of “draw the rest of the owl” :-)
Fair enough :-)
This reminds me about pigeon research by Skinner. Skinner placed hungry pigeons in a "Skinner box" and a mechanism delivered food pellets at fixed, non-contingent time intervals, regardless of the bird's behavior. The pigeons, seeking a pattern or control over the food delivery, began to associate whatever random action they were performing at the moment the food appeared with the reward.
I think we humans have similar psychology, i.e. we tend to build superstitions around whatever we were doing when we got a reward, if the rewards arrive at random intervals.
To me it seems we are at a phase where what works with LLMs (the reward) is still quite random, but it is psychologically difficult for us to admit it. Therefore we invent various kinds of theories of why something appears to work, which are closer to superstitions than to real repeatable processes.
It seems difficult to generalize repeatable processes of what really works, because it depends on too many things. This may be the reason why you are unsuccessful when following these descriptions.
But while it seems less useful to work from theories of what works, I have found -- despite my initially skeptical attitude -- that LLMs can be a huge productivity boost. It really depends on the context, though.
It seems you just need to keep trying various things, and eventually you may find out what works for you. There is no shortcut where you just read a blog post and then you can do it.
Things I have tried successfully:

- modifying existing large-ish Django projects, adding new apps to them. It can sometimes use Django components & HTMX/AlpineJS properly, but sometimes starts doing something else. One app uses tenants, and the LLM appears to constantly struggle with this.
- creating new Django projects -- this was less successful than modifying existing projects, because the LLM could not imitate existing practices
- Apple Swift mobile and watch applications. This was surprisingly successful. But these were not huge apps.
- a Python GUI app, which was more or less successful
- GitHub Pages static web sites based on certain content
I have not copied any CLAUDE.md or other files. Every time Claude Code does something I don't appreciate, I add a new line. Currently it is at 26 lines.
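To give a flavor, here are a few invented lines of the kind that accumulate in such a file (these are hypothetical examples, not my actual 26 lines):

```
# CLAUDE.md (hypothetical excerpt)
- Do not run database migrations without asking first.
- Follow the existing HTMX/AlpineJS patterns; do not introduce new JS frameworks.
- Run the test suite after every change and report failures verbatim.
```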
I have made a few skills. They mostly exist so that it can work independently in a loop, for example testing something that does not work.
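As an illustration, a skill is just a markdown file with a short frontmatter. Something like this invented test-fixing loop (the contents are made up; the layout follows the `.claude/skills/<name>/SKILL.md` convention):

```
---
name: fix-failing-tests
description: Re-run the failing tests, apply a minimal fix, repeat until green.
---
Run the test suite. After each failure, read the traceback, make the
smallest change that could plausibly fix it, and re-run. Stop after ten
iterations or when all tests pass, and summarize what was changed.
```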
Typically I try to limit the technologies to something I know really well. When something fails, I can often quickly figure out what is wrong.
I started with the basic plan (I guess it is that $30/month). I only upgraded to $100 Max and later to $180 2xMax because I was hitting limits.
But the reason I was hitting limits was that I was working on multiple projects in multiple environments at the same time. The only difference I have seen between the plans is the limits; I have not seen any difference in quality.
- For me, it has gone through stages.
Initially I was astounded by the results.
Then I wrote a large feature (ad pacing) on a site using LLMs. I learned that the LLMs did not really understand what they were doing. The algorithm (a PID controller) itself was properly implemented (as there is plenty of data to train on), but it was trying to optimize the wrong thing. There were other similar findings where the LLM was making very stupid mistakes. So I went through a disillusionment stage and kind of gave up for a while.
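For readers unfamiliar with the pattern: the controller itself is textbook code, and that part the LLM got right; the design decision it got wrong was which variable to feed it. A minimal Python sketch, with made-up gains and a hypothetical budget-pacing setup (not my actual system):

```python
class PIDController:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# The crucial modeling choice: pace on the fraction of budget spent
# versus the fraction of the day elapsed -- not on, say, raw click rate.
pid = PIDController(kp=1.2, ki=0.1, kd=0.05)
spend_fraction = 0.30   # 30% of today's budget spent...
time_fraction = 0.50    # ...but 50% of the day has elapsed -> under-delivering
adjustment = pid.update(setpoint=time_fraction, measured=spend_fraction, dt=1.0)
print(f"scale delivery rate by ~{1 + adjustment:.2f}x")
```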
Since then, I have learned how to use Claude Code effectively. I have used it mostly on existing Django code bases. I think everybody has a slightly different take on how it works well. Probably the most reasonable advice is to just keep going and try different kinds of things. Existing code bases seem easier, as does working from a spec beforehand, requiring tests, and other basic SWE principles.
- I think this is similar in other fields, and appears to be related to self-esteem. Some junior (and sometimes even senior) developers may have a hard time accepting improvements to their design and code. If you identify with your code, you may be unwilling to listen to suggestions from others. If you are confident, you will happily consider suggestions from others and are able to admit that anything can be improved, and that others can give you valuable insight.
Similarly, it appears that some doctors are willing to accept that they have a limited amount of time to learn about specific topics, and that a research-oriented and intelligent patient very interested in a few topics can easily know more about them. In such a case a conducive mutual learning experience may happen.
One doctor told me that what he is offering is statistical advice, because some diseases may be very rare and so it makes sense to rule out more common diseases first.
Other doctors may become defensive if they have the idea that the doctor has the authority and patients should just accept that.
- > You can give them a consistent snapshot quite easily.
How would you do that in a standard event sourcing system where data originates from multiple sources?
- That is a fair point, although I believe the misleading name is contributing to the confusion.
- Having worked as the lead architect for bank core payment systems, I would say the multiple-bank scenario is a special case that is way too complex for the purpose of these discussions.
It is a multi-layered process that ultimately makes it very probable that the state of a payment transaction is consistent between banks, involving reconciliation processes, manual handling of failed transactions over an extended time period if the reconciliation fails, settlement accounts for each of the involved banks, and sometimes even central banks for instant payments.
But I can imagine scenarios where even those fail to make the transaction state globally consistent. For example, a catastrophic event destroys a small bank's systems, the bank has failed to take off-site backups, and one payment has some hiccup so that the receiving bank cannot know what happened with the transaction. They would probably assume something.
- I know. That is why it is a useful way to think about it: it is both true and makes you think.
- It might arise from necessity, but what I see in practice is that even senior developers deprioritize consistency on platforms and backends, apparently just because scalability and performance are so fashionable.
That pushes the hard problem of maintaining a consistent experience for the end users to the frontend. Frontend developers are often less experienced.
So in practice you end up with flaky applications, and frontend and backend developers blaming each other.
Most systems do not need "webscale". I would challenge the idea that "eventual consistency" is an acceptable default.
- It is highly relevant in many contexts. I see in my work all the time that developers building frontends on top of microservices believe that because the system is called "eventually consistent", they can ignore consistency issues in references between objects, causing flaky apps.
- Branches, commits and merges are the means by which people manually resolve conflicts, so that a single repository can be used to see a state where revisions step forward in perfect lockstep.
In many branching strategies this consistent state is called "main". There are alternative branching strategies as well. For example, the consistent state could be a release branch.
Obviously that does not guarantee ordering across repos, hence the popularity of "monorepo".
Different situations require different solutions.
- Both of your statements are true.
But in practice we are rarely interested in single writes when we talk about consistency, but rather in the consistency of multiple writes ("transactions") to multiple systems such as microservices.
It is difficult to guarantee consistency by stopping writes, because whatever enforces the stopping typically does not know at what point all the writes that belong together have been made.
If you "stop the writes" for sufficiently long, the probability of inconsistencies becomes low, but it is still not guaranteed to be nonexistent.
For instance, in bank payment systems, end-of-day consistency is handled by a secondary process called "reconciliation", which makes end-of-day conflicts so improbable that any remaining conflict is handled by a manual tertiary process. And then there are agreed timeouts for multi-bank transactions etc., so that payments ultimately end up in a consistent state.
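As a toy illustration of what reconciliation means mechanically (the record shapes and names are my assumptions, not any real bank's format), the core of it is just comparing two ledgers and escalating the differences:

```python
def reconcile(ours: dict[str, int], theirs: dict[str, int]) -> list[str]:
    """Return ids of transactions whose amounts (in cents) disagree or are missing."""
    return [tx for tx in ours.keys() | theirs.keys()
            if ours.get(tx) != theirs.get(tx)]

our_ledger = {"tx-1": 1500, "tx-2": 2000}
their_ledger = {"tx-1": 1500, "tx-3": 990}
for tx in reconcile(our_ledger, their_ledger):
    print(f"{tx}: hand off to the manual tertiary process")
```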
- I have no beef with the academic, careful definitions, although I dislike the practice where academics redefine colloquial terms more formally. That actually causes more confusion, not less. I was talking about the colloquial use of the term.
If I search for "eventual consistency", the AI tells me that one of the cons for using eventual consistency is: "Temporary inconsistencies: Clients may read stale or out-of-date data until synchronization is complete."
I see time and time again, in actual companies with "modern" business systems based on microservices, that developers can state the same idea but have never actually paused to think that something is needed to do the "synchronization". Then they build web UIs that just ignore this, causing the application to become flaky.
- "... with your reasoning no complex system is ever consistent with ongoing changes. From the perspective of one of many concurrent writers outside of the database there’s no consistency they observe."
That was kind of my point. We should stop calling such systems consistent.
It is possible, however, to build a complex system, even with "event sourcing", that has consistency guarantees.
Of course, your comment has the key term "outside of the database". You will need to either use a database or build a homegrown system that has features similar to what databases provide.
One way is to pipe everything through a database that enforces the consistency. I have actually built such an event sourcing platform.
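A minimal sketch of that idea, assuming a single append-only events table (the schema and the choice of sqlite are illustrative, not the actual platform): events that belong together are written in one transaction, so any reader sees all of them or none.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events ("
           "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
           "stream TEXT, type TEXT, payload TEXT)")

def append_atomically(events: list[tuple[str, str, str]]) -> None:
    # One ACID transaction for the whole group of related events.
    with db:
        db.executemany(
            "INSERT INTO events (stream, type, payload) VALUES (?, ?, ?)",
            events)

append_atomically([
    ("customer-42", "CustomerCreated", '{"name": "Acme"}'),
    ("contract-7",  "ContractSigned",  '{"customer_id": 42}'),
])
# A reader that queries up to a given seq gets a consistent snapshot:
# the contract can never appear without its customer.
```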
A second way is to have a reconciliation process that guarantees consistency at a certain point in time. For example, bank payment systems use reconciliation to achieve end-of-day consistency. Even those are not really "guaranteed" to be consistent; inconsistencies are just sufficiently improbable that they can be handled manually and with agreed timeouts.
- I have worked with core systems in several financial institutions, and have built several event sourcing production systems used as the core platform. One of these event sourcing systems actually provided real consistency guarantees.
Based on my experience, I would recommend against using bank systems as an example of event sourcing, because they are actually much more complex than what people typically mean when they talk about event sourcing systems.
Bank systems cannot use normal event sourcing exactly because of the problem I describe. They have various other processes, such as "reconciliation", to make consistency sufficiently probable (needed for bank statements, for example).
Even those do not actually "guarantee" anything; you need a tertiary manual process to fix any inconsistencies (some days after the transaction). There are also timeouts agreed between banks to eventually resolve, over several weeks, any inconsistencies related to cross-bank payments.
In practice this means that the bank statements for the source account and the target account may actually be inconsistent with each other, although such cases are so rare that most people never encounter them.
- Yes, I agree. I don't really believe we can change the terminology. But maybe we can get some people to at least think about the consistency model when using the term.
- I was not trying to "define" eventually consistent, but to point out that people typically use the term quite loosely, for example when referring to the state of the system-of-systems of multiple microservices or event sourcing.
Those are never guaranteed to be in a consistent state in the sense of the C in ACID, which means it becomes the responsibility of the systems that use the data to handle consistency. I see this often ignored, causing user interfaces to be flaky.
- My point was kind of tongue-in-cheek. Like the other comment suggests, I was talking about how people actually use the term "eventually consistent", for example to refer to a system-of-systems of multiple microservices or event sourcing systems. It is possible to define and use the terms more exactly, as you suggest, and I have no problem with that kind of use. But even if you use the terms carefully, most people do not, meaning that when you talk about these systems using the misunderstood terms, people may misunderstand you although you are careful.
- Your example is too simple to show the problem with the "eventual consistency" as people use the term in real life.
Let's say you have two systems, one containing customers (A) and the other containing contracts for those customers (B).
Now you create a new contract by first creating the customer in system A and then the contract in system B.
It may happen that the web UI shows the contract in system B, which refers to the customer by id (in system A), but that customer becomes visible in system A slightly later.
The web UI either has to be built to manage the situation where fetching the customer by id may temporarily fail -- or accept the risk, given that such cases are rare, and just throw an error.
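In code, the first option amounts to something like this sketch (the fetch function, timings, and record shapes are hypothetical):

```python
import time

CUSTOMERS: dict[int, dict] = {}  # stand-in for system A, which may lag behind B

def fetch_customer(customer_id: int) -> dict | None:
    return CUSTOMERS.get(customer_id)  # may be None right after creation

def contract_with_customer(contract: dict, retries: int = 3, delay: float = 0.2):
    for _ in range(retries):
        customer = fetch_customer(contract["customer_id"])
        if customer is not None:
            return contract, customer
        time.sleep(delay)  # reference not visible yet: wait and try again
    raise LookupError("customer not visible yet; show a soft error, not a crash")
```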
If a system were actually "eventually consistent" in the sense you use the term, it would be possible for the web UI to get a guarantee from the system-of-systems to fetch objects in a way that it would see either both the contract and the customer info, or neither.
- I think we should stop calling these systems eventually consistent. They are actually never consistent. If the system is complex enough and there are always incoming changes, there is never a point in time at which these "eventually consistent" systems are in a consistent state. The problem of inconsistency is pushed to the users of the data.
- My feeling about GitHub Pages is that it is not unreasonable to just forget about the site. With any cheap shared hosting I would psychologically feel the need to monitor the site somehow, periodically check that the credit card works, etc.
For me, there is a large relative (percentage-wise) difference in the perceived cognitive load. Perhaps not a huge absolute difference, but when you are running tens of projects, everything counts.
Now I am not talking about actual reality but the psychological effect. It might be that some shared hosting site is in fact more reliable than GitHub Pages.
Obviously, a blog that you just forget is not that useful, but the last site I created using this method was an advertisement site for a book. I have several blogs where I write occasionally.
- Reliability? It took me around 15 minutes to create a site with Claude Code using GitHub Pages with a custom domain, and somebody else takes care that it is always running. What is the alternative?
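For context, the moving parts are tiny; the repo looks roughly like this (the domain is a placeholder, and you also point the domain's DNS at GitHub per their docs and enable Pages in the repo settings):

```
my-site/
├── index.html   # the static site
└── CNAME        # one line: www.example.com
```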