> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
This is absolutely not what you are doing, which means what you have here is not a robot. What you have here is a user agent, so you don't need to pay attention to robots.txt.
If what you are doing here counted as robotic traffic, then so would:
* Speculative loading (the browser guesses what you're going to load next and fetches it in advance for faster load times).
* Reader mode (the browser transforms the page, stripping out loads of content you don't want and presenting only the minimum you actually wanted to read).
* Terminal-based browsers (these don't render images or run JavaScript, thus bypassing advertising; by some of the justifications offered here, that would make them robots because they bypass monetization).
The fact is that the web is designed to be navigated by a diverse array of different user agents that behave differently. I'd seriously consider imposing rate limits on how frequently your browser acts so you don't knock over a server—that's just good citizenship—but robots.txt is not designed for you and if we act like it is then a lot of dominoes will fall.
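To be concrete, a per-domain limiter is only a few lines. Here's a minimal sketch in Python, assuming a hypothetical agent loop that fetches pages through a `polite_get` helper; the two-second budget and the `ExampleAgent` UA string are made up for illustration, not anything this browser actually ships:

```python
import time
from urllib.parse import urlparse

import requests  # assumed HTTP client; any would do

# Illustrative per-domain limit: at most one request every 2 seconds
# to the same host, which is roughly human browsing speed.
MIN_INTERVAL = 2.0
_last_hit: dict[str, float] = {}

def polite_get(url: str) -> requests.Response:
    host = urlparse(url).netloc
    wait = MIN_INTERVAL - (time.monotonic() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)  # back off instead of hammering the server
    _last_hit[host] = time.monotonic()
    return requests.get(url, headers={"User-Agent": "ExampleAgent/0.1"})
```

One request every couple of seconds per host is about what a human clicking around produces, which is the point.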
Maybe some new standards, or user-configurable per-site permissions, could make this better?
I'm curious to see how this turns out.
Why? My user agent is configured to make things easier for me and allow me to access content that I wouldn't otherwise choose to access. Dark mode allows me to read late at night. Reader mode allows me to read content that would otherwise be unbearably cluttered. I can zoom in on small text to better see it.
Should my reader mode or dark mode or zoom feature have to respect robots.txt because otherwise they'd allow me to access content that I would otherwise have chosen to leave alone?
I know it's not completely true: reader mode can help you bypass the ads _after_ you already had a peek at the cluttered version, but if you need to go to the next page or something like that you have to disable reader mode again, and so on, so it's a very granular form of ad blocking, while many AI use cases are about bypassing human viewing entirely. The other thing is that reader mode is not very popular, so it's not a significant threat.
*or other links on their websites, or informative banners, etc.
As a user, the browser is my agent. If I'm directing an LLM to do something on a page in my browser, it's not that much different than me clicking a button manually, or someone using a screen reader to read the text on a page. The browser is my user agent and the specific tools I choose to use in my browser shouldn't be forbidden by a webpage. (that's why to this day all browsers still claim to be Mozilla...)
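For instance, a current Chrome User-Agent header still leads with the Mozilla token (a real-world format, lightly abbreviated here):

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
```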
(This is very different than mass scraping web pages for training purposes. Those should absolutely respect robots.txt. There's a big difference between a user operated agentic-browser interacting with a web page and mass link crawling.)
If any type of AI-based assistance is supposed to adhere to robots.txt, then would you also say that AI-based accessibility tools should refuse to work on pages blocked by robots.txt?
If your browser behaves, it's not going to be excluded in robots.txt.
If your browser doesn't behave, you should at least respect robots.txt.
If your browser doesn't behave, and you continue to ignore robots.txt, that's just... shitty.
No, it's common practice to allow Googlebot and deny all other crawlers by default [0].
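The pattern in question (the generic form, not a file copied from any particular site) is a robots.txt like:

```
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```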
This is within their rights when it comes to true scrapers, but it's part of why I'm very uncomfortable with the idea of applying robots.txt to what are clearly user agents. It sets a precedent where it's not inconceivable that we have websites curating allowlists of user agents like they already do for scrapers, which would be very bad for the web.
[0] As just one example: https://www.404media.co/google-is-the-only-search-engine-tha...
I am not sure I agree that an AI-aided browser that scrapes sites and aggregates that information is "clearly" a user agent.
If this browser were to gain traction and end up being abusive to the web, that would be bad too.
Where do you draw the line of crawler vs. automated "user agent"? Is it a certain number of web requests per minute? How are you defining "true scraper"?
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
To me "recursive" is key—it transforms the traffic pattern from one that strongly resembles that of a human to one that touches every page on the site, breaks caching by visiting pages humans wouldn't typically, and produces not just a little bit more but orders of magnitude more traffic.
I was persuaded in another subthread that Nxtscape should respect robots.txt if a user issues a recursive request. I don't think it should if the request is "open these 5 subreddits and summarize the most popular links uploaded since yesterday", because the resulting traffic pattern is nearly identical to what I'd have done by hand (especially if the browser implements proper rate limiting, which I believe it should).
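To make the distinction concrete, here's a rough Python sketch of the two traffic patterns; the function names and the naive link extraction are illustrative assumptions, not how any particular browser works:

```python
import re
import time
from urllib.parse import urljoin

import requests

def fetch(url: str) -> str:
    time.sleep(2)  # crude politeness delay, per the rate-limit point above
    return requests.get(url, headers={"User-Agent": "ExampleAgent/0.1"}).text

# Pattern 1: a user-directed agent. Traffic is bounded by what the user
# explicitly asked for -- a handful of pages, much like manual browsing.
def summarize_pages(urls: list[str]) -> list[str]:
    return [fetch(u) for u in urls]  # 5 URLs in -> 5 requests out

# Pattern 2: a robot in the robots.txt sense. Every page fetched yields
# more pages to fetch; traffic scales with the size of the site, not
# with the user's request.
def crawl(start: str, limit: int = 1000) -> dict[str, str]:
    seen: dict[str, str] = {}
    frontier = [start]
    while frontier and len(seen) < limit:
        url = frontier.pop()
        if url in seen:
            continue
        seen[url] = fetch(url)
        # naive link extraction; real crawlers parse HTML properly
        for href in re.findall(r'href="([^"#]+)"', seen[url]):
            frontier.append(urljoin(url, href))
    return seen
```

The first function's request count is fixed by the user; the second's grows until it has touched everything reachable, which is exactly the behavior the robots.txt definition quoted above describes.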
But I wonder if it matters, if the agent is mostly being used for "human" use cases and not scraping?