Describing "2011–2017" as "the dark ages" makes me feel so old.
There was a ton of this stuff before Chrome or WebKit even existed! Back in my day, we used Selenium and hated it. (I was lucky enough to start after Mercury...)
Hi! Sorry, I was trying to be a bit tongue in cheek here. This space, in my experience, has always been frustrating, because it's a hard problem. I myself am fighting with Playwright these days, just like I used to fight with Selenium. (And, to my understanding, you created Selenium due to frustrations with Mercury, hence the name... I'm curious if that's true or just something I heard!)
I still deeply appreciate these tools, even though I also find them a bit frustrating.
it's all good, man. if it makes you feel better, i don't like rust. ;-) my eldest son loves it, though!
fun-fact: i've never used mercury. when i came up with "selenium" -- it was because a colleague saw an early demo and said it had the potential to "kill mercury". (spoiler alert!)
but in that moment, i hadn't heard of mercury before, so i had to google it. i then also spent a few extra cycles googling around for a "cure for mercury poisoning" just so i could continue the conversation with that colleague with a proto-dad-joke... and landed on a page about selenium supplements. things obviously got out of hand.
i didn't want to call the project "selenium". i preferred the name "check engine", but people started calling it "selenium" anyway. i only wish nice things for the mercury team -- the only thing i know about them is that hp acquired mercury for $4.5B. so i hope they blissfully don't care about me or my bad dad-jokes.
but again... i didn't realize there was an entire testing tools industry at that moment. all i knew was that i had a testing problem for my complicated web app -- and the consensus professional advice at the time was "yeah, no. don't use javascript in the browser -- it's too hard to test". (another spoiler.) also, (if i'm remembering correctly) mercury was ie/windows only... and i needed something that supported apple and mozilla/firefox. it felt like zero vendors at the time cared about anything that wasn't internet explorer or wasn't windows. so i had to chart my own course pretty quickly.
long story long: "you either die a hero, or you live long enough to see yourself become the villain" - harvey dent
> it's all good, man. if it makes you feel better, i don't like rust. ;-) my eldest son loves it, though!
Ha! Yeah, it's no worries at all, I think it's fine to not like things. Everybody is different. And for these sorts of things, it's kind of a "there are two kinds of tools, the ones people complain about, and the ones they don't use" sort of situation: if I didn't think it was valuable, I just wouldn't use it. But it's valuable enough to use despite the griping at times.
Wow hi! Thanks so much for building selenium! I've used it many times in my career, and I looked at Selenium Grid for inspiration for browser devops in my last job.
Scrolling to an element doesn’t always work because somehow the element might not be ready. You need to add ids to the element and select by that to ensure it works properly.
thanks! yeah, playwright was a huge improvement there -- waiting until an element was actually ready. the official posture from the selenium project ("figure it out, be explicit") wasn't always the most user friendly messaging.
having to add ids to elements is one of those classic tradeoffs -- the alternative was to use css or xpath selectors, which can be even worse, maintenance-wise. i'm secretly hoping ai code-gen apps pumped out by things like Lovable or Claude Code automagically generate element test-ids and the tests for you and we never have to worry about it again.
i'm at the edge of my chrome internals knowledge here, but i'd answer the question with a question: isn't `backendNodeId` only stable within a single session?
that might not matter if the agent is re-finding the element between sessions anyway, but then you're paying a lookup cost (time + tokens) each time, compared to just using `document.getElementById()` on an explicit id.
iirc it's stable across sessions until the tab closes, even though their docs don't guarantee it.
we can't modify the dom to add IDs because we'd get detected by block-blockers very quickly. we're gradually trying to get rid of all DOM tampering entirely for that reason.
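To make the tradeoff above concrete, here is a rough sketch of the two lookups as raw CDP messages. `DOM.resolveNode` and `Runtime.evaluate` are real CDP methods; the helper names, the message ids, and the example node id / element id values are made up for illustration. Nothing here settles how long a `backendNodeId` actually stays valid.

```python
import json


def cdp_message(message_id: int, method: str, params: dict) -> str:
    """Serialize one CDP command as the JSON text sent over the connection."""
    return json.dumps({"id": message_id, "method": method, "params": params})


def resolve_by_backend_node_id(message_id: int, backend_node_id: int) -> str:
    """Look up an element through a previously recorded backendNodeId (no DOM changes)."""
    return cdp_message(message_id, "DOM.resolveNode", {"backendNodeId": backend_node_id})


def resolve_by_element_id(message_id: int, element_id: str) -> str:
    """Look up an element through an explicit id attribute that already exists in the page."""
    # json.dumps quotes and escapes the id so it is a valid JavaScript string literal.
    expression = f"document.getElementById({json.dumps(element_id)})"
    return cdp_message(message_id, "Runtime.evaluate", {"expression": expression})


if __name__ == "__main__":
    print(resolve_by_backend_node_id(1, 42))       # hypothetical node id
    print(resolve_by_element_id(2, "submit-btn"))  # hypothetical element id
```

The first form needs no changes to the page but depends on an id the agent recorded earlier; the second only works if the page already ships a stable id.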
that's what i thought. :) personal life accomplishment was seeing wikipedia add a disambiguation link on the element's page. you know, because it's right up there in importance as the periodic table, obviously.
2011 was definitely not the dark ages!! I used to use Selenium for everything back in the day. I was able to scrape all of Wikipedia in 2011 entirely on my laptop and pipe it to Stanford NLTK to create a very cool adjective recommender for nouns.
direct CDP has been used by the scraping community for a long time in order to have a cleaner browser environment that is harder to fingerprint. for example nodriver (https://github.com/ultrafunkamsterdam/nodriver) was started in Feb 2024 and I suspect this technique was popular before that project started.
I really like both nodriver and pydoll. I am definitely keeping the option of switching to them open, but we just wanted to have full control for now and see how painful CDP-use is to maintain first and then reconsider.
Nice thorough write up, I've had my share of annoyances with playwright for automating some menial tasks due to being blocked by captcha or other waf (I'm just logging into my own accounts and scraping my account balance, nothing nefarious), I'll try out pydoll or your library next time.
this is exactly what I did when I wrote my first agent with scraping. later we switched to taking control of the users browser through a browser extension.
What do you mean? We use CDP page snapshots extensively to get full html across frames but it's not nearly enough on its own, there are lots of checks still needed for individual OOPIFs or elements.
you still need separate calls to get the AXTree with full computed aria properties, and a bunch of Runtime.evaluate calls to scan all the dynamically-added event listeners.
i like that the post uses the phrase "time is a flat circle". it is indeed. once upon a time, most devs only cared about one browser -- internet explorer. then for a good chunk of time, cross-browser compatibility was highly valued. now, most devs only care about one browser -- google chrome.
it's a bummer, but also a market reality... the best way to get more devs to care about non-chrome browsers is to get more people to use non-chrome browsers. easier said than done, though.
> All of the approaches of driving the browser outside of the browser is going to be slow
Why? I would think any cross-process communication through the CDP websocket would have imperceptible overhead compared to what already takes long in the browser: a ton of HTTP I/O
What is Karma? What are you executing in the browser?
It was, but I feel like the advent of headless browsers marked a step function explosion in browser automation. Also any earlier than 2010 is when I was like 13yo, so it's more like "the dark ages in my own memory" than "objectively dark ages in automation history".
I get that drawing historical boundaries is arbitrary, but Selenium is a really good prior.
Selenium offered headless mode and integrated with 3rd party providers like BrowserStack, which ran acceptance tests in parallel in the cloud. It seems like what browser-use.com is doing is a modern day version with many more features & adaptability.
Sauce Labs is excellent. I've actually used it extensively myself (not sure why BrowserStack came to mind first). I remember Sauce Labs was super active in the SF Selenium community and the Selenium meetups. Just checked my emails. Good memories.
Talk about "not built here" mentality. This is a project doomed to failure. Using VC money to re-write better built software which has been around for years.
Can you please make your substantive points without snark or putdowns? Thoughtful criticism is fine, of course, but what you posted here goes against what we're trying for in this community.
From their blog the value isn't obvious, but pure CDP as a framework is powerful for other reasons. If you have very high performance requirements it makes sense.
I built something like an automation system on pure CDP to shave ms off. But mine is a real-time user interaction system plus automation, not pure AI automation.
Doesn't make much sense to shave ms when an LLM call is hundreds of ms and that's the only "user"
Exactly what I was thinking. Instead of attempting to contribute back to Playwright to fix those hangups, or even creating a private patch to do so as a POC, they went right to building their own framework from scratch.
I've been trying to contribute to playwright for years! All of my issues have been closed / rejected without much consideration because they're not part of the core "QA testing" use-case that playwright is built for.
Personally I have not found their team to be the easiest to work with on GitHub. I would've loved to use puppeteer instead, their team is quite reasonable, but they abandoned their python bindings and we want to stay in python.
re: ms -- thank you for calling that out. i've been thinking we had been collectively sleepwalking into ms owning everything (again). they've owned everything once before -- it wasn't great!
related side-note: have you had to interact with the core chrome / cdp devs?
Chromium bug tracker is where issues go to die, but aside from that I've had nothing but lovely individual interactions with core chrome devs so far. The devtools frontend/protocol repo is definitely active and more approachable than Chromium itself.
I have not spoken to people that work directly on CDP yet, but I believe we have a call with them soon!
I mean... Playwright was built and is maintained by Microsoft, so I don't think VC money argument really makes sense here.
By the very nature of how Playwright is built we can't contribute to it - it runs inside a JS subprocess and does not expose a bunch of CDP APIs that we NEED (for example to make cross-origin iframes work).
I made this comment yesterday but it really applies to this conversation.
> In the past 3 weeks I ported Playwright to run completely inside a Chrome extension without Chrome DevTools Protocol (CDP), using purely DOM APIs and Chrome extension APIs. I ported a TypeScript port of Browser Use to run in a Chrome extension side panel using my port of Playwright. In 2 days I ported Selenium ChromeDriver to run inside a Chrome extension using chrome.debugger APIs, which I call ChromeExtensionDriver, and today I'm porting Stagehand to also run in a Chrome extension using the Playwright port. This follows using VSCode's core libraries in a Chrome extension and having them drive a Chrome extension instead of an Electron app.
The most difficult part is managing the lifecycle of Windows, Pages, and Frames and handling race conditions when automating a user's browser, where, for example, the user switches to another tab or closes the tab.
Extensions are ok but they have limitations too, for example you cannot use extensions to automate other extensions.
We need the agent to be able to drive 1Password, Privacy.com, etc. to request per-task credentials, change adblock settings, get 2FA codes, and more.
The holy grail really is CDP + control over browser launch flags + an extension bridge to get to the more ergonomic `chrome.*` APIs. We're also working on a custom Chromium fork.
Use an Electron app to spawn a child process to open a Chrome browser using the launch flags including `--remote-debugging-pipe` -- instead of exposing a websockets connection on port 9226 or something -- which, if coupled with `--user-data-dir=<path>`, will not show the security CDP bar warning at the top of the page as long as the user data directory is not the default user directory.
1. Get all the things you want.
2. Can create as many 'browser context' personas as you want.
3. Use the Electron app renderer for UI to manage profiles, proxies for each profile, automate making Gmail accounts for each profile, etc.
4. Forgot, it is very nice using the `--load-extension=/path/to/extension` flag to ship chrome extension files inside the Electron app bundle so that the launched browser will have a cool copilot side panel.
> Extensions are ok but they have limitations too, for example you cannot use extensions to automate other extensions.
5. If you know the extension ids it is easy to set up communication between the two. I already drive a Chrome extension using VSCode's core libraries and it would be a week or two of work to implement a light port of the VSCode host extension API but for a Chrome extension. Nonetheless, I'd rather have an Electron app to manage extensions the same way VSCode does.
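A minimal sketch of the launch flags described above. It uses Python rather than an Electron child process, only builds the argument list, and does not start a browser. The function name and the example paths are hypothetical; the three flags are the ones named in the comment, and whether a given Chrome build still honors `--load-extension` is worth checking.

```python
from pathlib import Path


def build_chrome_args(chrome_binary: str, user_data_dir: str, extension_dir: str) -> list[str]:
    """Build an argv list that exposes CDP over a pipe instead of a TCP port."""
    # Use a dedicated, non-default profile directory, as the comment recommends.
    profile = Path(user_data_dir).expanduser().resolve()
    extension = Path(extension_dir).expanduser().resolve()
    return [
        chrome_binary,
        # CDP over inherited pipes. As far as I know Chrome expects them on
        # file descriptors 3 (read) and 4 (write), which the parent must set up.
        "--remote-debugging-pipe",
        f"--user-data-dir={profile}",
        # Ship an unpacked extension (e.g. the side panel) with the launcher.
        f"--load-extension={extension}",
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for arg in build_chrome_args("chrome", "~/agent-profile", "./copilot-extension"):
        print(arg)
```

Actually spawning the process and wiring up the two pipe file descriptors is left out on purpose, since that part differs between Node, Electron, and Python.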
Yeah I started building this in my first week at the company haha: https://github.com/browser-use/desktop
Shipping a whole Electron app is not a priority at the moment though. Our revenue comes from cloud API users, and there we only need our custom Chrome fork; there's no point messing with Electron and extension bridges when we can add custom CDP commands to talk to `chrome.*` APIs directly.
I like the Chrome fork idea. I imagine in the next couple of years, hardware companies, e.g. Apple and Lenovo, will start to ship extremely powerful local inference hardware as the models become sufficient, which your browser will be able to leverage.
Yes, an Electron app helps tremendously, especially for managing the lifecycle of tabs independently. We use that for creating our AI browser automations at Donobu (https://donobu.com). However, we do have the luxury of just focusing on a narrow AI QA use case vs. Browser-Use and others who need to support broad use cases in potentially adversarial environments.
What is the benefit of porting all those tools to extensions? Have you run into any other extension-based challenges besides lifecycles and race conditions?
Some benefits (without using `chrome.debugger` or Chrome DevTools Protocol):
1. There are 3,500,000,000 instances of Chrome desktop being used. [0]
2. A Chrome Extension can be installed with a click from the Chrome Web Store.
3. It is closer to the metal so runs extremely fast.
4. Can run completely contained on the user's machine.
5. It's just one user automating their web-based workflows, making it harder for bot protections to stop, and with a human in the loop any hang-ups and snags can be solved by the human.
6. Chrome extensions now have a side panel that is stationary in the window during navigation and tab switching. It is exactly like using the Cursor or VSCode side panel copilots
Some limitations:
1. Can't automate the ChatGPT console because they check for user agent events by testing if the `isTrusted` property on event objects is true. (The bypass is using `chrome.debugger` and the ChromeExtensionDriver I created.)
2. Can't take full-page screen captures; however, it is possible to very quickly take screen captures of the visible viewport. Currently I scroll and stitch the images together if a full page screen is required. There are other APIs which allow this in a Chrome extension and can capture video and audio, but they require the user to click on some button, so they aren't useful for computer vision automation. (The bypass is once again using `chrome.debugger` and the ChromeExtensionDriver I created.)
3. Chrome DevTools Protocol allows intercepting and rewriting scripts and web pages before they are evaluated. With Manifest V2 this was possible in extensions, but they removed this ability in Manifest V3, which we still hear about today with the adblock extensions.
I feel like, given the limitations, having a popup dialog that directs the user to do an action will work as long as it automates 98% of the user's workflows. Moreover, a lot of this automation should require explicit user acknowledgment before proceeding.
[0] https://www.demandsage.com/chrome-statistics/
> Currently I scroll and stitch the images together if a full page screen is required.
Actually, I wish this was exposed as an alternative full-page screenshot method in CDP. The dev tools approach very frequently does not work with SPAs that lazy load/unload, etc.
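The scroll-and-stitch approach quoted above mostly comes down to picking scroll positions and knowing how much of the last capture overlaps the previous one. A minimal sketch, assuming a fixed page height and viewport height in pixels (the function name and the numbers are hypothetical, and taking the actual screenshots is out of scope here):

```python
def scroll_offsets(page_height: int, viewport_height: int) -> list[tuple[int, int]]:
    """Return (scroll_y, crop_top) pairs that cover the page from top to bottom.

    crop_top is the number of pixels to discard from the top of that capture
    because they were already covered by the previous capture.
    """
    if page_height <= 0 or viewport_height <= 0:
        raise ValueError("heights must be positive")
    # The browser cannot scroll past this position.
    max_scroll = max(page_height - viewport_height, 0)
    tiles: list[tuple[int, int]] = []
    covered = 0  # pixels of the page captured so far
    while covered < page_height:
        scroll_y = min(covered, max_scroll)
        tiles.append((scroll_y, covered - scroll_y))
        covered = scroll_y + viewport_height
    return tiles


if __name__ == "__main__":
    # A 2500px page in a 1000px viewport: the last capture overlaps by 500px.
    print(scroll_offsets(2500, 1000))  # [(0, 0), (1000, 0), (1500, 500)]
```

On pages that lazy load or unload content, the page height can change while scrolling, so a real implementation would likely need to re-measure the height after each scroll instead of computing all offsets up front.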
> What is the benefit of porting all those tools to extensions?
Personally, I have a browser extension running in my user/personal browser instance that my agent uses (with rate-limits) in order to avoid all the captchas and blocks basically. Everything else I've tried ultimately ends up getting blocked. But then I'm also doing some heavy caching so most agent "browse" calls end up not even reaching out to the internet as it's finding and using stuff already stored locally.
Wouldn't having chrome.debugger=true also flag your requests?
is this open source? just curious to see this. sounds fascinating!
What I have so far needs a lot of work and is flaky. Every day it is getting tighter and better.
Microsoft pulled out the lifecycle management code from Puppeteer and put it into Playwright with Google's copyright still at the top of several files. They both use CDP. I'm using the Chrome extension analogue for every CDP message and listener. I need a couple of days to remove all the code from the Page, Frame, FrameManager, and Browser classes and methodically make a state machine with it to track lifecycle and race conditions. It is a huge task and I don't want to share it without accomplishing that.
For example, there is a system that listens for all navigation requests in a Page's / Tab's Frames in Playwright. Embedded frames can navigate to URLs while the parent Frame is still loading, such as advertising resources, and all of that needs to be tracked.
There are a lot of companies that are talking about building solutions using CDP without Playwright and I'm curious how well they are going to handle the lifecycle management. Maybe if they don't intercept requests and responses it is very straightforward and simple.
One idea I have is to just evaluate '1+1' in the frame's content script in a loop with a backoff strategy, and if it returns 2 then continue with code execution, or if it times out, fail, instead of tracking hundreds of navigations with 30 different embedded frames in a page like CNN. I'm still tinkering. Stagehand calls Locator.evaluate() which is what I'm building because I haven't implemented it yet.
Yes the key is we don't intercept requests and responses, that saves 60% of the headache of lifecycle management.
We do exactly what you described with a 1+1 check in a loop for every target, it pops any crashed sessions from the pool, and we don't keep any state beyond that about what tabs are alive. We really try to derive everything fresh from the browser on every call, with minimal in-memory state.
https://github.com/browser-use/browser-use/blob/2a0f4bd93a43...
Ha, I got that idea from you! Sitting there in the back of my mind.
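The 1+1 probe discussed above could be sketched roughly like this. It is not the code from either project: `evaluate` is a hypothetical async callable standing in for whatever runs an expression in the target (a CDP call or a content-script message), and the timeout and delay values are arbitrary.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical hook: runs a JS expression in the target and returns its value.
Evaluate = Callable[[str], Awaitable[object]]


async def is_responsive(
    evaluate: Evaluate,
    timeout: float = 5.0,
    initial_delay: float = 0.05,
    max_delay: float = 1.0,
) -> bool:
    """Retry evaluating '1+1' with exponential backoff until it returns 2 or time runs out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = initial_delay
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return False  # never answered in time: treat the target as dead
        try:
            if await asyncio.wait_for(evaluate("1+1"), timeout=remaining) == 2:
                return True
        except Exception:
            pass  # mid-navigation, detached, or crashed: retry after a pause
        await asyncio.sleep(min(delay, max(deadline - loop.time(), 0)))
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    async def fake_evaluate(expression: str) -> object:
        return 2  # stand-in for a healthy target

    print(asyncio.run(is_responsive(fake_evaluate)))  # True
```

A caller could run this against every target before each action and drop the ones that return `False`, which matches the "derive everything fresh, keep minimal state" approach described above.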
Ah, yes, the classic "Playwright isn't fast enough so we're reinventing Puppeteer" trope. I'd be lying if I said I hadn't seen this done a few times already.
Now that I got my snarky remark out of the way:
Puppeteer uses CDP under the hood. Just use Puppeteer.
I've seen a team implement Go workers that would download the HTML from a target, then download some of the referenced JavaScript files, then run these JavaScript files in an embedded JavaScript engine so that they could consume less resources to get the specific things that they needed without using a full browser. It's like a browser homunculus! Of course, each new site would require custom code. This was for quant stuff. Quite cool!
This exact homunculus is actually supported in Node.js by the `jsdom` library: https://www.npmjs.com/package/jsdom
I don't know how well it would work for that use-case, but I've used it before, for example, to write a web-crawler that could handle client-side rendering.
our primary use-case with the AI stuff is not really scraping, we're mostly going after RPA
sir we are a python library, puppeteer-python was abandoned, how exactly do you propose we use puppeteer?
yeah, i continue to be amazed at how google dropped the ball on this one.
Playwright has Python bindings.
yes I know, I wrote the post
Have you considered just using Playwright? ;)
Is the case for playwright over puppeteer just in its cross-browser support?
We're currently using Cypress for some automated testing on a recent project and it's extremely brittle. Considering moving to playwright or puppeteer but not sure if that will fix the brittleness.
In my experience Playwright provided a much more stable or reliable experience with multiple browser support and asynchronous operations (which is the entire point) over Puppeteer. ymmv
I would definitely recommend puppeteer if you can, it's maintained by the Chrome team and always does things the "approved way". The only reason we did playwright is because we're a python library and pyppeteer was abandoned.
Most of the Puppeteer team left and joined Playwright under Microsoft.
They're all brittle in my experience but Playwright has a lovely test recorder and test runner which is also integrated into VSCode, and it tidies up a lot of the exceptions that would occur in puppeteer if the page state wasn't meticulously-ready for some operation.
Playwright's "trace" viewer is also fantastic providing periodic snapshots and performance debugging.
I have converted several large E2E test suites from Cypress to Playwright, and I can vouch that it is the better option. Cypress seems to work well at first, but it is extremely legacy heavy, its API is convoluted and unintuitive, and stacks a bunch of libraries/frameworks together. In comparison, Playwright's API is much more intuitive, yes you must 'await' a lot, but it is a lot easier to handle side effects (e.g. making API calls), it can all just be promises.
It is also just really easy to write a performant test suite with Playwright, it is easy to parallelize, which is terrible in Cypress, almost intentionally so to sell their cloud products, which you do not need. The way Playwright works just feels more intuitive and stable to me.
Playwright also offers nice sugar like HTML test reports and trace viewing.
From my experience with Playwright, RR-Web recordings are MUCH better than Playwright’s replay traces, so we usually just use those.
What's RR web?
https://github.com/rrweb-io/rrweb
That can be integrated with Playwright, or did you mean to say it is already used under the hood for their reports?
Gregor was saying it works without needing playwright, and provides more detailed trace recordings than playwright does.
we plan to use rr-web and maybe browsertrix for our website archival / replay system for deterministic evals.
Describing "2011–2017" as "the dark ages" makes me feel so old.
There was a ton of this stuff before Chrome or WebKit even existed! Back in my day, we used Selenium and hated it. (I was lucky enough to start after Mercury...)
selenium creator here. hi!
Hi! Sorry, I was trying to be a bit tongue in cheek here. This space, in my experience, has always been frustrating, because it's a hard problem. I myself am fighting with Playwright these days, just like I used to fight with Selenium. (And, to my understanding, you created Selenium due to frustrations with Mercury, hence the name... I'm curious if that's true or just something I heard!)
I still deeply appreciate these tools, even though I also find them a bit frustrating.
it's all good, man. if it makes you feel better, i don't like rust. ;-) my eldest son loves it, though!
fun-fact: i've never used mercury. when i came up with "selenium" -- it was because a colleague saw an early demo and said it had the potential to "kill mercury". (spoiler alert!)
but in that moment, i hadn't heard of mercury before, so i had to google it. i then also spent a few extra cycles googling around for a "cure for mercury poisoning" just so i could continue the conversation with that colleague with a proto-dad-joke... and landed on a page about selenium supplements. things obviously got out of hand.
i didn't want to call the project "selenium". i preferred the name "check engine", but people started calling it "selenium" anyway. i only wish nice things for the mercury team -- the only thing i know about them is that hp acquired mercury for $4.5B. so i hope they blissfully don't care about me or my bad dad-jokes.
but again... i didn't realize there was an entire testing tools industry at that moment. all i knew was that i had a testing problem for my complicated web app -- and the consensus professional advice at the time was "yeah, no. don't use javascript in the browser -- it's too hard to test". (another spoiler.) also, (if i'm remembering correctly) mercury was ie/windows only... and i needed something that supported apple and mozilla/firefox. it felt like zero vendors at the time cared about anything that wasn't internet explorer or wasn't windows. so i had to chart my own course pretty quickly.
long story long: "you either die a hero, or you live long enough to see yourself become the villain" - harvey dent
> it's all good, man. if it makes you feel better, i don't like rust. ;-) my eldest son loves it, though!
Ha! Yeah, it's no worries at all, I think it's fine to not like things. Everybody is different. And for these sorts of things, it's kind of a "there are two kinds of tools, the ones people complain about, and the ones they don't use" sort of situation: if I didn't think it was valuable, I just wouldn't use it. But it's valuable enough to use despite the griping at times.
Thank you for the story!
Off topic, but it's threads like these that keep me on HN. Gold :)
selenium helped my team so much back in the days. Thank you for it!
We had a complex user registration workflow that supported multiple nationalities and languages in an international bank website.
I set up selenium tests to detect breakages because it was almost humanly impossible to retest all workflows after every sprint.
It brought back sanity to the team and QA folks.
Tools that came after certainly benefitted from selenium lessons.
I tweaked the article text a bit, if anyone has more history to fill in I'd love to collect browser automation lore!
https://github.com/browser-use/website/blob/main/posts/playw...
Wow hi! Thanks so much for building selenium! I've used it many times in my career, and I looked at Selenium Grid for inspiration for browser devops in my last job.
I always enjoyed Selenium, for what it’s worth.
You have to love how the OP completely left Selenium out of their "history".
I just wanted to say I absolutely love your product. Thank you!
Hi, the first version of Browser Use was actually built on Selenium but we quite quickly switched to Playwright
yeah, i noticed that. apologies if i missed a post about it... what do you wish didn't suck about selenium?
Scrolling to an element doesn’t always work because somehow the element might not be ready. You need to add ids to the element and select by that to ensure it works properly.
thanks! yeah, playwright was a huge improvement there -- waiting until an element was actually ready. the official posture from the selenium project ("figure it out, be explicit") wasn't always the most user friendly messaging.
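The "wait until the element is actually ready" idea boils down to polling a readiness check against a deadline. Below is a minimal Python sketch of that pattern. It is not Playwright's or Selenium's actual implementation; `wait_until` and its `probe` callback are hypothetical names, and in real use `probe` would be whatever lookup your driver offers.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]], timeout: float = 5.0, interval: float = 0.05) -> T:
    """Call `probe` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)
```

Scrolling to an element would then happen only after `wait_until` returns it, rather than immediately after navigation.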
having to add ids to elements is one of those classic tradeoffs -- the alternative was to use css or xpath selectors, which can be even worse, maintenance-wise. i'm secretly hoping ai code-gen apps pumped out by things like Lovable or Claude Code automagically generate element test-ids and the tests for you and we never have to worry about it again.
what's the downside of using frameId/targetId+backendNodeId as the stable element ids?
i'm at the edge of my chrome internals knowledge here, but i'd answer the question with a question: isn't backendnodeid only stable within a single session?
that might not matter if the agent is re-finding the element between sessions anyway, but then you're paying a lookup cost (time + tokens) each time. compared to just using document.getelementbyid() on an explicit id.
iirc it's stable across sessions until the tab closes, even though their docs don't guarantee it.
we can't modify the dom to add IDs because we'd get detected by bot-blockers very quickly. we're gradually trying to get rid of all DOM tampering entirely for that reason.
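For anyone following along, here is a rough sketch of what addressing an element by targetId/frameId + backendNodeId could look like, without writing anything into the page. `DOM.resolveNode` and its `backendNodeId` parameter are real CDP; the `ElementRef` class, the function name, and the assumption that you already hold a flat-mode `sessionId` for the target are illustrative only, not browser-use's actual code.

```python
import json
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ElementRef:
    """Hypothetical handle: identifies an element without touching the DOM."""
    target_id: str
    frame_id: str
    backend_node_id: int


_ids = count(1)  # CDP requests carry a unique integer id so replies can be matched


def resolve_node_command(ref: ElementRef, session_id: str) -> str:
    """Build a DOM.resolveNode request that addresses the element by backendNodeId."""
    return json.dumps({
        "id": next(_ids),
        "sessionId": session_id,  # assumed: flat-mode session attached to ref.target_id
        "method": "DOM.resolveNode",
        "params": {"backendNodeId": ref.backend_node_id},
    })
```

Because `ElementRef` is frozen and hashable it can be used as a dictionary key on the agent side, but as noted above its lifetime is only as good as the undocumented stability of `backendNodeId`.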
Uh, ahem, <clears throat>, we meant the _other_ Selenium.
that's what i thought. :) personal life accomplishment was seeing wikipedia add a disambiguation link on the element's page. you know, because it's right up there in importance as the periodic table, obviously.
2011 was definitely not the dark ages!! I used to use Selenium for everything back in the day. I was able to scrape all of Wikipedia in 2011 entirely on my laptop and pipe it to Stanford NLTK to create a very cool adjective recommender for nouns.
Lol I came here to write this exact comment about the dark ages and selenium. I, too, feel old.
i suspect this is how vim and emacs developers feel every time someone announces a new vscode fork.
direct CDP has been used by the scraping community for a long time in order to have a cleaner browser environment that is harder to fingerprint. for example nodriver (https://github.com/ultrafunkamsterdam/nodriver) was started in Feb 2024 and I suspect this technique was popular before that project started.
I really like both nodriver and pydoll. I am definitely keeping the option of switching to them open, but we just wanted to have full control for now and see how painful CDP-use is to maintain first and then reconsider.
Nice thorough write up, I've had my share of annoyances with playwright for automating some menial tasks due to being blocked by captcha or other waf (I'm just logging into my own accounts and scraping my account balance, nothing nefarious), I'll try out pydoll or your library next time.
this is exactly what I did when I wrote my first agent with scraping. later we switched to taking control of the users browser through a browser extension.
Why not cdp snapshot?
What do you mean? We use CDP page snapshots extensively to get full html across frames but it's not nearly enough on its own, there are lots of checks still needed for individual OOPIFs or elements.
You can get all of that from a pure snapshot.
No extra checks are needed; it's by a significant margin the most reliable method to see current state.
I run snapshot at 10-20fps though plus the same for parallel image capture.
I've been wondering if I should release just this part of my system as open source; seems like I'm not alone in how complex this all is.
I could launch yet another automation framework!
you still need separate calls to get the AXTree with full computed aria properties, and a bunch of Runtime.evaluate calls to scan all the dynamically-added event listeners.
Snapshot is all you need. I get all of the info you describe in pure snapshot. Automation works fine.
you cannot get the list of bound onclick handlers with a snapshot
You also can't run it at 10-20fps
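To make the "snapshot plus separate calls" side of this disagreement concrete, here is a sketch of the requests one state capture might need. `DOMSnapshot.captureSnapshot` (with its required `computedStyles` list) and `Accessibility.getFullAXTree` are real CDP methods. For the listeners, the comment above mentions `Runtime.evaluate` scans; this sketch swaps in `DOMDebugger.getEventListeners`, which takes an `objectId` per node, purely to keep the example short. The function name and the idea of passing in pre-resolved object ids are assumptions.

```python
from itertools import count


def state_capture_requests(object_ids: list[str]) -> list[dict]:
    """Build the CDP requests for one page-state capture (hypothetical helper)."""
    ids = count(1)
    requests = [
        # One call returns the flattened DOM, layout and text across frames.
        {"id": next(ids), "method": "DOMSnapshot.captureSnapshot",
         "params": {"computedStyles": []}},
        # The accessibility tree is a separate domain and a separate call.
        {"id": next(ids), "method": "Accessibility.getFullAXTree", "params": {}},
    ]
    # Listener lookup is per node, so the request count grows with the page.
    for object_id in object_ids:
        requests.append({"id": next(ids), "method": "DOMDebugger.getEventListeners",
                         "params": {"objectId": object_id}})
    return requests
```

The per-node loop is where the call count adds up on large pages.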
Umm, will this run on Firefox too? They deprecated CDP in favor of WebDriver BiDi.
i like that the post uses the phrase "time is a flat circle". it is indeed. once upon a time, most devs only cared about one browser -- internet explorer. then for a good chunk of time, cross-browser compatibility was highly valued. now, most devs only care about one browser -- google chrome.
it's a bummer, but also a market reality... the best way to get more devs to care about non-chrome browsers is to get more people to use non-chrome browsers. easier said than done, though.
No we are not planning to support Firefox. We do support Brave, Edge, and ungoogled-chromium though if you have a problem with Google.
All of the approaches of driving the browser outside of the browser are going to be slow (webdriver, playwright, puppeteer, etc).
Karma-like approaches are where I’m at (execute in the browser)
> All of the approaches of driving the browser outside of the browser are going to be slow
Why? I would think any cross-process communication through the CDP websocket would have imperceptible overhead compared to what already takes long in the browser: a ton of HTTP I/O
What is Karma? What are you executing in the browser?
CDP roundtrip time on a local machine is 100µs (0.1ms), it's not slow haha
"Thousands of cdp calls" from the link.
Cdp does add a good chunk of latency. Depends on what your threshold is.
An image grab is around 60ms and a snapshot can range from 40ms -> 500ms
The latency is pure data movement. It's like the difference of using ram vs ssd vs data from the internet.
yeah but good luck getting rid of that with a browser extension, you're just moving the latency around / moving it to chrome.runtime message passing.
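The numbers in this sub-thread are easy to put side by side. The sketch below only does the arithmetic on the figures quoted above (100µs per local round trip, ~60ms per image grab, 40-500ms per snapshot); the function names are made up, and it assumes calls run strictly one after another, which real clients can avoid by pipelining.

```python
ROUNDTRIP_US = 100        # local CDP round trip quoted above, in microseconds
IMAGE_GRAB_MS = 60        # quoted cost of one image grab
SNAPSHOT_MS = (40, 500)   # quoted best and worst case for one snapshot


def sequential_overhead_ms(calls: int, roundtrip_us: int = ROUNDTRIP_US) -> float:
    """Protocol overhead of `calls` back-to-back round trips, in milliseconds."""
    return calls * roundtrip_us / 1000


def max_fps(frame_cost_ms: float) -> float:
    """Upper bound on frames per second if every frame pays `frame_cost_ms` serially."""
    return 1000 / frame_cost_ms
```

Under those assumptions, 1,000 sequential round trips cost 100 ms and 5,000 cost 500 ms of pure protocol overhead, while a serial snapshot loop tops out at 25 fps in the 40 ms case and 2 fps in the 500 ms case. Both commenters' points fit those figures: a single round trip is negligible, and payload-heavy calls are not.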
Selenium was very usable before 2011.
This post is like saying Grafana and not mentioning Nagios
It was, but I feel like the advent of headless browsers marked a step function explosion in browser automation. Also any earlier than 2010 is when I was like 13yo, so it's more like "the dark ages in my own memory" than "objectively dark ages in automation history".
I get that drawing historical boundaries is arbitrary, but Selenium is a really good prior.
Selenium offered headless mode and integrated with 3rd party providers like BrowserStack, which ran acceptance tests in parallel in the cloud. It seems like what browser-use.com is doing is a modern day version with many more features & adaptability.
Yeah I agree, I changed the history section a bit: https://github.com/browser-use/browser-use/blob/2a0f4bd93a43...
speaking of priors... sauce labs existed for three whole years before browserstack (selenium and sauce founder here. :-)
i like that there are new startups in the space, though. things were getting pretty stale and uninspired.
Sauce Labs is excellent. I've actually used it extensively myself (not sure why BrowserStack came to mind first). I remember Sauce Labs was super active in the SF Selenium community and the Selenium meetups. Just checked my emails. Good memories.
Thank you for building Selenium.
Talk about "not built here" mentality. This is a project doomed to failure. Using VC money to re-write better built software which has been around for years.
Good luck guys!
Can you please make your substantive points without snark or putdowns? Thoughtful criticism is fine, of course, but what you posted here goes against what we're trying for in this community.
https://news.ycombinator.com/newsguidelines.html
From their blog the value isn't obvious, but pure CDP as a framework is powerful for other reasons. If you have very high performance requirements it makes sense.
I built something like an automation system on pure CDP to shave ms off. But mine is a real-time user interaction system plus automation, not pure AI automation.
Doesn't make much sense to shave ms when an LLM call is hundreds of ms and that's the only "user"
it does when we have multiple LLMs working in parallel on a single tab, which we're working towards eventually
Exactly what I was thinking. Instead of attempting to contribute back to Playwright to fix those hangups, or even creating a private patch to do so as a POC, they went right to building their own framework from scratch.
That isn't how you launch a product.
I've been trying to contribute to playwright for years! All of my issues have been closed / rejected without much consideration because they're not part of the core "QA testing" use-case that playwright is built for.
Personally have not found their team to be the easiest to work with on Github. I would've loved to use puppeteer instead, their team is quite reasonable but they abandoned their python bindings and we want to stay in python.
re: ms -- thank you for calling that out. i've been thinking we had been collectively sleepwalking into ms owning everything (again). they've owned everything once before -- it wasn't great!
related side-note: have you had to interact with the core chrome / cdp devs?
Chromium bug tracker is where issues go to die, but aside from that I've had nothing but lovely individual interactions with core chrome devs so far. The devtools frontend/protocol repo is definitely active and more approachable than Chromium itself.
I have not spoken to people that work directly on CDP yet, but I believe we have a call with them soon!
awesome! tell them there are dozens of python fans. dozens!
re side-note: if you know anyone who would be willing to interact, connect me :)
i was going to ask the same to you. :-)
i'm just stubborn enough to find out, though. and i still have a few contacts at the googleplex...
Spoken like someone who has never contributed to open source but just grandstands about how everyone else needs to.
I mean... Playwright was built and is maintained by Microsoft, so I don't think VC money argument really makes sense here.
By the very nature of how Playwright is built we can't contribute to it - it runs inside a JS subprocess and does not expose a bunch of CDP apis that we NEED (for example to make cross origin iframes work).
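The comment above does not say which CDP APIs are missing, so the following is only a guess at the kind of thing involved: with raw CDP, cross-origin (out-of-process) iframes show up as separate targets, and `Target.setAutoAttach` with `flatten: true` gives each one its own `sessionId`. The method, its three parameters, and the `Target.attachedToTarget` / `Target.detachedFromTarget` events are real CDP; the `IframeSessions` class and the function name are hypothetical.

```python
import json


def auto_attach_command(msg_id: int) -> str:
    """Ask the browser to attach to child targets (including OOPIFs) in flat mode."""
    return json.dumps({
        "id": msg_id,
        "method": "Target.setAutoAttach",
        "params": {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True},
    })


class IframeSessions:
    """Tracks which flat-mode session belongs to which iframe target (hypothetical)."""

    def __init__(self) -> None:
        self.by_target: dict[str, str] = {}

    def handle_event(self, raw: str) -> None:
        event = json.loads(raw)
        method = event.get("method")
        if method == "Target.attachedToTarget":
            info = event["params"]["targetInfo"]
            if info["type"] == "iframe":
                self.by_target[info["targetId"]] = event["params"]["sessionId"]
        elif method == "Target.detachedFromTarget":
            gone = event["params"]["sessionId"]
            # Drop the mapping so later commands are not sent to a dead session.
            self.by_target = {t: s for t, s in self.by_target.items() if s != gone}
```

Commands for an element inside such an iframe would then be sent with that iframe's `sessionId` rather than the top-level page's.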