Playwright ID scraper

31 January 2023 by Jack McGregor

This will be a short post because I'm tired:
At Antler we've recently implemented a new pattern for creating Playwright test ID selectors dynamically. I know this is a bit of a Playwright anti-pattern, but we only have the budget to use Playwright for smoke tests rather than fully-fledged E2E tests, so time is money!
Furthermore, given that both the front-end and back-end teams have their own Playwright use-cases, we needed an easy way to get all the IDs on a page and ideally make them type-safe, both for regression tests and for IntelliSense.
As a disclaimer, we use Next.js for our front-end, so simply scraping the source files doesn't work. We need the rendered HTML instead.

Part 1: The scraper

The TL;DR of this function is that it takes in the HTML as a string (e.g. "<body>/*...*/</body>"), uses a regex to find all occurrences of our selector (`data-playwright-id="<the_id_we_want>"`), and then writes them out in 4 formats (plain text, JSON, JS and TS):
import fs from 'fs';
import util from 'util';

// Lowercase the input, then uppercase the character that follows each run of
// non-alphanumerics, e.g. "batch-reports-new" becomes "batchReportsNew"
function camelize(str: string) {
  return str
    .toLowerCase()
    .replace(/[^a-zA-Z0-9]+(.)/g, (m, chr) => chr.toUpperCase());
}

interface LocatorList {
  [key: string]: string | LocatorList;
}

// Recursively merge b into a so that sibling branches (a.b.c and a.b.d)
// are both kept instead of overwriting each other
function mergeLocators(a: LocatorList, b: LocatorList): LocatorList {
  const out: LocatorList = { ...a };
  for (const [key, value] of Object.entries(b)) {
    const current = out[key];
    out[key] =
      typeof value === 'object' && typeof current === 'object'
        ? mergeLocators(current, value)
        : value;
  }
  return out;
}

export const scrapePlaywrightIds = async (page: string, html: string) => {
  try {
    const playwrightIdRegExp = /data-playwright-id="([^"]+)"/g;
    const ids = html.matchAll(playwrightIdRegExp);
    const idArray: string[] = [];
    let text = `// ${page}\n`; // top of the page
    for (const id of ids) {
      // id[1] is the capture group: the ID without the attribute name or quotes
      const _id = id[1];
      text += `${_id}\n`;
      idArray.push(_id);
    }

    // reduce the flat list of IDs into one nested object
    const selectors = idArray.reduce<LocatorList>((prev, selector) => {
      // split by .
      const split = selector.split('.');
      // build the nested branch for this ID from the innermost segment outwards;
      // every level stores its own full ID under the `_` key
      const reducedSelector = split.reduceRight<LocatorList>(
        (prevSelector, segment, i) => ({
          [segment]: {
            _: split.slice(0, i + 1).join('.'),
            ...prevSelector
          }
        }),
        {}
      );
      return mergeLocators(prev, reducedSelector);
    }, {});
    const result = { page, selectors };

    // make sure both output directories exist before writing anything
    fs.mkdirSync('./output', { recursive: true });
    fs.mkdirSync('./playwright/test-ids/output', { recursive: true });

    /**
     * ========================
     * PLAIN TEXT
     * ========================
     */
    fs.writeFileSync(`./output/${page}.txt`, text);

    /**
     * ========================
     * JSON
     * ========================
     */
    fs.writeFileSync(
      `./output/${page}.json`,
      JSON.stringify(result, null, '\t')
    );

    /**
     * ========================
     * JS / TS
     * ========================
     */
    // camelize already strips hyphens, so the page name can be passed as-is
    const func = camelize(page);
    const js = `const ${func} = ${util.inspect(
      result,
      true,
      20
    )};\n\nmodule.exports = ${func};`;
    fs.writeFileSync(`./output/${page}.js`, js, 'utf-8');

    const ts = `export const ${func} = ${util.inspect(result, true, 20)}`;
    fs.writeFileSync(`./playwright/test-ids/output/${page}.ts`, ts, 'utf-8');
  } catch (error) {
    console.error(error);
  }
};
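The matching and nesting steps can also be tried in isolation, without touching the file system. The following is a minimal sketch, not the code we run: `extractIds`, `buildTree`, `IdNode` and `IdTree` are hypothetical names, and the sample HTML is made up. It builds the same nested shape by inserting each ID into the tree directly, which is an alternative to the reduce-based approach.

```typescript
// Each node stores its own full ID under "_" plus one key per child segment.
interface IdNode {
  _: string;
  [segment: string]: string | IdNode;
}

interface IdTree {
  [root: string]: IdNode;
}

// Collect every data-playwright-id value using the regex capture group.
function extractIds(html: string): string[] {
  const pattern = /data-playwright-id="([^"]+)"/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    ids.push(match[1]);
  }
  return ids;
}

// Insert each dotted ID into the tree, reusing branches that already exist.
function buildTree(ids: string[]): IdTree {
  const tree: IdTree = {};
  for (const id of ids) {
    const segments = id.split('.');
    let level: { [key: string]: string | IdNode } = tree;
    segments.forEach((segment, i) => {
      const existing = level[segment];
      const node: IdNode =
        typeof existing === 'object'
          ? existing
          : { _: segments.slice(0, i + 1).join('.') };
      level[segment] = node;
      level = node;
    });
  }
  return tree;
}

// Made-up markup, purely for illustration.
const sampleHtml =
  '<section data-playwright-id="reports.pending-section">' +
  '<button data-playwright-id="reports.pending-section.retry"></button>' +
  '</section><nav data-playwright-id="settings.menu"></nav>';

console.log(JSON.stringify(buildTree(extractIds(sampleHtml)), null, 2));
```

Because the tree is walked one segment at a time, IDs that share a prefix end up under the same branch, and IDs with different first segments sit side by side at the top level.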
The 4 output files per page are:
  1. Plain text
  2. JSON
  3. JS
  4. TS
You clearly don't need all of them, but having the options is useful. In our case, since we want these typed for a better dev experience, we take the .ts files and create declaration files for them. We can do this by using the `typescript` package to emit only declaration files:
// package.json
"scripts": {
  "playwright:types": "tsc ./path/to/your/ts-files/*.ts --declaration --emitDeclarationOnly --outDir types/playwright --esModuleInterop"
}
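To see why the typed output helps, here is a small sketch of what a generated file might contain and how its inferred type can be used. The IDs are invented for illustration, and `ReportSelectors`, `ReportRoots`, `root` and `retryId` are hypothetical names. The real generated file exports the constant; the `export` is left off here so the snippet runs on its own.

```typescript
// Hypothetical contents of a generated ./playwright/test-ids/output/report.ts
// (IDs are made up for this example).
const report = {
  page: 'report',
  selectors: {
    reports: {
      _: 'reports',
      'pending-section': {
        _: 'reports.pending-section',
        'retry-button': { _: 'reports.pending-section.retry-button' }
      }
    }
  }
};

// The type is inferred from the object literal. This inferred shape is what
// a declaration file describes: the keys are known, the values are `string`.
type ReportSelectors = typeof report.selectors;
type ReportRoots = keyof ReportSelectors; // the union of top-level keys

const root: ReportRoots = 'reports';

// Every step of this path is something the editor can autocomplete.
const retryId: string =
  report.selectors[root]['pending-section']['retry-button']._;

console.log(retryId);
```

Since the keys are part of the type, a renamed or removed ID shows up as a missing property in the editor rather than as a selector that silently matches nothing.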

Part 2: The scraping

OK, we have the script, awesome, but right now it's a manual process. We're going to build this into a separate Playwright script that will crawl, scrape, store and parse all of our selectors.
// package.json
"scripts": {
  "playwright:ids": "playwright test --config playwright-id.config.ts",
  "playwright:types": "tsc ./path/to/your/ts-files/*.ts --declaration --emitDeclarationOnly --outDir types/playwright --esModuleInterop",
  "playwright:scrape": "yarn playwright:ids && yarn playwright:types"
}
// playwright-id.config.ts
/* eslint-disable no-console */
import { PlaywrightTestConfig } from '@playwright/test';
import dotenv from 'dotenv';
import path from 'path';
import env from './config';

if (!env.VERCEL_ENV) {
  // Necessary to load env vars within the Playwright tests
  dotenv.config({ path: path.resolve(__dirname, '.env') });
}

const config: PlaywrightTestConfig = {
  workers: process.env.CI ? 4 : undefined,
  fullyParallel: false,
  testDir: './playwright/test-ids', // location of the scraping "tests"
  maxFailures: 1,
  use: {
    headless: true,
    baseURL: env.URL,
    viewport: { width: 1400, height: 980 },
    browserName: 'chromium',
    testIdAttribute: 'data-playwright-id'
  },
  // outside Vercel, start the local dev server before running
  ...(!env.VERCEL_ENV && {
    webServer: {
      port: 3000,
      command: 'yarn dev'
    }
  })
};

export default config;
// playwright/test-ids/index.spec.ts
import { Page, test } from '@playwright/test';
import config from '../../config';
import { scrapePlaywrightIds } from '../../scripts/playwright-id-scraper';

const pages: string[] = [
  'api-documentation',
  'batch-reports/new',
  'batch-reports',
  'organisation',
  'reports',
  'report',
  'settings',
  'sme-calculator'
];

// Visit each page in turn, scrape its rendered HTML and yield the page name
// once its output files have been written
async function* scrapeAndCompile(page: Page) {
  const getBody = () => page.locator('body').innerHTML();
  for (const url of pages) {
    await page.goto(url);
    await page.waitForURL(url);
    const bodyText = await getBody();
    // "batch-reports/new" becomes "batch-reports-new" so it is a valid file name
    const pageName = url.replaceAll('/', '-');
    await scrapePlaywrightIds(pageName, bodyText);
    yield pageName;
  }
}

test.describe('Fetch all data-playwright-ids', () => {
  test('Dashboard', async ({ browser }) => {
    const page = await browser.newPage();
    try {
      for await (const pageName of scrapeAndCompile(page)) {
        console.log('scraped', pageName);
      }
    } catch (error) {
      console.log(error);
    }
  });
});
And there it is: when you run `yarn playwright:scrape`, it runs Playwright with a dedicated config that targets a directory containing no real tests, only our scraping script, which runs for each page that gets rendered. Pretty neat. After Playwright has run, `tsc` converts those files to declaration files so you can use IntelliSense in your tests like so:
import { test } from '@playwright/test';
import { report } from '../../types/playwright/report';

test.describe('Tests', () => {
  test('User can use IntelliSense', async ({ browser }) => {
    const page = await browser.newPage();
    // autofilled with IntelliSense; getByTestId looks up the configured
    // testIdAttribute (data-playwright-id)
    await page
      .getByTestId(report.selectors['reports']['pending-section']._)
      .click();
  });
});
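The name of the exported constant (`report` above) comes from the route: slashes become hyphens for the file name, and the file name is camelized for the export. Here is a small sketch of that mapping, assuming the same `camelize` logic as the scraper; `outputNames` is a hypothetical helper that only exists for this example.

```typescript
// Same idea as the scraper's camelize: lowercase everything, then uppercase
// the character that follows each run of non-alphanumerics.
function camelize(str: string): string {
  return str
    .toLowerCase()
    .replace(/[^a-zA-Z0-9]+(.)/g, (_match: string, chr: string) => chr.toUpperCase());
}

// Hypothetical helper: derive the file name, export name and .ts path
// that a given route ends up with.
function outputNames(route: string) {
  const pageName = route.split('/').join('-');
  return {
    pageName,
    exportName: camelize(pageName),
    tsFile: `./playwright/test-ids/output/${pageName}.ts`
  };
}

for (const route of ['report', 'batch-reports/new', 'sme-calculator']) {
  console.log(route, outputNames(route));
}
```

So a route like `batch-reports/new` is written to `batch-reports-new.ts` and imported as `batchReportsNew`.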