How to Test PDF in Playwright

PDF testing comes up more often than you might think. Invoices, reports, export files, anything your application generates as a PDF will eventually need automated validation. And testing PDF files in Playwright is not complicated at all. You just need the right library.

In this article, I will show you two real scenarios. First, how to read a PDF from a URL and validate its content. Second, how to download a PDF inside a Playwright test and validate it without saving any files to disk. Let's dive in.

Installing the pdf-parse Package

We will use the pdf-parse NPM package. Simple, effective, and widely used for reading PDF documents in Node.js.

Install it as a dev dependency:

npm install pdf-parse --save-dev

After installation, the package will appear in your devDependencies in package.json. That's it.

Reading a PDF and Validating Text

Let's start with the simplest scenario. You have a PDF document available at a URL, and you want to read its content and make assertions.

A basic test that reads a PDF and prints its text content:

import { test, expect } from '@playwright/test';
import pdf from 'pdf-parse';

test('read PDF content', async ({ request }) => {
  const response = await request.get('https://www.princexml.com/samples/invoice-plain/index.pdf');
  const body = await response.body();
  const result = await pdf(body);
  console.log(result.text);
});

We use Playwright's request fixture to fetch the PDF, get the response body as a buffer, and pass it to pdf(). The result.text property contains the entire text content of the document.

If you are not familiar with the request fixture, check out How to Automate API using Playwright for a deeper explanation.

The simplest assertion you can make is checking that the text contains a specific value:

expect(result.text).toContain('INV-2025-001');

This works. It searches the entire text string and validates that the invoice number exists somewhere in the document. But what if you want to be more precise?

Precise PDF Validation with Regular Expressions

Let's say you want to validate that the number 161126 belongs specifically to the "Invoice Number" field in this document, or that November 26 appears next to "Invoice Date." You want to confirm that a value belongs to a specific label, not just that it exists somewhere in the text.

How?

Regular expressions. A regex pattern can match a specific portion of the text based on its surrounding context. You can match "Invoice Number" followed by the actual number and extract just the number part.

Writing regular expressions manually is painful though. Don't do it. Just ask AI to generate them for you. Open GitHub Copilot, ChatGPT, Cursor, whatever tool you have.

Something like this prompt:

Create a regex for the string output of my PDF document. I need to validate the invoice number that belongs to the label "Invoice Number."

And the AI will generate something like this:

const invoiceMatch = result.text.match(/Invoice Number:\s*(\d+)/);
expect(invoiceMatch).not.toBeNull();
expect(invoiceMatch![1]).toBe('161126');

The regex /Invoice Number:\s*(\d+)/ finds the text "Invoice Number:", skips any whitespace, and captures the run of digits that follows. The captured group at index [1] is your invoice number.

You can do the same for any field in the document: dates, totals, bank account numbers. Ask AI to generate the regex, plug it into your test, and validate. It's a no-brainer!
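
If you end up checking many fields, the label-plus-value pattern can be wrapped in a small helper. This is only a sketch: extractField and escapeRegExp are hypothetical names (not part of pdf-parse or Playwright), the sample text is made up for illustration, and the helper assumes each label is followed by an optional colon and then the value.

```typescript
// Hypothetical helper, not part of pdf-parse or Playwright.
// Escapes characters that have a special meaning inside a regular expression.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns the value that follows `label` in the extracted PDF text.
// `valuePattern` is a regex fragment describing the value (default: non-whitespace run).
function extractField(text: string, label: string, valuePattern = '\\S+'): string {
  const pattern = new RegExp(`${escapeRegExp(label)}:?\\s*(${valuePattern})`);
  const match = text.match(pattern);
  if (!match) {
    // Fail with a readable message instead of a TypeError on match[1].
    throw new Error(`Label "${label}" not found in PDF text`);
  }
  return match[1];
}

// Made-up sample of extracted text, for illustration only.
const sampleText = 'Invoice Number: 161126\nInvoice Date: November 26\n';
const invoiceNumber = extractField(sampleText, 'Invoice Number', '\\d+'); // '161126'
const invoiceDate = extractField(sampleText, 'Invoice Date', '[A-Za-z]+ \\d{1,2}'); // 'November 26'
```

In a test, you would pass result.text instead of the sample string, for example expect(extractField(result.text, 'Invoice Number', '\\d+')).toBe('161126').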

Downloading and Validating a PDF in Playwright

The first scenario covers PDFs accessible via a direct URL. But what about a more realistic case? You navigate to an application, click a Download button, and need to validate the downloaded PDF.

Catching the Download Event

The tricky part is timing. You need to start listening for the download event before clicking the button, otherwise you might miss it. The way to handle this in Playwright is with Promise.all:

const [download] = await Promise.all([
  page.waitForEvent('download'),
  page.getByRole('button', { name: 'Download PDF' }).click(),
]);

First, waitForEvent('download') starts listening. Then getByRole clicks the Download button, which triggers the download. Promise.all waits for both to complete, and we get the download object from the first promise.

This download object represents the file the browser downloaded. It exposes methods such as suggestedFilename(), saveAs(), and createReadStream().

Reading the Downloaded PDF Without Saving to Disk

Now, think about it. In testing, you don't actually need the physical file saved somewhere on your computer. It just creates junk in your file system. Instead, read the data directly from the download stream.

const buffer = await download.createReadStream().then(stream => {
  return new Promise<Buffer>((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });
});

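We call download.createReadStream() to get a readable stream, collect the data chunks into an array, and combine them into a single buffer. This buffer is the PDF file content in memory. Nothing is written to your project folder; Playwright keeps the download in a temporary location and deletes it when the browser context closes.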
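
The same chunk collection can also be written with async iteration, which some readers find easier to follow than event listeners. A minimal sketch, assuming a hypothetical helper named streamToBuffer (it is not a Playwright API) and relying only on Node's built-in stream module:

```typescript
import { Readable } from 'node:stream';

// Hypothetical helper: collects every chunk of a readable stream into one Buffer.
// Readable streams are async iterable, so a stream error rejects the returned promise.
async function streamToBuffer(stream: Readable): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
```

Inside the test it would be used as const buffer = await streamToBuffer(await download.createReadStream());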

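Now pass this buffer to pdf-parse and make your assertions. This snippet uses the PDFParse class, imported with import { PDFParse } from 'pdf-parse', which is the class-based API in pdf-parse 2.x; the pdf() function from the first example is the older 1.x API: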

// Requires: import { PDFParse } from 'pdf-parse';
const parser = new PDFParse({ data: buffer });
const result = await parser.getText();
const invoiceRegex = /Invoice number:\s*(\d+)/;
const match = result.text.match(invoiceRegex);
expect(match).not.toBeNull();
expect(match![1]).toBe('161126');
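
If the download goes wrong, for example when the server returns an HTML error page instead of the file, the parser error can be hard to read. A small guard placed before the parsing step can catch that early. This is a sketch under one assumption: looksLikePdf is a hypothetical helper name, and it relies only on the fact that well-formed PDF files begin with the %PDF- signature.

```typescript
// Hypothetical helper: true when the buffer starts with the "%PDF-" signature
// that opens a well-formed PDF file.
function looksLikePdf(data: Buffer): boolean {
  return data.subarray(0, 5).toString('latin1') === '%PDF-';
}
```

In the test, add expect(looksLikePdf(buffer)).toBe(true); right after the buffer is created.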

Full Test Example

The complete test for downloading and validating a PDF:

import { test, expect } from '@playwright/test';
import { PDFParse } from 'pdf-parse';

test('download and validate PDF', async ({ page }) => {
  await page.goto('https://playground.bondaracademy.com/pages/extra-components/pdf-download');

  // Start waiting for the download before clicking, so the event is not missed
  const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.getByRole('button', { name: 'Download PDF' }).click(),
  ]);

  // Read the downloaded PDF from the stream into an in-memory buffer
  const buffer = await download.createReadStream().then(stream => {
    return new Promise<Buffer>((resolve, reject) => {
      const chunks: Buffer[] = [];
      stream.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
      stream.on('error', reject);
      stream.on('end', () => resolve(Buffer.concat(chunks)));
    });
  });

  // Parse the PDF text and check the value next to the "Invoice number" label
  const parser = new PDFParse({ data: buffer });
  const result = await parser.getText();
  const invoiceRegex = /Invoice number:\s*(\d+)/;
  const match = result.text.match(invoiceRegex);
  expect(match).not.toBeNull();
  expect(match![1]).toBe('161126');
});

Navigate, click, catch the download, read the stream, parse, assert. That's the whole thing.

If you want to learn more about handling downloads and other advanced Playwright patterns, the Playwright UI Testing Mastery program at Bondar Academy covers these topics in depth.

Final Thoughts

That's all you need to test PDFs in Playwright. pdf-parse handles the reading, regular expressions handle the precise validation, and AI handles writing those regex patterns so you don't have to. And for downloaded files, just read the data stream directly. No files in your project, no cleanup needed.

Microsoft Playwright is growing in popularity on the market very quickly and will soon be a mainstream framework. Get the new skills at Bondar Academy with the Playwright UI Testing Mastery program. Start from scratch and become an expert to increase your value on the market!

Frequently Asked Questions

What is the best NPM package for testing PDF files in Playwright?

The pdf-parse package is the most straightforward choice. It extracts text content from PDFs with minimal setup. Another option is pdf2json, which gives you more structured data but has a more complex API.

Do I need to save the downloaded PDF to disk before validating it?

No. Use download.createReadStream() to read the file content directly into a buffer. This avoids creating files in your project and keeps your tests clean.

Can I validate PDF metadata like page count or keywords?

Yes. The result object returned by the pdf() function includes numpages for page count and info for metadata. You can assert on result.numpages or result.info.Title directly.

Can I test PDF files that contain images or charts?

The pdf-parse package extracts only text content. For visual validation of images or charts inside PDFs, you would need a screenshot-based approach using Playwright's built-in toMatchSnapshot() or a library like pixelmatch.