Using puppeteer to find missing translations

3 October, 2018
Tech blog post #01

Over the last quarter we’ve been developing a multi-language version of our website. In a short time we added 5 languages on top of the 2 original versions. More are on the horizon, but a doubt lingered: were we covering all of our text with translations?

Our website is rendered on both the server and the client side:

– simple views are rendered by Node.js, with translations provided by polyglot

– interactive components use Vue.js, with translations provided by vue-i18n

– both use the same translation source

We had already been logging errors: whenever someone opened a page that lacked a translation in the given language, the missing phrase was logged as well. This, however, left us with a few problems:

– we could only find missing translations by visiting the page

– we had no automated way of testing

– multiple visits to the same subpage generated multiple log entries with the same error, which made the logs difficult to read

In this article I will focus on catching errors from Vue.js rendered components.

Since we had logging in place, all I had to do was trigger every possible error to see which translations were missing. For that we needed a tool that would open every page in every language and check the console for errors. We already had a sitemap, so the remaining piece was an automated browser that would walk through it and scrape the logged errors. I decided to use puppeteer, a Node.js library that provides an API to control headless Chrome.

The script has to:

  1.   Download the list of URLs
  2.   Visit each subpage and save any logs with missing translations from Vue.js views
  3.   Print out all missing translations

  1.   Download the list of URLs

We already have a sitemap available on the website. I’m using axios to send a GET request that fetches it. Since the sitemap is XML, I’m using xml2js to parse it and extract the URLs.

I am using the ES2017 async/await keywords to wait for the result of the axios GET request. The sitemap location depends on the environment the app is running in. You can see an example sitemap for production at https://www.smsapi.com/sitemap.xml.
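A minimal sketch of this step follows; it is an illustration, not the original script. The function names `extractUrls` and `getUrls` are my own, and the code assumes a reasonably recent xml2js that exposes `parseStringPromise` (older releases only offer the callback-based `parseString`). The libraries are required inside `getUrls` so that the parsing helper can be tried out on its own.

```javascript
// Pull every <loc> value out of a sitemap that was parsed by xml2js.
// With default options xml2js turns
//   <urlset><url><loc>https://…</loc></url></urlset>
// into { urlset: { url: [ { loc: ['https://…'] } ] } }.
function extractUrls(parsedSitemap) {
  const entries = (parsedSitemap.urlset && parsedSitemap.urlset.url) || [];
  return entries.map((entry) => entry.loc[0].trim());
}

// Download the sitemap and return the list of page URLs.
async function getUrls(sitemapUrl) {
  const axios = require('axios');
  const { parseStringPromise } = require('xml2js');

  // await pauses this function until the GET request has finished
  const response = await axios.get(sitemapUrl);
  const parsedSitemap = await parseStringPromise(response.data);
  return extractUrls(parsedSitemap);
}
```

Calling `getUrls('https://www.smsapi.com/sitemap.xml')` from another async function would then resolve to an array of URL strings.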

Note: If you’re using a development environment that doesn’t have a valid SSL certificate, make sure to disable TLS verification.
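One possible way to do this for the sitemap request, sketched under my own assumptions: build an `https.Agent` with `rejectUnauthorized: false` and hand it to axios through its `httpsAgent` request option. The helper name `sitemapRequestConfig` is hypothetical. Setting the `NODE_TLS_REJECT_UNAUTHORIZED=0` environment variable is another option, but it switches verification off for the whole process.

```javascript
const https = require('https');

// Build the axios request config for fetching the sitemap.
// allowSelfSigned should only ever be true in development.
function sitemapRequestConfig(allowSelfSigned) {
  if (!allowSelfSigned) {
    return {};
  }
  // An agent that skips certificate verification for this request only
  return { httpsAgent: new https.Agent({ rejectUnauthorized: false }) };
}

// Intended use (with axios installed):
//   axios.get(sitemapUrl, sitemapRequestConfig(true))
```

Keeping the switch behind a parameter makes it harder to ship the insecure setting to production by accident.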

  2.   Visit each subpage and save any logs with missing translations

I will create a function that launches a browser, adds a listener to the console, saves every log containing a missing translation warning, and returns the collected logs.

I’ll define an async function, as most puppeteer actions are asynchronous. It receives the array of URLs that was fetched and parsed in the first step. First, let’s launch a browser.
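Launching could look roughly like the sketch below. `launchOptions` and `launchBrowser` are names I made up for illustration. `ignoreHTTPSErrors` is the launch option puppeteer offered when this post was written; newer releases may name it differently, so check the documentation for the version you have installed.

```javascript
// Launch options for development: accept the self-signed certificate.
const launchOptions = { ignoreHTTPSErrors: true };

// Start headless Chrome and return the browser handle.
// puppeteer is required here so the options above can be inspected
// without the library being installed.
async function launchBrowser() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(launchOptions);
  return browser;
}
```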

Since we’re assigning it to a variable, I can now reference it as if it were an actual browser. Through this reference I can open pages, switch between tabs and a lot more. Check the documentation to see the possibilities.

Note that I’m ignoring HTTPS errors at launch, since I don’t use a valid SSL certificate in development (only a self-signed one).

Now I can open a page, which works just like opening a new tab in the Chrome browser.

Before I start crawling the site I want to add a listener on the console that will catch the logs with missing translations. Here’s a sample console log:

[vue-i18n] Value of key 'Explore case studies and learn about Customers who are succeeding with SMSAPI:' is not a string!

As you can see, it starts with “[vue-i18n] Value ”, which I can catch with a regex.
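A pattern along these lines would do; the constant name `MISSING_TRANSLATION` is my own choice. The square brackets are escaped because they would otherwise start a character class, and `^` anchors the match to the start of the message.

```javascript
// Matches console messages that begin with the vue-i18n warning prefix.
const MISSING_TRANSLATION = /^\[vue-i18n\] Value /;

const warning = "[vue-i18n] Value of key 'form.placeholder.email' is not a string!";
const unrelated = 'Download the Vue Devtools extension for a better development experience';

console.log(MISSING_TRANSLATION.test(warning));   // true
console.log(MISSING_TRANSLATION.test(unrelated)); // false
```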

Now I’ll add a listener that pushes every unique log starting with that phrase to an array.
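As a sketch, such a listener can be attached with puppeteer’s `page.on('console', …)` event, reading the message through `message.text()`. The function name `collectMissingTranslations` and the idea of passing in the target array are assumptions of mine.

```javascript
const MISSING_TRANSLATION = /^\[vue-i18n\] Value /;

// Attach a console listener to a puppeteer page. Every message that
// matches the vue-i18n warning prefix is stored once in the given array.
function collectMissingTranslations(page, missingTranslations) {
  page.on('console', (message) => {
    const text = message.text();
    const isNew = !missingTranslations.includes(text);
    if (MISSING_TRANSLATION.test(text) && isNew) {
      missingTranslations.push(text);
    }
  });
}
```

Because duplicates are dropped here, a phrase that is missing on many pages shows up only once in the final report.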

Now the fun part: the actual web crawling. We loop through the list of URLs and visit each page. After we’re done, we close the browser.

There’s a catch though: it’s all done asynchronously, so I’m not using Array.forEach, which does not wait for async callbacks to finish. Since JavaScript doesn’t have a built-in asyncForEach function, we need to implement it ourselves.
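One common way to write such a helper is a plain `for` loop that awaits the callback on every iteration. This is a sketch of the general technique, not necessarily the exact helper used in the script.

```javascript
// Like Array.prototype.forEach, but waits for each async callback to
// finish before moving on to the next element.
async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    await callback(array[index], index, array);
  }
}

// Example: the items are handled strictly in order, one after another,
// even though the first one takes the longest.
async function demo() {
  const handled = [];
  await asyncForEach([30, 10, 20], async (delay) => {
    await new Promise((resolve) => setTimeout(resolve, delay));
    handled.push(delay);
  });
  return handled; // [30, 10, 20]
}
```

With `Array.forEach` all three callbacks would start at once and the loop would return before any of them finished, which is not what we want when a single browser tab has to visit the pages one by one.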

Now puppeteer will visit each page and save the missing translations to an array. We only need to return it at the end.
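Put together, the crawling function could look like the sketch below. It takes an already launched puppeteer `browser` as a parameter, which is my own simplification, and the name `findMissingTranslations` is hypothetical.

```javascript
const MISSING_TRANSLATION = /^\[vue-i18n\] Value /;

// Sequential replacement for Array.forEach with async callbacks
async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    await callback(array[index], index, array);
  }
}

// Visit every URL in one tab and return the unique vue-i18n warnings.
async function findMissingTranslations(browser, urls) {
  const missingTranslations = [];
  const page = await browser.newPage();

  page.on('console', (message) => {
    const text = message.text();
    if (MISSING_TRANSLATION.test(text) && !missingTranslations.includes(text)) {
      missingTranslations.push(text);
    }
  });

  await asyncForEach(urls, async (url) => {
    // Resolves once the page has fired its load event
    await page.goto(url);
  });

  await browser.close();
  return missingTranslations;
}
```

If some warnings are only logged after the load event, `page.goto` also accepts a `waitUntil` option (for example `'networkidle2'`) that waits longer before moving on.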

  3.   Print all missing translations

Now that we have the functions needed to get the URLs and the missing translations, let’s look at how we’d actually use them.

To get the URLs and run our puppeteer web crawler, we need to call both functions either inside an asynchronous function or by chaining promises with then/catch. I will use the first approach.

Now I can print the result to the console (which is stdout in the case of Node.js).

The last remaining thing is to kill the Node.js process.
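The last three steps can be sketched as one async function. To keep the example independent of the network, the two functions from the earlier steps are passed in as parameters here; that wiring, and the name `run`, are assumptions of mine rather than the original code.

```javascript
// Fetch the URLs, crawl them and print every missing translation.
// getUrls and findMissingTranslations are the functions from steps 1 and 2.
async function run({ getUrls, findMissingTranslations, print = console.log }) {
  const urls = await getUrls();
  const missingTranslations = await findMissingTranslations(urls);

  // One warning per line, written to stdout by default
  print(missingTranslations.join('\n'));
  return missingTranslations;
}

// In the script itself this would be started once, and the process
// ended explicitly when the work is done:
//   run({ getUrls, findMissingTranslations }).then(() => process.exit());
```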

The final code looks like this:
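What follows is a reconstruction sketch assembled from the steps described in this article, not the original script. It assumes axios, xml2js (a version with `parseStringPromise`) and puppeteer are installed, and it reads the sitemap location from a hypothetical `SITEMAP_URL` environment variable, which is my own addition.

```javascript
// missing-translations-crawler.js (reconstruction sketch)
const MISSING_TRANSLATION = /^\[vue-i18n\] Value /;

// Sequential replacement for Array.forEach with async callbacks
async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    await callback(array[index], index, array);
  }
}

// Step 1: download the sitemap and extract the page URLs
async function getUrls(sitemapUrl) {
  const axios = require('axios');
  const { parseStringPromise } = require('xml2js');
  const response = await axios.get(sitemapUrl);
  const sitemap = await parseStringPromise(response.data);
  return sitemap.urlset.url.map((entry) => entry.loc[0].trim());
}

// Step 2: visit each page and collect unique vue-i18n warnings
async function findMissingTranslations(urls) {
  const puppeteer = require('puppeteer');
  // Option name as of the puppeteer version used at the time of writing
  const browser = await puppeteer.launch({ ignoreHTTPSErrors: true });
  const page = await browser.newPage();
  const missingTranslations = [];

  page.on('console', (message) => {
    const text = message.text();
    if (MISSING_TRANSLATION.test(text) && !missingTranslations.includes(text)) {
      missingTranslations.push(text);
    }
  });

  await asyncForEach(urls, async (url) => {
    await page.goto(url);
  });

  await browser.close();
  return missingTranslations;
}

// Step 3: print everything that was found
async function main(sitemapUrl) {
  const urls = await getUrls(sitemapUrl);
  const missingTranslations = await findMissingTranslations(urls);
  console.log(missingTranslations.join('\n'));
}

// Runs only when SITEMAP_URL is set (hypothetical variable name)
if (process.env.SITEMAP_URL) {
  main(process.env.SITEMAP_URL)
    .then(() => process.exit(0))
    .catch((error) => {
      console.error(error);
      process.exit(1);
    });
}
```

With this sketch the crawler would be started as `SITEMAP_URL=https://www.smsapi.com/sitemap.xml node missing-translations-crawler.js`.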

Sample output:

[vue-i18n] Value of key 'Explore case studies and learn about Customers who are succeeding with SMSAPI:' is not a string!

[vue-i18n] Value of key 'form.placeholder.email' is not a string!

The output can be piped into a file so it can be viewed or modified later: node missing-translations-crawler.js > report

In this article I wanted to focus on logging translation errors, but the script has since been extended:

– the sitemap location is parameterized using commander

– the language is parameterized and can be used to filter the URL list to speed up the process

– each language is printed out separately in the report

Creating a web crawler with puppeteer was a pleasure, and the entire process (crawling through 200 pages) takes less than 2 minutes. It’s extremely useful for development, as it gives us more confidence when adding a new language or making changes. We’re currently running it in a Jenkins job, so we always have a current report showing whether any errors occurred.

Author: Józef Piecyk