Sitecore Search - Multi-Language PDF Metadata Crawling
Summary
Sitecore Search is a powerful machine learning search platform that is part of the Sitecore DXP ecosystem.
Recently, we had a requirement to crawl PDF metadata in multiple languages and pull that data into our Sitecore Search index.
There were a couple of complicating factors:
- All of the PDF links and metadata lived in an external system outside of our control
- The PDFs existed in more than 40 languages
The Solution
In a previous post we walked through the steps of creating a multi-language crawling Source in Sitecore Search. We will not dive into all the details of setting up locales and adding them to your Source again here, but please review that post for background, since a Source configured with locales is a prerequisite for what follows.
Sitecore has specific documentation on recommended approaches for crawling PDF content. Unfortunately, our PDFs did not fit that scenario: there were no URL patterns or common data we could extract. We had to come up with another solution, and because Sitecore Search is so flexible, that was not a problem.
In our case, we were able to work with the client and their external data provider to produce a nightly-built HTML table containing all of the PDF metadata needed for the Sitecore Search index. Each row had a link to a PDF and all of the metadata associated with it.
Here is an example of what a row of that listing page's HTML looked like:
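(The markup below is a simplified, hypothetical reconstruction rather than the client's actual page: the values are invented, the columns our extractor does not read are left as empty placeholder cells, and only the #mc-main-content container and the column positions match what the Document Extractor later in this post expects.)

<div id="mc-main-content">
  <table>
    <tr><!-- header row, skipped by the extractor --></tr>
    <tr>
      <td></td>
      <td></td>
      <td>owners-manual.pdf</td>
      <td></td>
      <td>Owner's manual for the X200 series</td>
      <td>2024-05-01</td>
      <td>Manual</td>
      <td>Documentation</td>
      <td>X200</td>
      <td>https://website.com/assets/pdfs/owners-manual.pdf</td>
      <td>en_us</td>
    </tr>
  </table>
</div>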
First, we set up our JS Trigger. In our scenario the data was segmented across a few static HTML pages, and we needed to crawl all of them from a single Source in Sitecore Search.
We also set up the Trigger to append a locale (language) parameter to each URL so that the locale could be set correctly for that content later in the pipeline.
function extract() {
  var arr = [];
  // The static listing pages that hold the PDF metadata tables
  var baseUrls = [
    "https://website.com/assets/Manuals.html",
    "https://website.com/assets/Documents.html",
    "https://website.com/assets/Diagrams.html"
  ];
  var locales = ['fr_fr', 'en_us', 'pt_br', 'es_mx'];
  // Queue every page once per locale, tagging each URL with a locale parameter
  for (var i = 0; i < locales.length; i++) {
    var locale = locales[i];
    for (var j = 0; j < baseUrls.length; j++) {
      var url = `${baseUrls[j]}?locale=${locale}`;
      arr.push({ "url": url });
    }
  }
  return arr;
}
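With the four example locales and three pages above, the Trigger queues twelve crawl requests, one per page/locale combination, each in the form { "url": "https://website.com/assets/Manuals.html?locale=fr_fr" }.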
The next step was to set up our Locale Extractor, shown below. This is what actually sets the locale (language) on each item in the index.
function extract(request, response) {
  var locales = ['fr_fr', 'en_us', 'pt_br', 'es_mx'];
  // Look for the locale query string parameter the Trigger appended to the URL
  for (var idx = 0; idx < locales.length; idx++) {
    var locale = locales[idx];
    if (request.url.indexOf('locale=' + locale) >= 0) {
      return locale;
    }
  }
  // Fall back to the default locale if no match is found
  return "en_us";
}
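For example, the request URL https://website.com/assets/Manuals.html?locale=pt_br resolves to the pt_br locale, while any URL without a recognized locale parameter falls back to en_us.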
The final step was to set up our JS Document Extractor, shown below. It parses the HTML body of each page the Trigger queued and turns each table row of PDF data into a document for the index.
Note how the locale is handled: we compare each row's locale column against the locale of the request currently being indexed, and only matching rows are added to the index.
function extract(request, response) {
  var $ = response.body;
  // The Trigger appended ?locale=<locale> to each URL; grab that value
  var main_locale = request.url.split('=').pop();
  var $rows = $('#mc-main-content > table').find('tr');
  if ($rows == null) {
    return [];
  }
  function filterEmptyStrings(array) {
    return array.filter(str => str !== null && str.trim() !== '');
  }
  const out = $rows
    .map((i, elem) => {
      // Skip the header row
      if (i != 0) {
        var description = $(elem).find("td:nth-child(5)").text();
        var filename = $(elem).find("td:nth-child(3)").text();
        var asset = $(elem).find("td:nth-child(7)").text();
        var category = $(elem).find("td:nth-child(8)").text();
        var product = $(elem).find("td:nth-child(9)").text();
        var url = $(elem).find("td:nth-child(10)").text();
        var date = $(elem).find("td:nth-child(6)").text();
        var locale = $(elem).find("td:nth-child(11)").text();
        // Only index rows that match the locale of the current request
        if (locale != main_locale) {
          return [];
        }
        return {
          id: url.replace(/[^a-zA-Z\d_-]/g, "_"),
          url: url,
          type: "Documentation / Software",
          name: filename,
          description: description,
          assettype: filterEmptyStrings(asset?.split(",") || []),
          category: filterEmptyStrings(category?.split(",") || []),
          product: filterEmptyStrings(product?.split(",") || []),
          date: new Date(date).getTime(),
          source: "Product Documentation",
        };
      }
    })
    .get();
  return out;
}
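To tie the pieces together: for a request tagged with locale=en_us, the hypothetical row shown earlier would produce a document along these lines (illustrative values only; the field names come straight from the extractor above):

{
  id: "https___website_com_assets_pdfs_owners-manual_pdf",
  url: "https://website.com/assets/pdfs/owners-manual.pdf",
  type: "Documentation / Software",
  name: "owners-manual.pdf",
  description: "Owner's manual for the X200 series",
  assettype: ["Manual"],
  category: ["Documentation"],
  product: ["X200"],
  date: 1714521600000, // new Date("2024-05-01").getTime()
  source: "Product Documentation"
}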
Finally, we can go to the Content page in Sitecore Search and confirm that the data has been crawled into our index.
Conclusion
Sitecore Search is a powerful tool, and including PDFs and multi-language content in your search is a common request. While Sitecore offers guidelines and concepts to get you started, the platform is flexible enough to let you build a solution tailored to your own needs and requirements. Feel free to use this example as a starting point.