Sitecore Search - Multi Language Data Crawling
Summary
Sitecore Search is an AI-driven, headless content discovery platform that enables you to create search experiences across multiple sources.
We often need to crawl content that lives outside the website hosting the search itself. Sitecore Search gives us the flexibility to crawl many types of web content.
A common use case for search is to index content across multiple languages. We recently had a project where we needed to crawl content on multiple sitemaps across multiple languages into one source in Sitecore Search.
This blog post walks through how we achieved that.
The Solution
The first step for us was to identify the languages we were indexing. On the Domain Settings page in Sitecore Search, make sure those locales, or languages, have been added.
Create a new Web Crawler source and make sure those locales are selected.
Next, we set up a Trigger of the type Sitemap Index. We added the URL of the individual sitemap for each language.
We then set up our Document Extractor.
The Document Extractor and Locale Extractor are what run when the Trigger is called. The sitemap is crawled, and the data and locale are extracted for each entry in the sitemap.
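As a rough mental model of that flow, the sketch below runs a document extractor and a locale extractor once per sitemap entry. It is purely conceptual and is not how Sitecore Search is implemented: the `crawl` function, the shape of the `pages` objects, and the Spanish URL are all hypothetical stand-ins.

```javascript
// Conceptual model only: NOT Sitecore Search internals.
// Each page object stands in for one fetched sitemap entry (hypothetical data).
function crawl(pages, extractDocument, extractLocale) {
    return pages.map(page => ({
        locale: extractLocale(page),      // one locale per sitemap entry
        documents: extractDocument(page), // one or more documents per entry
    }));
}

const pages = [
    { url: 'https://examplesite.com/products/123', lang: 'en_us', title: 'Title | Homepage' },
    { url: 'https://examplesite.com/es/products/123', lang: 'es_es', title: 'Título | Inicio' },
];

const indexed = crawl(
    pages,
    page => [{ name: page.title, url: page.url }],
    page => page.lang
);
```

Every entry ends up with its own locale, which is what lets several language-specific sitemaps feed a single source.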
The HTML pages we crawled had a lang attribute and meta tags like the following:
<html class="no-js" lang="en_us">
<title>Title | Homepage</title>
<meta name="category" content="CategoryA,CategoryB">
<meta name="product" content="Product1">
<meta name="asset-type" content="Webpage">
<meta name="solution" content="Solution1,Solution2,Harvesting,Solution3">
<meta name="date" content="2022-05-31T15:35:26-05:00">
<meta name="description" content="This is some example description info.">
<meta content="https://examplesite.com/products/123" property="og:url">
This post assumes familiarity with attributes, or index fields, in Sitecore Search. Please review the official documentation to make sure you have added the attributes that match the data fields you are extracting.
Certain metadata fields, such as product and category, had multiple comma-separated values, so we needed our extractor to return arrays for those fields.
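To illustrate the array handling in isolation, here is a minimal sketch that turns a comma-separated meta value into a clean array. The helper name `splitMetaValues` is hypothetical, and trimming each value is an assumption added here, not something the crawler requires.

```javascript
// Hypothetical helper: turns a comma-separated meta "content" value into an array.
function splitMetaValues(content) {
    // A missing meta tag yields undefined, so fall back to an empty array.
    if (typeof content !== 'string') {
        return [];
    }
    return content
        .split(',')
        .map(value => value.trim())      // assumption: surrounding whitespace is not meaningful
        .filter(value => value !== '');  // drop empty entries such as the gap in "A,,B"
}

const solutions = splitMetaValues('Solution1,Solution2,Harvesting,Solution3');
const missing = splitMetaValues(undefined);
```

Returning an empty array for a missing tag keeps the field type consistent across documents, which is the same intent as the `|| []` fallback used in the extractor.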
We can hard-code values if needed; otherwise, we get all the data from the metadata above. Below is a sample of the extractor code:
function extract(request, response) {
    // response.body exposes a selector function over the crawled page.
    const $ = response.body;

    // Drop null, empty, and whitespace-only entries from a list of values.
    function filterEmptyStrings(array) {
        if (array === null) {
            return [];
        }
        return array.filter(str => str !== null && str.trim() !== '');
    }

    return [{
        'name': $('title').text(),
        'description': $('meta[name="description"]').attr('content'),
        // Multi-value fields: split the comma-separated content into an array.
        'category': filterEmptyStrings($('meta[name="category"]').attr('content')?.split(',') || []),
        'product': filterEmptyStrings($('meta[name="product"]').attr('content')?.split(',') || []),
        'assettype': filterEmptyStrings($('meta[name="asset-type"]').attr('content')?.split(',') || []),
        'solution': filterEmptyStrings($('meta[name="solution"]').attr('content')?.split(',') || []),
        // Convert the ISO 8601 date string into epoch milliseconds.
        'date': new Date($('meta[name="date"]').attr('content')).getTime(),
        'url': $('meta[property="og:url"]').attr('content'),
        // Hard-coded value shared by every document from this source.
        'type': 'Website',
    }];
}
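One detail worth guarding in the extractor above: `new Date(undefined).getTime()` evaluates to `NaN`, so a page without a `date` meta tag would produce a non-numeric value. The sketch below shows one possible guard. The helper name `toEpochMillis` and the choice to return `null` are assumptions; whether to omit the field or use a default depends on how the attribute is configured in your domain.

```javascript
// Hypothetical helper: converts a date string to epoch milliseconds,
// or returns null when the value is missing or cannot be parsed.
function toEpochMillis(value) {
    if (typeof value !== 'string') {
        return null; // missing meta tag
    }
    const millis = new Date(value).getTime();
    return Number.isNaN(millis) ? null : millis; // invalid dates parse to NaN
}

const published = toEpochMillis('2022-05-31T15:35:26-05:00');
const absent = toEpochMillis(undefined);
```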
Last, we set up the Locale Extractor of type JS to make sure the locale is set correctly based on the language of each item being indexed. The code was as follows:
function extract(request, response) {
    const $ = response.body;
    // Locales handled explicitly; these must match the locales selected on the source.
    const supported = ['es_es', 'de_de', 'pt_pt', 'nl_nl', 'fr_fr'];
    // The page declares its language on the <html> element.
    const locale = $('html').attr('lang');
    if (supported.indexOf(locale) !== -1) {
        return locale;
    }
    // Fallback for en_us pages and for any missing or unrecognized value.
    return 'en_us';
}
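If some pages declare their language in a different shape, such as `es-ES` instead of `es_es`, a strict comparison would send them to the fallback. The sketch below normalizes the value before matching. It reuses the locale codes from the extractor above; the helper name `resolveLocale` is hypothetical, and whether your pages need this normalization is an assumption to verify against your own markup.

```javascript
const SUPPORTED_LOCALES = ['en_us', 'es_es', 'de_de', 'pt_pt', 'nl_nl', 'fr_fr'];
const FALLBACK_LOCALE = 'en_us';

// Hypothetical helper: maps a raw lang attribute to a supported locale code.
function resolveLocale(lang) {
    if (typeof lang !== 'string') {
        return FALLBACK_LOCALE; // no lang attribute on the page
    }
    // Normalize case and separator so "es-ES" and "ES_es" both become "es_es".
    const normalized = lang.trim().toLowerCase().replace('-', '_');
    return SUPPORTED_LOCALES.indexOf(normalized) !== -1 ? normalized : FALLBACK_LOCALE;
}
```

Inside a Locale Extractor, this could be called as `return resolveLocale($('html').attr('lang'));`.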
We can go to the Content page in Sitecore Search and confirm data has been crawled and is in our index:
Then, we can select a result and see our data extracted:
Conclusion
Sitecore Search is flexible and allows us to index content in multiple languages. This simple example shows a straightforward approach to getting content across multiple languages into your Search source and index.