Sitecore Search - Multi Language Data Crawling

Jared Arnofsky, Sitecore Search

Summary

Sitecore Search is an AI-driven, headless content discovery platform that enables you to create search experiences across multiple sources.

Often the content we crawl lives outside the website that hosts the search itself. Sitecore Search gives us the flexibility to crawl many types of web content.

A common use case for search is to index content across multiple languages. We recently had a project where we needed to crawl content from multiple sitemaps, in multiple languages, into a single source in Sitecore Search.

This blog post walks through how we achieved that.

The Solution

Our first step was to identify the languages we were indexing. On the Domain Settings page in Sitecore Search, make sure those locales (languages) have been added.

Domain Settings

Create a new Web Crawler source and make sure those locales are selected.

Locale

Next, we set up a Trigger of type Sitemap Index and added the URL of the individual sitemap for each language.

Triggers

We then set up our Document Extractor:

Document Extractor

The Document Extractor and Locale Extractor are what run when the Trigger is called. The sitemap is crawled, and the data and locale are extracted for each entry in the sitemap.

The HTML pages we crawled contained the following markup and meta tags:

<html class="no-js" lang="en_us">
<title>Title | Homepage</title>
<meta name="category" content="CategoryA,CategoryB">
<meta name="product" content="Product1">
<meta name="asset-type" content="Webpage">
<meta name="solution" content="Solution1,Solution2,Harvesting,Solution3">
<meta name="date" content="2022-05-31T15:35:26-05:00">
<meta name="description" content="This is some example description info.">
<meta content="https://examplesite.com/products/123" property="og:url">

This post assumes familiarity with attributes, or index fields, in Sitecore Search. Please review the official documentation to make sure the attributes you have added match the data fields you are extracting.

Certain metadata fields, such as product and category, could hold multiple comma-separated values, so we needed to set up our extractor to return arrays for those fields.

We can hard-code values if needed (as with type in the sample); otherwise, all the data comes from the metadata above. Below is a sample of the extractor code:

function extract(request, response) {
    // response.body is used as a jQuery-like selector for the crawled page.
    $ = response.body;

    // Drop null, empty, and whitespace-only entries from an array of strings.
    function filterEmptyStrings(array) {
        if (array === null) {
            return [];
        }
        return array.filter(str => str !== null && str.trim() !== '');
    }

    return [{
        'name': $('title').text(),
        'description': $('meta[name="description"]').attr('content'),
        // Multi-value fields: split the comma-separated content into an array.
        // A missing meta tag falls back to an empty array.
        'category': filterEmptyStrings($('meta[name="category"]').attr('content')?.split(",") || []),
        'product': filterEmptyStrings($('meta[name="product"]').attr('content')?.split(",") || []),
        'assettype': filterEmptyStrings($('meta[name="asset-type"]').attr('content')?.split(",") || []),
        'solution': filterEmptyStrings($('meta[name="solution"]').attr('content')?.split(",") || []),
        // Convert the ISO 8601 date string to epoch milliseconds.
        'date': new Date($('meta[name="date"]').attr('content')).getTime(),
        'url': $('meta[property="og:url"]').attr('content'),
        // Hard-coded value.
        'type': 'Website',
    }];
}
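
To see what the multi-value and date handling does in isolation, here is a minimal sketch that can be run outside Sitecore Search (for example in Node.js). The helper names splitMetaContent and toEpochMillis are hypothetical and not part of Sitecore Search. Unlike the extractor above, this variation also trims each value and returns null for an unparseable date; both are assumptions of the sketch, not behavior of the original code.

```javascript
// Hypothetical helper (not a Sitecore Search API): split a comma-separated
// meta tag value into an array of trimmed, non-empty strings.
function splitMetaContent(content) {
    // A missing meta tag yields undefined, which becomes an empty array.
    if (typeof content !== 'string') {
        return [];
    }
    return content
        .split(',')
        .map(value => value.trim())
        .filter(value => value !== '');
}

// Hypothetical helper: convert an ISO 8601 date string to epoch milliseconds.
// Returns null when the string is missing or cannot be parsed.
function toEpochMillis(content) {
    const millis = new Date(content).getTime();
    return Number.isNaN(millis) ? null : millis;
}

// Values taken from the sample meta tags shown earlier.
console.log(splitMetaContent('Solution1,Solution2,Harvesting,Solution3'));
// [ 'Solution1', 'Solution2', 'Harvesting', 'Solution3' ]
console.log(splitMetaContent('Product1'));    // [ 'Product1' ]
console.log(splitMetaContent(undefined));     // []
console.log(toEpochMillis('2022-05-31T15:35:26-05:00'));
console.log(toEpochMillis(undefined));        // null
```

A single-value field such as product still comes back as a one-element array, so every multi-value attribute has the same shape no matter how many values a page supplies.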

Last, we set up a Locale Extractor of type JS to make sure the locale is set correctly based on the language of each item being indexed.

The code was as follows:

function extract(request, response) {
    $ = response.body;
    // Read the locale from the lang attribute of the <html> element.
    const locale = $('html').attr('lang');

    // Non-default locales that were added in Domain Settings.
    const supportedLocales = ['es_es', 'de_de', 'pt_pt', 'nl_nl', 'fr_fr'];
    if (supportedLocales.includes(locale)) {
        return locale;
    }

    // Fallback for a missing or unrecognized lang attribute.
    return 'en_us';
}
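
The fallback behavior can be checked before deploying by running the locale logic against a stubbed response. The sketch below is one way to do that in Node.js. fakeResponse is a hypothetical helper that imitates only the $('html').attr('lang') call used here; the real response object in Sitecore Search is richer. The extract function is a condensed version of the locale logic from this post.

```javascript
// Condensed version of the Locale Extractor logic: return the page's locale
// if it is one of the supported ones, otherwise fall back to en_us.
function extract(request, response) {
    const $ = response.body;
    const locale = $('html').attr('lang');
    const supportedLocales = ['es_es', 'de_de', 'pt_pt', 'nl_nl', 'fr_fr'];
    return supportedLocales.includes(locale) ? locale : 'en_us';
}

// Hypothetical stand-in for the crawler's response object. Its body is a
// function returning an object with attr(), which is all extract() needs.
function fakeResponse(lang) {
    return {
        body: selector => ({
            attr: name => (selector === 'html' && name === 'lang' ? lang : undefined),
        }),
    };
}

console.log(extract({}, fakeResponse('de_de')));    // de_de
console.log(extract({}, fakeResponse('it_it')));    // en_us (unsupported locale)
console.log(extract({}, fakeResponse(undefined)));  // en_us (no lang attribute)
```

Note that the comparison is exact: a page declaring lang="de-DE" would fall back to en_us, so the lang values on the crawled pages need to match the locale codes in Domain Settings.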

We can go to the Content page in Sitecore Search and confirm data has been crawled and is in our index:

Content

Then, we can select a result and see our data extracted:

Data

Conclusion

Sitecore Search is flexible and allows us to index content in multiple languages. This example shows a straightforward way to get content in multiple languages into a single Search source and index.