Meta-Press.es

Decentralized search engine & automatized press reviews

How to add a new source to the Meta-Press.es search engine?

Index

1. Methodology
- 1.1. Checklist for newly created sources
2. Examples
3. Regular expressions: *_re
4. Images: r_img
5. Date formats
6. Tags
- 6.1. Mandatory tags
- 6.2. Optional tags
7. Pre-treatment on sources responses: raw_rep_re
8. Gather multiple elements in a templated field: *_tpl
9. Get permission for redirection URL: redir_url
10. Setup a preliminary request: token_url
11. Declare the domain part: domain_part
12. Filter results of approximative sources: filter_results
13. Help
14. External doc about CSS, RegEx and XPath

If you are a programmer, you just have to add an entry in the json/sources.json JSON object (or write your entry in the settings panel of the add-on).

This documentation is here to guide you through this process.

Here are useful examples, listed at the top of the json/sources.json file:

Mediapart.fr is a good and simple example using "normal" CSS selectors
News.mn/en is an example of a source providing results in RSS format
News.mn extends the News.mn/en definition
Arret sur Images provides results in JSON
The Japan News uses the HTTP POST method
Helsinki Times uses XPaths to parse some fields

1. Methodology

First, verify that the source is not already known to Meta-Press.es, by checking:
- the source list of the extension (open the "Advanced search" and click the "Source list" button)
- the list of incompatible sources
- the list of broken sources
Then visit the website of the source you want to add and note its main URL (preferably in HTTPS).
Find and try its search functionality:
- check whether the results are accessible in RSS (or ATOM) format using the developer’s tools (F12 key, default Inspector tab, search for "rss")
  - if so, the source falls in the "type": "XML" case and you don’t need to provide the timezone of the source in the tags
- check that the result URL returns results in chronological order, or that the results can be sorted this way; otherwise the source is an incompatible one, see the admonition block below
- if the result URL does not contain your search terms, the source might be using the HTTP POST method; you can look at other sources using the POST method, such as The Japan News
- check that the results really come from this request via the developer’s tools: F12 key, Network tab, Response preview. Results can be loaded via JSON and XHR requests; see Arret sur Images for an example of how to deal with it
- test the search feature to find out how it works: search for several terms at once and inspect the results. Do the results contain all of your search terms, or just one of them? This will help you decide which technical tag to put on the source
- note a search term that gives results, among: europe, paris, new, via, 1, 2000. If none of them gives results, contact the dev. team
- check whether the source provides different types of results: text, image, video, audio. If so, you will be able to create one source entry per result type (it’s easy when you just extend your first source definition)
Search for its main RSS feed.
Search for its favicon, in its smallest version (32px wide is best).

If something goes wrong, like:

no search functionality
no date on results
no date sort

Please provide some feedback to the source about the problem and add it to the list of incompatible sources in the wiki, along with the status of your feedback effort.

You can also help by contacting the sources of this list that have received no feedback yet.

Then, to write the source definition, there are 4 kinds of information to provide:

general info: name, timezone and tags at the end;
headlines: one entry to point at the main RSS feed of the source;
search: the source search URL (which provides the results);
result parsing: 5 more entries to retrieve specific elements of each result (the last two being optional):
- title: r_h1
- link: r_url
- date: r_dt
- extract: r_txt
- author: r_by

Each of these entries can be followed by an _attr and an _re version of it. The _attr version targets a specific HTML attribute (or HTML-node JavaScript property ^[1]) of the designated HTML element. The _re version applies a .replace() to the retrieved value: it needs a list of two strings, the first being a regular expression and the second a replacement pattern (see example below).

It’s also possible to give an _xpath version to use XPaths instead of CSS selectors.
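
As a rough illustration of an _re pair, the sketch below applies one with JavaScript’s String.prototype.replace(). The sample text and the applyRe helper are assumptions made for this example, not code taken from Meta-Press.es.

```javascript
// Hypothetical helper: applies a [regular expression, replacement] pair to a value.
function applyRe(value, [pattern, replacement]) {
  return value.replace(new RegExp(pattern), replacement);
}

// Made-up text, as it could be grabbed by a res_nb selector.
const grabbed = 'Showing 128 results for europe';
// The pattern spans the whole string, so only the captured number remains.
const resNbRe = ['^.*?(\\d+) results.*$', '$1'];

console.log(applyRe(grabbed, resNbRe)); // 128
```

Because .replace() only rewrites the matched portion, a pattern that does not cover the whole string leaves the surrounding text in place.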

1.1. Checklist for newly created sources

Once a source seems to be working, here are some common points to check before contributing the source to the development team. Some defects are not visible at first sight when a source is giving results:

is the news_rss_url field filled? (is it applicable, is there a news feed provided by the source?)
is the res_nb field filled? (does the source display its total number of available results? This information is unfortunately missing on "type": "XML" sources)
are the search results sorted by date by the source?
do the results in Meta-Press.es lead to the correct page on the source?
is one of the tags many words, one word or approx attributed?
- To check if a source deserves the many words tag: make a search with one search term (and note the number of available results), then search for the 1st term plus one more. If there are fewer available results in the 2nd search (or no results at all if your 2nd search term is unknown, like: zorglub), the source is seemingly a many words one, crossing terms instead of adding them.
- To check that a source is one word: try an ambiguous search (like "bid") and check whether the results are strictly about "bid" or also contain approximate forms of your term (like "biden")
- If your source is an approx (because it fails the 2 previous tests) you may still improve it with the filter_results field (see the corresponding paragraph).
if the article content of this source is readable without any subscription you can add the access content tag; if you can read it directly from Meta-Press.es you can add direct content
if the source has illustrations, you should check that the alt= attribute of the <img … tag is filled, or you can add the r_img_alt field (and point it at the main title if you don’t have better content available).

2. Examples

2.1. RSS based source

{
	"https://news.mn/en": {
		"favicon_url": "https://news.mn/en/wp-content/uploads/2019/07/NE-16.jpg",
		"news_rss_url": "https://news.mn/en/feed/",
		"search_url": "https://news.mn/en/search/{}/feed", (1)
		"search_url_web": "https://news.mn/en/search/{}", (2)
		"type": "XML", (3)
		"tags": { … } (4)
	}
}

1	In this URL, the `{}` will be replaced by Meta-Press.es with your search terms.
2	This 2nd URL is used to redirect the user to the online result page of the source, for instance to dig deeper into this source.
3	When there is no `type` entry in the source definition, results are provided in HTML. Here we specify that the source is responding in `XML`, but it could also be `JSON`.
4	`tags` are explained later
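
A minimal sketch of the `{}` substitution, assuming the search terms are simply URL-encoded before being inserted (Meta-Press.es may encode them differently). The buildSearchUrl helper is hypothetical.

```javascript
// Hypothetical helper: fills the `{}` placeholder of a search_url with the search terms.
function buildSearchUrl(searchUrl, terms) {
  return searchUrl.replace('{}', encodeURIComponent(terms));
}

const url = buildSearchUrl('https://news.mn/en/search/{}/feed', 'europe paris');
console.log(url); // https://news.mn/en/search/europe%20paris/feed
```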

2.2. Extend your own source definitions: `extends`

	"https://news.mn": {
		"extends": "https://news.mn/en", (1)
		"news_rss_url": "https://news.mn/feed/",
		"search_url": "https://news.mn/search/{}/feed",
		"search_url_web": "https://news.mn/?s={}",
		"tags": { … }
	}

1	Here "https://news.mn/en" is the key of the entry to extend.

In this case, a copy of the extended source (here it’s https://news.mn/en) is used and completed with the provided elements of the new source https://news.mn.

If you need to remove an element coming from the extended source, you can set it to the null value in the new source definition. The entry then stays in place with a null value and is not taken into account by Meta-Press.es.
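
The copy-then-complete behavior described above can be pictured with the following sketch. It is an illustration under assumptions, not the actual Meta-Press.es code: resolveExtends is a hypothetical helper and the two entries are shortened versions of the examples.

```javascript
// Hypothetical resolver: start from a copy of the extended entry,
// then overlay the entries of the new source on top of it.
function resolveExtends(sources, key) {
  const { extends: parentKey, ...own } = sources[key];
  if (!parentKey) return own;
  return { ...resolveExtends(sources, parentKey), ...own };
}

const sources = {
  'https://news.mn/en': {
    favicon_url: 'https://news.mn/en/wp-content/uploads/2019/07/NE-16.jpg',
    search_url: 'https://news.mn/en/search/{}/feed',
    type: 'XML',
  },
  'https://news.mn': {
    extends: 'https://news.mn/en',
    search_url: 'https://news.mn/search/{}/feed',
    favicon_url: null, // inherited value neutralised with null (for illustration only)
  },
};

const resolved = resolveExtends(sources, 'https://news.mn');
console.log(resolved.type);        // XML (inherited)
console.log(resolved.search_url);  // https://news.mn/search/{}/feed (overridden)
console.log(resolved.favicon_url); // null (present, but to be ignored)
```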

2.3. JSON based source definition

To diagnose an AJAX result loading case, it’s possible to use Firefox’s developer tools. The F12 key opens those tools, and then we can click on the Console tab. The XHR requests that occur after the initial page loading are listed there. Each request can be inspected in the console, including its JSON response payload.

If the inspected request contains your search results then you already have its address, and you can determine the JSON paths to reach each wanted piece of information.

	"https://www.arretsurimages.net": {
    "favicon_url": "https://www.arretsurimages.net/assets/img/favicon/favicon-32x32.png",
    "news_rss_url": "https://api.arretsurimages.net/api/public/rss/all-content",
    "type": "JSON",
    "search_url": "https://api.arretsurimages.net/api/public/search?q={}&sort=last_version_at&limit={<100}", (1)
    "search_url_web": "https://www.arretsurimages.net",
    "res_nb": "hits -> total", (2)
    "results": "hits -> hits", (3)
    "r_h1": "title -> text",
    "r_url": "path",
    "r_dt": "last_version_at",
    "r_txt": "tease",
    "r_by": "authors", (4)
    "r_by_attr": "name", (4)
		…

1	`{<100}` will be replaced by Meta-Press.es with the max number of results per request. You can change this parameter in the settings but it won’t exceed 100 in this case because the source would refuse to answer.
2	When parsing JSON objects, you can specify a path (separated with " -> " spaced arrow signs) to point at deep values (not at the 1st level).
3	This JSON path points to the list of results that Meta-Press.es will go through.
4	The `r_by` property points at a JSON list, and `r_by_attr` designates the attribute to fetch from each element of the list. The names are then joined with commas between them to build the list of authors as a single field.
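
To make the JSON paths more concrete, here is a small sketch of how a " -> " separated path could be resolved, and how `r_by` and `r_by_attr` could be combined. The jsonPath helper and the sample response are assumptions for illustration; the real payload of the source is richer.

```javascript
// Hypothetical resolver for " -> " separated JSON paths.
function jsonPath(obj, path) {
  return path.split(' -> ').reduce(
    (node, key) => (node == null ? undefined : node[key]),
    obj
  );
}

// Made-up response, shaped like the paths used in the definition above.
const response = {
  hits: {
    total: 2,
    hits: [
      { title: { text: 'First' }, authors: [{ name: 'Ann' }, { name: 'Bob' }] },
      { title: { text: 'Second' }, authors: [{ name: 'Cy' }] },
    ],
  },
};

const total = jsonPath(response, 'hits -> total');   // "res_nb"
const results = jsonPath(response, 'hits -> hits');  // "results"
const titles = results.map((r) => jsonPath(r, 'title -> text')); // "r_h1"
// "r_by" points at a list, "r_by_attr" names the attribute to read on each element.
const authors = results.map((r) =>
  jsonPath(r, 'authors').map((a) => a['name']).join(', ')
);

console.log(total, titles, authors);
```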

2.3.1. `json_to_html`

Results might also be sent as valid HTML embedded in a JSON object.

In this case you can specify a json_to_html JSON path in the source definition to point at the specific JSON location where a valid HTML string will be found and parsed.

For the moment only one JSON location can be parsed as HTML, but this might become finer-grained with per-field conversion (r_h1_html, r_url_html…).

2.4. CSS based source definition

	"https://www.mediapart.fr": {
		"favicon_url": "https://www.mediapart.fr/assets/front/favicon/journal/favicon-32x32.png",
		"news_rss_url": "https://www.mediapart.fr/articles/feed",
		"search_url": "https://www.mediapart.fr/search?search_word={}&sort=date&order=desc",
		"res_nb": ".sub-title",
		"res_nb_re": [
			"^(\\d+?) ",
			"$1"
		], (1)
		"results":	"ul.search > li", (2)
		"r_h1": "h2",
		"r_url": "h2 > a",
		"r_url_attr": "href", (3)
		"r_dt": ".author",
		"r_dt_fmt_1": [
			"\\s(\\d+?)[ermè]*? (.+?) (\\d{4})",
			"$3-{$2}-$1"
		], (4)
		"r_txt": "p",
		"r_by": ".author a[rel=author]",
		"tags": { … }
	},

1	`res_nb` can also use a `_re` complementary entry, here it extracts a number at the beginning of a line
2	This CSS expression extracts the results from the web page. It points directly at the results collection, which will be grabbed via `querySelectorAll()`. Note that we used a strict CSS selector (with `>`) to ensure we don’t grab unwanted elements from elsewhere on the page.
3	`r_url_attr` gets the `href` attribute value
4	`r_dt_fmt_1`: Here we capture the date elements to put them in the right order. The month name (pointed at by the `{$2}`) will be converted to the correct number. Note that to specify a backslash in a JavaScript string, you need to escape it, hence the double backslash in `"\\s"` and `"\\d"`. Finally, as the name of this attribute suggests, you can define as many date formats as the source uses (for instance if the source is using relative date formats like "1h ago" in addition to the absolute one "2022-03-21").
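
The date format step can be pictured with the sketch below. It is only an approximation under stated assumptions: the month table, the zero-padding of captured groups and the applyDateFormat helper are hypothetical, and the sample text is made up.

```javascript
// Hypothetical month table; the real conversion depends on the source language.
const MONTHS_FR = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin',
  'juillet', 'août', 'septembre', 'octobre', 'novembre', 'décembre'];

// Hypothetical helper: captures the date parts, then fills the replacement pattern.
function applyDateFormat(raw, [pattern, replacement]) {
  const m = raw.match(new RegExp(pattern));
  if (!m) return null;
  return replacement
    // {$n}: the captured month name is converted to its zero-padded number
    .replace(/\{\$(\d)\}/g, (_, n) =>
      String(MONTHS_FR.indexOf(m[n].toLowerCase()) + 1).padStart(2, '0'))
    // $n: the captured group is inserted, zero-padded to 2 characters (assumption)
    .replace(/\$(\d)/g, (_, n) => m[n].padStart(2, '0'));
}

const rDtFmt1 = ['\\s(\\d+?)[ermè]*? (.+?) (\\d{4})', '$3-{$2}-$1'];
const isoLike = applyDateFormat('Par Jean Dupont, 1er mars 2022', rDtFmt1);
console.log(isoLike); // 2022-03-01
```

The resulting string is in a form that `new Date()` accepts.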

2.5. HTTP POST based source definition

	"https://the-japan-news.com": {
		"favicon_url": "https://the-japan-news.com/favicon.ico",
		"method": "POST", (1)
		"body": "siteSearchInput={}&x=7&y=11&span=365", (2)
		"search_url": "https://the-japan-news.com/news/search",
		…
		"r_dt": "time", (3)
		"r_dt_attr": "datetime", (3)
		…
	}

1	In addition to the usual `search_url`, we need to set the POST method
2	And a body for the request, which is the POST equivalent of the GET query string. This is called the `application/x-www-form-urlencoded` format. The body might also be JSON, and in this case you’ll have to specify a `search_ctype` entry with `'application/json'` content. There is also the `multipart/form-data` format (used by LaVie.fr for instance).
3	Here we can note that when a `<time datetime="">` HTML tag is available, it’s preferable to use it, to avoid the regular expression date format step and to avoid having a timezone to define in the tags.
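
As an illustration, the entries above could be turned into request options for the standard fetch() API as sketched below. The buildPostRequest helper and the default content type are assumptions; no request is actually sent here.

```javascript
// Hypothetical helper: turns a POST source definition into fetch() arguments.
function buildPostRequest(source, terms) {
  return {
    url: source.search_url,
    options: {
      method: source.method,
      headers: {
        // Assumed default when no search_ctype entry is given.
        'Content-Type': source.search_ctype || 'application/x-www-form-urlencoded',
      },
      body: source.body.replace('{}', encodeURIComponent(terms)),
    },
  };
}

const japanNews = {
  method: 'POST',
  body: 'siteSearchInput={}&x=7&y=11&span=365',
  search_url: 'https://the-japan-news.com/news/search',
};

const request = buildPostRequest(japanNews, 'paris');
console.log(request.options.body); // siteSearchInput=paris&x=7&y=11&span=365
// The request could then be sent with: fetch(request.url, request.options)
```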

2.6. XPath based source definition: `*_xpath`

XPath is a very powerful language and it can be used in place of any CSS selector.

   "https://www.helsinkitimes.fi": {
		"favicon_url": "https://www.helsinkitimes.fi/templates/ja_teline_v/favicon.ico",
		"news_rss_url": "https://www.helsinkitimes.fi/?format=feed&type=rss",
		"search_url": "https://www.helsinkitimes.fi/search1332318146.html?searchword={}&ordering=newest&searchphrase=all",
		"res_nb": ".searchintro .bagde",
		"results": ".result-title",
		"r_h1": "a",
		"r_url": "a",
		"r_url_attr": "href",
		"r_dt_xpath": "./following-sibling::dd[@class='result-created'][1]/strong", (1)
		"r_txt_xpath": "./following-sibling::dd[@class='result-text'][1]",
		"r_by_xpath": "./following-sibling::dd[@class='result-category'][1]/span",
		"tags": { … }
	}

1	Instead of a regular `r_dt` field, here we have an `r_dt_xpath` field. So it’s not a CSS selector but an XPath definition that follows. Here it reaches the next sibling element relative to the current one, which is not possible via CSS.

One can also note that XPath allows:

finding an element based on the fact that it contains text, or based on the text it contains
reaching parent elements
reaching previous elements
reaching HTML comment nodes

XPath is also needed when XML namespaces are involved (like in most encountered RSS feeds extended with the Dublin Core DTD).

3. Regular expressions: `*_re`

Regular expressions are a complex subject; some documentation is listed in the external doc section. If you have already worked with RegEx, here are some key points to keep in mind:

patterns need to be delimited by known elements before and after what you want to extract: in "\\s(\\d+?) " there is a space (or a tab) before and a space after.
you mainly need: \\d+? \\w+? \\s+? (to match digits, word characters, and any kind of whitespace)
then you’ll mostly use: () ()? (?:) (parentheses extract the pattern they enclose, a ? after them marks a pattern that might be missing, and ?: at the beginning inside them avoids extracting the group, so it has no corresponding "$1" / "$2" in the replacement pattern).
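
The three kinds of groups can be seen at work in this short sketch. The pattern and the two sample strings are made up for the example.

```javascript
// (\d+?)          extracted group, delimited by a whitespace before
// (?:st|nd|rd|th)? optional group that is NOT extracted (no $n for it)
// ( \d{2}:\d{2})?  extracted group that may be missing
const pattern = new RegExp(
  '\\s(\\d+?)(?:st|nd|rd|th)? (\\w+?) (\\d{4})( \\d{2}:\\d{2})?'
);

const withTime = 'Published on 21st March 2022 14:30'.match(pattern);
const withoutTime = 'Updated 3 May 2021'.match(pattern);

console.log(withTime.slice(1));    // [ '21', 'March', '2022', ' 14:30' ]
console.log(withoutTime.slice(1)); // [ '3', 'May', '2021', undefined ]
```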

4. Images: `r_img`

The integration of images in Meta-Press.es results is possible via the fields: r_img_src, r_img_alt and r_img_title.

r_img is a shorthand that retrieves all the fields of an image at once, with a CSS or XPath selector, when it points at an <img … HTML tag whose src attribute holds the image source (and, optionally, whose alt and title attributes hold the alternative text and the title). These fields are then integrated directly, without any additional processing.

If it’s not the case (as for Euronews where the information is stored in other attributes like data-src, data-alt, data-title, or Die Presse where the information is stored in different HTML tags) it is possible to complete the definition of images with the r_img_src, r_img_alt and r_img_title fields and even r_img_src_attr, r_img_alt_attr and r_img_title_attr.

For JSON sources with images (such as La Croix or Les Echos), r_img is useless and r_img_src is mandatory; it’s advised to add r_img_alt and r_img_title if the information is available.

It is possible as well to use regular expressions on these fields with _re (e.g. El Mercurio (fotos)) or templates with _tpl (e.g. Les Echos).

5. Date formats

Meta-Press.es supports every date format accepted by new Date('date_string') and English relative dates like 3 minutes ago, 8 hours ago or even today and yesterday.

5.1. Languages

For sources in other languages, the dates have to be converted to one of the supported formats (generally an ISO 8601-like format, yyyy-mm-dd hh:mm:ss tz, is used).

5.2. Multiple formats: `r_dt_fmt_*`

Then, as sources may use different date formats (based on the age of the results) you can specify multiple date formats named: r_dt_fmt_1, r_dt_fmt_2, …

Those formats are RegEx replacement patterns, and they are tried one after another until a valid date comes out.
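
The "try each format until a valid date comes out" logic can be sketched as follows. The parseWithFormats helper, the two formats and the sample dates are assumptions made for this illustration.

```javascript
// Hypothetical helper: tries each [pattern, replacement] pair in order.
function parseWithFormats(raw, formats) {
  for (const [pattern, replacement] of formats) {
    const re = new RegExp(pattern);
    if (!re.test(raw)) continue; // this format does not apply, try the next one
    const date = new Date(raw.replace(re, replacement));
    if (!Number.isNaN(date.getTime())) return date; // valid date found
  }
  return null;
}

const formats = [
  ['^.*?(\\d{2})/(\\d{2})/(\\d{4}).*$', '$3-$2-$1'],     // like r_dt_fmt_1: 21/03/2022
  ['^.*?(\\d{4})\\.(\\d{2})\\.(\\d{2}).*$', '$1-$2-$3'], // like r_dt_fmt_2: 2022.03.21
];

const first = parseWithFormats('21/03/2022', formats);
const second = parseWithFormats('Updated 2022.03.21', formats);
console.log(first.toISOString().slice(0, 10));  // 2022-03-21
console.log(second.toISOString().slice(0, 10)); // 2022-03-21
```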

5.3. TimeZones: `tz`

Otherwise, all the dates are normalized with regard to their time zones by Meta-Press.es (function timezoned_date() in js/BOM_utils.js, which relies on the toLocaleTimeString() function), using the "tz" entry of the "tags", if provided, when the information is not already included in the grabbed date format. A native JavaScript API would be welcome in this area.

5.4. Month name conversion

As shown in the CSS based source definition example, you can get a month name converted to its number by putting its group between curly braces: "$3-{$2}-$1".

But if your date is written in English in a Japanese newspaper you’ll have to set a date_locale entry in the tags to get correct month name conversions.

A date_locale is used, for instance, by Arabnews.jp.

Corriere della Sera is also using one, with a special browser value to indicate that the newspaper is serving dates using the user’s browser locale.

6. Tags

6.1. Mandatory tags

Here is an example of the mandatory tags to define for a source. Adapt the content of each tag to match your new source:

"tags": {
	"name": "Mediapart.fr",
	"lang": "pt", (1)
	"country": "br", (2)
	"themes": [
		"general",
		"politics"
	],
	"tech": [ (3)
		"one word",
		"fast"
	],
	"src_type": [ (4)
		"Press",
		"Reference Press"
	],
	"res_type": [ (5)
		"text",
		"image"
	]
	…

1	The two-letter code of the source language, following the ISO 639 standard. Here we pretend Mediapart is from Brazil to highlight the difference between lang and country.
2	The two-letter code of the country, following the ISO 3166 standard. Here we pretend Mediapart is from Brazil to highlight the difference between lang and country.
3	Technical tags mostly work in pairs:
- one word or many words: depends on the ability of the source to give results that match one word or all the words of a query/search. If the source can’t give matching results even for one word, the approx tag is used; those sources are usually deceitful with queries for which they have no proper answers, but still useful on widely covered subjects. If a source is configured to return results matching the exact given expression (for instance because the search terms are integrated with quotes around them in its search URL) it is tagged exact.
- fast or slow: currently depends on whether results are fetched in less than 3 seconds or more. We will live-test this information for more accuracy in the future.
- indep.: if the source is not part of a bigger group with non-journalistic activities, and is not owned by a state or by a company listed on a stock exchange, it can be defined as independent with this tag.
- access content: when the source is true web and you can read the articles without any paywall.
- direct content: when you can read the content of an article directly in Meta-Press.es (because the source is pushing it along with its results, so the result description is in fact the entire article).
- for kids: these sources are the only available sources when the "child mode" is activated in the settings. You are encouraged to also add for kids < 9 or for kids > 9 when relevant.
- broken: keeps the source from being used (for instance if it has been reported as defective).
4	Refer to the Meta-Press.es main search interface to find the list of the source types in use
5	Refer to the Meta-Press.es main search interface to find the list of the result types in use

6.2. Optional tags

	…
	"tz": "Europe/Paris", (1)
	"charset": "gb2312", (2)
	"date_locale": "en" (3)
}

1	The timezone tz tag is only needed if the dates of the results carry no timezone.
2	The charset tag is only needed when the source is not serving its web pages in UTF-8.
3	The date_locale tag is only needed if you have to get a month name converted to its number but the date is not written in the same language as the rest of the newspaper.

7. Pre-treatment on sources responses: `raw_rep_re`

Results might be sent in a hybrid or invalid format.

In this case it’s possible to specify a regex replacement pattern to extract useful data from the answer of the source server (the JSON included in a JSONP file, or an RSS file stripped of its invalid headers, for instance).

This feature is currently only available for JSON and XML source types.

Several third party search services (like Queryly and Algolia) allow a callback to be set up from the GET HTTP request: it’s the name of a JavaScript function that will be called with the data as argument. When this parameter is set, the results arrive as a JSONP script, so you can simply remove the parameter from the original request to receive clean JSON instead.
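
For the JSONP case, a replacement pair could look like the sketch below. It assumes that raw_rep_re takes a [regular expression, replacement] pair like the other _re entries; the callback name and the payload are made up.

```javascript
// Assumed shape of a raw_rep_re entry: keep only what sits between
// the parentheses of the JSONP callback.
const rawRepRe = ['^[^(]*\\(([\\s\\S]*)\\);?\\s*$', '$1'];

// Made-up JSONP answer.
const raw = 'callback123({"hits": [{"title": "Hello"}]});';

const clean = raw.replace(new RegExp(rawRepRe[0]), rawRepRe[1]);
const data = JSON.parse(clean);
console.log(data.hits[0].title); // Hello
```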

8. Gather multiple elements in a templated field: `*_tpl`

Sometimes the information you want to display in a result field is scattered in several places in the result page (or JSON). In this case you can list multiple elements to grab and a template string to stitch them together.

To do this, define a list of elements (JSON paths or CSS selectors…) for the specified field (such as: r_txt), and add an r_txt_tpl entry defining a string where you can put replacement tokens like $1, $2 … which will be replaced by the respective values of the elements of the list.

	"r_txt": ["description", "city"],
	"r_txt_tpl": "$1 ($2)"

Furthermore, you can define an r_txt_attr with a list of attributes to be retrieved.

Finally, if the last attribute name is missing in the list, the textContent of the last element will be retrieved instead.

Check arretsurimages.net or publicsenat.fr for examples.
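
A minimal sketch of the template filling, assuming each $n token is replaced by the n-th retrieved value. The fillTemplate helper and the sample result are hypothetical.

```javascript
// Hypothetical helper: replaces $1, $2… by the corresponding values of the list.
function fillTemplate(tpl, values) {
  return tpl.replace(/\$(\d+)/g, (token, n) => values[n - 1] ?? token);
}

// Made-up JSON result and the r_txt / r_txt_tpl entries from the example above.
const result = { description: 'Flood warning issued', city: 'Lyon' };
const rTxt = ['description', 'city'];
const rTxtTpl = '$1 ($2)';

const text = fillTemplate(rTxtTpl, rTxt.map((field) => result[field]));
console.log(text); // Flood warning issued (Lyon)
```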

9. Get permission for redirection URL: `redir_url`

A source may need to perform an HTTP redirection to actually serve results. If it’s possible to directly target the 2nd URL, that is still the simplest way. But if it’s not possible, like with LeSoir.be, one will have to add a redir_url field in its source definition. Meta-Press.es will then ask for the Host Permission of this domain too (at search time).

10. Setup a preliminary request: `token_url`

Some sources need a token (generated with their regular search form page for instance) to serve results. Others need a preliminary request to set up the language of the next response…

In those cases it’s possible to define a preliminary request that Meta-Press.es will perform before the regular search.

This request can be tuned as finely as regular search requests:

	"token": {
		"token_url": "https://www.…",
		"token_method": "POST",
		"token_ctype": "application/json",
		"token_body": "{\"lang\": \"fr\"}"
	},

A token_sel field also exists: it extracts an element from the page and injects it in the search URL of the source via the "{T}" replacement token.

    "token": {
      "token_url": "https://www. … .com/search",
      "token_sel": "body > script:nth-of-type(8)",
      "token_sel_re": [
        "\"searchToken\":\"([^\"]+)\"",
        "$1"
      ]
    },
		"search_url": "https://www. … .com/search/api?qs={}&t={T}&sortBy=date",

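The token extraction and injection can be pictured with this sketch. The extractToken helper, the script content and the example.org URL are assumptions for illustration; how Meta-Press.es applies the pair internally is not shown in this document.

```javascript
// Hypothetical helper: finds the pattern in the selected element's text
// and returns the expanded replacement (here the first captured group).
function extractToken(text, [pattern, replacement]) {
  const re = new RegExp(pattern);
  const m = text.match(re);
  return m ? m[0].replace(re, replacement) : null;
}

const tokenSelRe = ['"searchToken":"([^"]+)"', '$1'];
// Made-up content of the <script> element pointed at by token_sel.
const scriptText = 'window.cfg={"searchToken":"abc123","lang":"fr"};';

const token = extractToken(scriptText, tokenSelRe);
// Placeholder URL: the real domain is elided in the definition above.
const searchUrl = 'https://example.org/search/api?qs={}&t={T}&sortBy=date'
  .replace('{T}', encodeURIComponent(token))
  .replace('{}', encodeURIComponent('paris'));

console.log(searchUrl); // https://example.org/search/api?qs=paris&t=abc123&sortBy=date
```
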
11. Declare the domain part: `domain_part`

If a source is using relative URLs in its href attributes, those URLs will be completed with a prefix containing the source domain. Unfortunately, if the correct path contains additional subfolders, you will have to specify which "domain_part" to use to complete relative URLs via a dedicated field in the source definition. It looks like this:

	"domain_part": "http://china.dailynk.com/chinese",

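As a rough sketch of the completion of relative URLs, assuming the prefix is the domain_part when given and the origin of the source key otherwise. The completeUrl helper and the read.php path are hypothetical.

```javascript
// Hypothetical helper: completes a relative href with the appropriate prefix.
function completeUrl(href, sourceKey, domainPart) {
  if (/^https?:\/\//.test(href)) return href; // already absolute
  const prefix = (domainPart || new URL(sourceKey).origin).replace(/\/$/, '');
  return prefix + '/' + href.replace(/^\//, '');
}

const withPart = completeUrl(
  'read.php?id=1',
  'http://china.dailynk.com',
  'http://china.dailynk.com/chinese'
);
const withoutPart = completeUrl('read.php?id=1', 'http://china.dailynk.com');

console.log(withPart);    // http://china.dailynk.com/chinese/read.php?id=1
console.log(withoutPart); // http://china.dailynk.com/read.php?id=1
```
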
12. Filter results of approximative sources: `filter_results`

Some sources return approximate results (not containing exactly your search terms); they are tagged "approx".

Among those sources, some display (for each result) the portion of the text that matches your terms (or their approximate version). So, for those sources, we can verify whether results really match your search terms, and keep only the good results. To trigger this behavior, add the following entry in your source definition: