{
"https://news.mn/en": {
"favicon_url": "https://news.mn/en/wp-content/uploads/2019/07/NE-16.jpg",
"news_rss_url": "https://news.mn/en/feed/",
"search_url": "https://news.mn/en/search/{}/feed", (1)
"search_url_web": "https://news.mn/en/search/{}", (2)
"type": "XML", (3)
"tags": { … } (4)
}
}
Meta-Press.es
Decentralized search engine & automatized press reviews
How to add a new source to the Meta-Press.es search engine?
- 1. Methodology
- 2. Examples
- 3. Regular expressions: *_re
- 4. Images: r_img
- 5. Date formats
- 6. Tags
- 7. Pre-treatment on sources responses: raw_rep_re
- 8. Gather multiple elements in a templated field: *_tpl
- 9. Get permission for redirection URL: redir_url
- 10. Setup a preliminary request: token_url
- 11. Declare the domain part: domain_part
- 12. Filter results of approximative sources: filter_results
- 13. Help
- 14. External doc about CSS, RegEx and XPath
If you are a programmer, you just have to add an entry in the json/sources.json JSON object (or write your entry in the setting panel of the add-on).
This documentation is here to guide you through this process.
Here are useful examples, listed at the top of the json/sources.json file:
- Mediapart.fr is a good and simple example using "normal" CSS selectors
- News.mn/en is an example of a source providing results in RSS format
- News.mn extends the News.mn/en definition
- Arret sur Images provides results in JSON
- The Japan News uses the HTTP POST method
- Helsinki Times uses XPaths to parse some fields
1. Methodology
- First, verify that the source is not already known to Meta-Press.es, checking:
  - The source list of the extension (opening the "Advanced search" and clicking the "Source list" button)
  - The list of incompatible sources
  - The list of broken sources
- Then visit the website of the source you want to add and note its main URL (preferably in HTTPS);
- Find and try its search functionality:
  - check if the results are accessible in RSS (or ATOM) format using the developer’s tools (F12 key, default Inspector tab, search for "rss")
    - in this case: the source is in the "type": "XML" case and you don’t need to provide the timezone of the source in the tags
  - check that this URL gives results in chronological order, or have the results sorted this way, else the source is an incompatible one, see the admonition block below
  - if the result URL does not contain your search terms, the source might be using the POST HTTP method; you can look at other sources using the POST method, such as The Japan News.
  - check that results really come from this request via the developer’s tools: F12 key, Network tab, Response preview. Results can be loaded via JSON and XHR requests; see Arret sur Images for an example of how to deal with it
  - test the search feature to find out how it works: search for many terms at once and inspect the results. Do the results contain all of your search terms, or just one of them? This will help to decide which technical tag to put on the source.
  - note a search term that gives results, among: europe, paris, new, via, 1, 2000. If none gives results, contact the dev. team.
  - check if the source is providing different types of results: text, image, video, audio; in this case, you will be able to create one source entry per result type (it’s easy when you just extend your first source definition)
- Search for its main RSS feed
- Search for its favicon, smallest version (32px width for the best)
If something goes wrong (for instance if the results cannot be sorted in chronological order), please provide some feedback to the source about the problem and add it to the list of incompatible sources in the wiki with your feedback effort status. You can also help by contacting sources of this list with no feedback yet.
Then, to write the source definition, there are 4 kinds of information to provide:
- general info: name, timezone and tags at the end;
- headlines: one entry to point at the main RSS feed of the source;
- search: the source search URL (which provides the results);
- result parsing: 5 more entries to retrieve specific elements of each result (the last two being optional):
  - title: r_h1
  - link: r_url
  - date: r_dt
  - extract: r_txt
  - author: r_by
Each of these entries can be followed by an _attr and an _re version of it. The _attr version targets a specific HTML attribute (or HTML-node JavaScript [1] attribute) of the designated HTML element. The _re version applies a .replace() on the retrieved value: it needs a list of two strings, the first being a regular expression and the second a replacement pattern (see example below).
It’s also possible to give an _xpath version to use XPaths instead of CSS selectors.
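As an illustration of the two-string _re form, here is a minimal JavaScript sketch. The applyRe helper and the sample counter text are hypothetical (they are not taken from the add-on's code); only the use of .replace() with a regular expression and a replacement pattern comes from the description above.

```javascript
// Hypothetical helper (not the add-on's code): apply an "_re" pair to a value.
function applyRe(value, rePair) {
  const [pattern, replacement] = rePair;
  // The first string is compiled as a regular expression,
  // the second one is used as the replacement pattern.
  return value.replace(new RegExp(pattern), replacement);
}

// Illustrative pair: keep only the leading number of a result counter.
const resNbRe = ["^(\\d+) results.*$", "$1"];
const resultCount = applyRe("42 results found", resNbRe);
console.log(resultCount); // "42"
```

Because the pattern here covers the whole string, only the captured number remains after the replacement.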
1.1. Checklist for newly created sources
Once a source seems to be working, here are some common points to check before contributing it to the development team. Some defects are not visible at first sight when a source is giving results:
- is the news_rss_url field filled? (is it applicable, is there a news feed provided by the source?)
- is the res_nb field filled? (does the source display its total number of available results? This information is unfortunately missing on "type": "XML" sources)
- are the search results sorted by date by the source?
- are results in Meta-Press.es leading to the correct page on the source?
- are the tags many words, one word or approx attributed?
  - To check if a source deserves the many words tag: make a search with one search term (and note the number of available results), then search for the 1st term and one more. If there are fewer available results in the 2nd search (or no results at all if your 2nd search term is unknown, like: zorglub), the source is seemingly a many words one, crossing terms instead of adding them.
  - To check that a source is one word: try an ambiguous search (like "bid") and check if the results are strictly about "bid" or also contain approximate forms of your term (like "biden").
  - If your source is an approx (because it fails the 2 previous tests) you may still improve it with the filter_results field (see the corresponding paragraph).
- if the article content of this source is readable without any subscription you can add the access content tag; if you can read it from Meta-Press.es you can add direct content
- if the source has illustrations, you should check that the alt= attribute of the <img … tag is filled, or you can add the r_img_alt field (and point it at the main title if you don’t have better content available).
2. Examples
2.1. RSS based source
The callouts below refer to the News.mn/en source definition shown at the top of this page.
(1) In this URL, the {} will be replaced by Meta-Press.es with your search terms.
(2) This 2nd URL allows redirecting the user to the online result page of the source, for instance to go deeper into this source.
(3) When there is no type entry in the source definition, results are provided in HTML. Here we specify that the source is responding in XML, but it could also be JSON.
(4) tags are explained later.
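The {} substitution can be pictured with the following JavaScript sketch. The buildSearchUrl helper is hypothetical, and percent-encoding the terms with encodeURIComponent is an assumption; the actual encoding used by Meta-Press.es is not described here.

```javascript
// Hypothetical helper: fill the "{}" placeholder of a search_url.
function buildSearchUrl(searchUrl, terms) {
  // Assumption: the search terms are percent-encoded before insertion.
  const encoded = encodeURIComponent(terms);
  // A function is used as replacement so that no "$" sequence is interpreted.
  return searchUrl.replace("{}", () => encoded);
}

const feedUrl = buildSearchUrl("https://news.mn/en/search/{}/feed", "new paris");
console.log(feedUrl); // "https://news.mn/en/search/new%20paris/feed"
```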
2.2. Extend your own source definitions: extends
"https://news.mn": {
"extends": "https://news.mn/en", (1)
"news_rss_url": "https://news.mn/feed/",
"search_url": "https://news.mn/search/{}/feed",
"search_url_web": "https://news.mn/?s={}",
"tags": { … }
}
(1) Here "https://news.mn/en" is the key of the entry to extend.
In this case, a copy of the extended source (here it’s https://news.mn/en) is used and completed with the provided elements of the new source https://news.mn.
If you need to remove an element coming from the extended source, you can set it to the null value in the new source definition. This way Meta-Press.es will not take it into account.
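The behaviour of extends can be sketched in JavaScript as a copy of the extended entry overlaid with the new one. The resolveExtends helper is hypothetical, and dropping the null-valued entries from the merged object is only one possible way to have them ignored; the "favicon_url": null line is added for illustration.

```javascript
// Hypothetical sketch: resolve an "extends" chain into a plain definition.
function resolveExtends(sources, key) {
  const definition = sources[key];
  if (!definition.extends) return { ...definition };
  // Copy the extended source, then overlay the entries of the new source.
  const merged = { ...resolveExtends(sources, definition.extends), ...definition };
  delete merged.extends;
  // Assumption: an entry set to null is simply dropped from the result.
  for (const name of Object.keys(merged)) {
    if (merged[name] === null) delete merged[name];
  }
  return merged;
}

const sources = {
  "https://news.mn/en": {
    favicon_url: "https://news.mn/en/wp-content/uploads/2019/07/NE-16.jpg",
    search_url: "https://news.mn/en/search/{}/feed",
    type: "XML",
  },
  "https://news.mn": {
    extends: "https://news.mn/en",
    search_url: "https://news.mn/search/{}/feed",
    favicon_url: null, // illustrative only: removes the inherited favicon
  },
};

const resolved = resolveExtends(sources, "https://news.mn");
console.log(resolved); // type is inherited, search_url is overridden
```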
2.3. JSON based source definition
To diagnose an AJAX result loading case, it’s possible to use the Firefox developer’s tools.
The F12 key opens those tools, and then we can click on the Console tab.
The XHR requests that occur after the initial page loading are listed here.
Each request can be inspected in the console, including the JSON response payload.
If the inspected request contains your search results, then you already have its address and you can determine the JSON paths to reach each wanted piece of information.
"https://www.arretsurimages.net": {
"favicon_url": "https://www.arretsurimages.net/assets/img/favicon/favicon-32x32.png",
"news_rss_url": "https://api.arretsurimages.net/api/public/rss/all-content",
"type": "JSON",
"search_url": "https://api.arretsurimages.net/api/public/search?q={}&sort=last_version_at&limit={<100}", (1)
"search_url_web": "https://www.arretsurimages.net",
"res_nb": "hits -> total", (2)
"results": "hits -> hits", (3)
"r_h1": "title -> text",
"r_url": "path",
"r_dt": "last_version_at",
"r_txt": "tease",
"r_by": "authors", (4)
"r_by_attr": "name", (4)
…
(1) {<100} will be replaced by Meta-Press.es with the max number of results per request. You can change this parameter in the settings but it won’t exceed 100 in this case because the source would refuse to answer.
(2) When parsing JSON objects, you can specify a path (with keys separated by " -> ", a spaced arrow sign) to point at deep values (not at the 1st level).
(3) This JSON path points to the list of results that Meta-Press.es will go through.
(4) The r_by property points at a JSON list, and r_by_attr designates the attribute to fetch from each element of the list. The names are then joined with commas to build the list of authors as a single field.
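To make the JSON path notation concrete, here is a small JavaScript sketch. The jsonPath helper and the sample response are hypothetical; only the " -> " separator and the r_by / r_by_attr behaviour come from the callouts above, and the ", " joiner is an assumption.

```javascript
// Hypothetical helper: follow a " -> " separated path inside a JSON object.
function jsonPath(object, path) {
  return path
    .split(" -> ")
    .reduce((node, key) => (node == null ? undefined : node[key]), object);
}

// Made-up response, shaped like the paths of the Arret sur Images example.
const response = {
  hits: {
    total: 1,
    hits: [
      { title: { text: "A headline" }, authors: [{ name: "Ann" }, { name: "Bob" }] },
    ],
  },
};

const total = jsonPath(response, "hits -> total");
const firstResult = jsonPath(response, "hits -> hits")[0];
const headline = jsonPath(firstResult, "title -> text");
// r_by points at a list, r_by_attr names the attribute read on each element.
const authors = jsonPath(firstResult, "authors").map((a) => a.name).join(", ");
console.log(total, headline, authors);
```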
2.3.1. json_to_html
Results might also be sent as valid HTML embedded in a JSON object.
In this case you can specify a json_to_html JSON path in the source definition to point at the specific JSON location where a valid HTML string will be found and parsed.
For the moment only one JSON location can be parsed as HTML, but this might become finer-grained with per-field conversion (r_h1_html, r_url_html…).
2.4. CSS based source definition
"https://www.mediapart.fr": {
"favicon_url": "https://www.mediapart.fr/assets/front/favicon/journal/favicon-32x32.png",
"news_rss_url": "https://www.mediapart.fr/articles/feed",
"search_url": "https://www.mediapart.fr/search?search_word={}&sort=date&order=desc",
"res_nb": ".sub-title",
"res_nb_re": [
"^(\\d+?) ",
"$1"
], (1)
"results": "ul.search > li", (2)
"r_h1": "h2",
"r_url": "h2 > a",
"r_url_attr": "href", (3)
"r_dt": ".author",
"r_dt_fmt_1": [
"\\s(\\d+?)[ermè]*? (.+?) (\\d{4})",
"$3-{$2}-$1"
], (4)
"r_txt": "p",
"r_by": ".author a[rel=author]",
"tags": { … }
},
(1) res_nb can also use a complementary _re entry; here it extracts a number at the beginning of a line.
(2) This CSS expression extracts the results from the web page. It points directly at the results collection, which will be grabbed via querySelectorAll(). Note that we used a strict CSS selector (with >) to ensure we don’t grab unwanted elements from elsewhere on the page.
(3) r_url_attr gets the href attribute value.
(4) r_dt_fmt_1: a RegEx replacement pattern that rewrites the date into a supported format (see the Date formats section).
2.5. HTTP POST based source definition
"https://the-japan-news.com": {
"favicon_url": "https://the-japan-news.com/favicon.ico",
"method": "POST", (1)
"body": "siteSearchInput={}&x=7&y=11&span=365", (2)
"search_url": "https://the-japan-news.com/news/search",
…
"r_dt": "time", (3)
"r_dt_attr": "datetime", (3)
…
}
(1) In addition to the usual search_url, we need to set the POST method.
(2) And a body for the request, which is the POST equivalent of the GET query string. This is called the application/x-www-form-urlencoded format. It might also be JSON, and in this case you’ll have to specify a search_ctype entry with 'application/json' content. The multipart/form-data format also exists (used by LaVie.fr for instance).
(3) Here we can note that when a <time datetime=""> HTML tag is available, it’s preferable to use it: this avoids the regular expression date format step, and avoids having to define a timezone in the tags.
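The way method, body and search_ctype fit together can be sketched as follows in JavaScript. The buildRequest helper is hypothetical and only assembles the arguments one would pass to fetch(); the default content type and the encoding of the terms are assumptions.

```javascript
// Hypothetical sketch: turn a source definition into fetch() arguments.
function buildRequest(definition, terms) {
  const encoded = encodeURIComponent(terms); // assumption: percent-encoding
  const init = { method: definition.method || "GET" };
  if (init.method === "POST") {
    init.headers = {
      // Assumption: form encoding unless a search_ctype entry says otherwise.
      "Content-Type": definition.search_ctype || "application/x-www-form-urlencoded",
    };
    init.body = definition.body.replace("{}", () => encoded);
  }
  return { url: definition.search_url.replace("{}", () => encoded), init };
}

const japanNews = {
  method: "POST",
  body: "siteSearchInput={}&x=7&y=11&span=365",
  search_url: "https://the-japan-news.com/news/search",
};
const request = buildRequest(japanNews, "europe");
console.log(request.url, request.init.body);
// The request could then be sent with: fetch(request.url, request.init)
```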
2.6. XPath based source definition: *_xpath
XPath is a very powerful language and it can be used in place of any CSS selector.
"https://www.helsinkitimes.fi": {
"favicon_url": "https://www.helsinkitimes.fi/templates/ja_teline_v/favicon.ico",
"news_rss_url": "https://www.helsinkitimes.fi/?format=feed&type=rss",
"search_url": "https://www.helsinkitimes.fi/search1332318146.html?searchword={}&ordering=newest&searchphrase=all",
"res_nb": ".searchintro .bagde",
"results": ".result-title",
"r_h1": "a",
"r_url": "a",
"r_url_attr": "href",
"r_dt_xpath": "./following-sibling::dd[@class='result-created'][1]/strong", (1)
"r_txt_xpath": "./following-sibling::dd[@class='result-text'][1]",
"r_by_xpath": "./following-sibling::dd[@class='result-category'][1]/span",
"tags": { … }
}
(1) Instead of a regular r_dt field, here we have an r_dt_xpath field. So it’s not a CSS selector but an XPath expression that follows. Here it reaches the next sibling element relative to the current one, which is not possible via CSS.
One can also note that XPath allows:
- Finding an element based on the fact that it contains text, or based on the text it contains
- Reaching parent elements
- Reaching previous elements
- Reaching HTML comment nodes
XPath is also needed when XML namespaces are involved (as in most encountered RSS feeds, extended with the Dublin Core DTD).
3. Regular expressions: *_re
Regular expressions are a complex subject; some documentation is listed at the end of this page. If you have already worked with RegEx, here are some key points to keep in mind:
- patterns need to be delimited with known elements before and after what you want to extract: in "\\s(\\d+?) " there is a space (or a tab) before and a space after.
- you mainly need: \\d+? \\w+? \\s+? (to match: numbers, words, and any kind of spaces)
- then you’ll mostly use: () ()? (?:) (to extract the pattern between parentheses, with a ? after if the pattern might be missing, and with ?: inside at the beginning to avoid extracting this group, so that it gets no corresponding "$1" / "$2" in the replacement pattern).
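The three kinds of groups can be tried out with a short JavaScript sketch. The pattern and the sample strings below are made up for illustration; they do not come from an existing source definition.

```javascript
// Illustrative pattern, written as it would appear in a JSON string:
//  (?:...)  non-capturing group, it gets no "$n" number
//  (...)    capturing groups, numbered $1, $2, $3
//  (...)?   optional capturing group, $4 is empty when it is missing
const pattern = "(?:Published|Updated) (\\d+?) (\\w+?) (\\d{4})( \\d\\d:\\d\\d)?";
const replacement = "$3 $2 $1$4";

function reorderDate(text) {
  return text.replace(new RegExp(pattern), replacement);
}

const withTime = reorderDate("Published 12 March 2021 09:30");
const withoutTime = reorderDate("Updated 3 May 2020");
console.log(withTime);    // "2021 March 12 09:30"
console.log(withoutTime); // "2020 May 3"
```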
4. Images: r_img
The integration of images in Meta-Press.es results is possible via the fields: r_img_src, r_img_alt and r_img_title.
r_img is a shorthand that retrieves all the fields of an image at once, with a CSS or XPath selector, when it points at an <img … HTML tag with an src attribute (and optionally alt and title attributes). The values are then integrated directly, without any additional processing.
If it’s not the case (as for Euronews where the information is stored in other attributes like data-src, data-alt, data-title, or Die Presse where the information is stored in different HTML tags) it is possible to complete the definition of images with the r_img_src, r_img_alt and r_img_title fields, and even r_img_src_attr, r_img_alt_attr and r_img_title_attr.
For JSON sources with images (such as La Croix or Les Echos), r_img is useless and r_img_src is mandatory; it’s advised to add r_img_alt and r_img_title if the information is available.
It is also possible to use regular expressions on these fields with _re (e.g. El Mercurio (fotos)) or templates with _tpl (e.g. Les Echos).
5. Date formats
Meta-Press.es supports every date format accepted by new Date('date_string') and the English relative dates like 3 minutes ago, 8 hours ago or even today and yesterday.
5.1. Languages
For sources in other languages, the date has to be converted into one of the supported formats (it’s generally the ISO 8601 format, yyyy-mm-dd hh:mm:ss tz, that is used).
5.2. Multiple formats: r_dt_fmt_*
Then, as sources may use different date formats (based on the age of the results) you can specify multiple date formats named: r_dt_fmt_1, r_dt_fmt_2…
Those formats are RegEx replacement patterns, and they are tried one after another until a valid date comes out.
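The "try each format until a valid date comes out" logic can be sketched in JavaScript as below. The parseWithFormats helper and the two formats are hypothetical, and skipping a format whose pattern does not match is an assumption about how the attempts are chained.

```javascript
// Hypothetical sketch: try [pattern, replacement] pairs in order.
function parseWithFormats(rawDate, formats) {
  for (const [pattern, replacement] of formats) {
    const re = new RegExp(pattern);
    if (!re.test(rawDate)) continue; // assumption: non-matching formats are skipped
    const date = new Date(rawDate.replace(re, replacement));
    if (!Number.isNaN(date.getTime())) return date; // first valid date wins
  }
  return null;
}

// Illustrative equivalents of r_dt_fmt_1 and r_dt_fmt_2.
const formats = [
  ["^(\\d{2})/(\\d{2})/(\\d{4})$", "$3-$2-$1"],
  ["^(\\d{2})\\.(\\d{2})\\.(\\d{4}) (\\d{2}):(\\d{2})$", "$3-$2-$1T$4:$5:00Z"],
];

const recent = parseWithFormats("05.03.2021 14:30", formats);
const older = parseWithFormats("05/03/2021", formats);
console.log(recent.toISOString(), older.toISOString());
```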
5.3. TimeZones: tz
Otherwise, all the dates are normalized regarding their time zones by Meta-Press.es (function timezoned_date() in js/BOM_utils.js, relying on the toLocaleTimeString() function), using the "tz" entry of the "tags", if provided, when the information is not already included in the grabbed date format. A native JavaScript API would be welcome in this area.
5.4. Month name conversion
As shown in the CSS based source definition example, you can get a month name converted into its number by putting it between curly braces: "$3-{$2}-$1".
But if your date is written in English in a Japanese newspaper you’ll have to set a date_locale entry in the tags to get correct month name conversions.
A date_locale is used, for instance, by Arabnews.jp.
Corriere della Sera is also using one, with a special browser value to indicate that the newspaper is serving dates using the user’s browser locale.
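The effect of the curly braces can be illustrated with this JavaScript sketch. The applyDateFormat helper, the hard-coded French month list and the sample text are assumptions made for the example; how Meta-Press.es actually obtains the month names for a locale is not described here.

```javascript
// Assumption: month names of the source language, in calendar order.
const MONTHS_FR = [
  "janvier", "février", "mars", "avril", "mai", "juin",
  "juillet", "août", "septembre", "octobre", "novembre", "décembre",
];

// Hypothetical sketch: expand "{$n}" to a month number and "$n" to a group.
function applyDateFormat(rawDate, format, months) {
  const [pattern, replacement] = format;
  const match = rawDate.match(new RegExp(pattern));
  if (!match) return null;
  return replacement
    .replace(/\{\$(\d)\}/g, (_, n) => {
      const index = months.indexOf(match[Number(n)].toLowerCase());
      return String(index + 1).padStart(2, "0"); // "00" if the name is unknown
    })
    .replace(/\$(\d)/g, (_, n) => match[Number(n)]);
}

// Same pair as r_dt_fmt_1 in the Mediapart example.
const rDtFmt1 = ["\\s(\\d+?)[ermè]*? (.+?) (\\d{4})", "$3-{$2}-$1"];
const isoDate = applyDateFormat("Par Jane Doe 12 mars 2021", rDtFmt1, MONTHS_FR);
console.log(isoDate); // "2021-03-12"
```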
6. Tags
6.1. Mandatory tags
Here is an example of the mandatory tags to define for a source. Adapt the content of each tag to your new source:
"tags": {
"name": "Mediapart.fr",
"lang": "pt", (1)
"country": "br", (2)
"themes": [
"general",
"politics"
],
"tech": [ (3)
"one word",
"fast"
],
"src_type": [ (4)
"Press",
"Reference Press"
],
"res_type": [ (5)
"text",
"image"
]
…
(1) The two-letter code of the source language, following the ISO 639 standard. Here we pretend Mediapart is from Brazil to highlight the difference between lang and country.
(2) The two-letter code of the country, following the ISO 3166 standard. Here we pretend Mediapart is from Brazil to highlight the difference between lang and country.
(3) Technical tags mostly work in pairs.
(4) You can refer to the Meta-Press.es main search interface to find the list of the used source types.
(5) You can refer to the Meta-Press.es main search interface to find the list of the used result types.
6.2. Optional tags
…
"tz": "Europe/Paris", (1)
"charset": "gb2312", (2)
"date_locale": "en" (3)
}
(1) The timezone tz tag is only needed if the dates of the results have no timezone in them.
(2) The charset tag is only needed when the source is not serving its web pages in UTF-8.
(3) The date_locale tag is only needed if you have to get a month name converted into its number but the date is not written in the same language as the rest of the newspaper.
7. Pre-treatment on sources responses: raw_rep_re
Results might be sent in a hybrid or invalid format.
In this case it’s possible to specify a regex replacement pattern to extract useful data from the source server’s answer (the JSON included in a JSONP file, or an RSS file stripped of its invalid headers for instance).
This feature is currently only available for JSON and XML source types.
Several third party search services (like Queryly and Algolia) allow setting up a callback from the GET HTTP request: it’s the name of a JavaScript function that will be called with the data as argument. When this parameter is set, the results arrive as a JSONP script, so you can simply remove the parameter from the original request to receive clean JSON instead.
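For the JSONP case, a pre-treatment could look like the following JavaScript sketch. The rawRepRe pair and the sample payload are illustrative assumptions, not values taken from an existing source definition.

```javascript
// Illustrative raw_rep_re pair: keep what sits between "name(" and ");".
const rawRepRe = ["^[^(]*\\(([\\s\\S]*)\\);?\\s*$", "$1"];

// Hypothetical sketch: clean the raw answer before parsing it as JSON.
function pretreat(rawResponse, rePair) {
  const [pattern, replacement] = rePair;
  return rawResponse.replace(new RegExp(pattern), replacement);
}

const jsonp = 'callback({"hits": [{"title": "A"}]});';
const payload = JSON.parse(pretreat(jsonp, rawRepRe));
console.log(payload.hits[0].title); // "A"
```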
8. Gather multiple elements in a templated field: *_tpl
Sometimes the information you want to display in a result field is scattered across several places in the result page (or JSON). In this case you can list multiple elements to grab and a template string to stitch them together.
To do this, define a list of elements (JSON paths or CSS selectors…) for the specified field (such as r_txt), and add an r_txt_tpl entry defining a string where you can put replacement tokens like $1, $2… which will be replaced by the respective values of the elements of the list.
"r_txt": ["description", "city"],
"r_txt_tpl": "$1 ($2)"
Furthermore, you can define an r_txt_attr with a list of attributes to be retrieved.
Finally, if the last attribute name is missing in the list, the textContent of the last element will be retrieved instead.
Check arretsurimages.net or publicsenat.fr for examples.
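The stitching step can be sketched in JavaScript as follows. The fillTemplate helper and the sample result are hypothetical; only the $1, $2… token convention comes from the text above.

```javascript
// Hypothetical sketch: replace $1, $2... by the values of the element list.
function fillTemplate(template, values) {
  return template.replace(/\$(\d+)/g, (_, n) => values[Number(n) - 1] ?? "");
}

// Made-up JSON result and the two entries of the example above.
const result = { description: "Heavy rain expected", city: "Paris" };
const rTxt = ["description", "city"];
const rTxtTpl = "$1 ($2)";

const extract = fillTemplate(rTxtTpl, rTxt.map((key) => result[key]));
console.log(extract); // "Heavy rain expected (Paris)"
```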
9. Get permission for redirection URL: redir_url
A source may need to perform an HTTP redirection to actually serve results. If it’s possible to directly target the second URL, that remains the simplest way. But if it’s not possible, like with LeSoir.be, one will have to add a redir_url field in its source definition. Meta-Press.es will then ask for the Host Permission of this domain too (at search time).
10. Setup a preliminary request: token_url
Some sources need a token (generated with their regular search form page for instance) to serve results. Others need a preliminary request to set the language of the next response…
In those cases it’s possible to define a preliminary request that Meta-Press.es will perform before the regular search.
This request can be tuned as finely as regular search requests:
"token": {
"token_url": "https://www.…",
"token_method": "POST",
"token_ctype": "application/json",
"token_body": "{\"lang\": \"fr\"}"
},
A token_sel field also exists. It extracts an element from the page so that it can be injected into the search URL of the source via the "{T}" replacement token.
"token": {
"token_url": "https://www. … .com/search",
"token_sel": "body > script:nth-of-type(8)",
"token_sel_re": [
"\"searchToken\":\"([^\"]+)\"",
"$1"
]
},
"search_url": "https://www. … .com/search/api?qs={}&t={T}&sortBy=date",
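The token extraction and its injection can be pictured with this JavaScript sketch. The extractToken helper, the script content and the example.org URL are hypothetical stand-ins (the real domain is elided above); building the token from the matched groups is an assumption about how token_sel_re is applied.

```javascript
// Hypothetical sketch: build the token from the groups matched by token_sel_re.
function extractToken(pageText, tokenSelRe) {
  const [pattern, replacement] = tokenSelRe;
  const match = pageText.match(new RegExp(pattern));
  if (!match) return null;
  return replacement.replace(/\$(\d)/g, (_, n) => match[Number(n)]);
}

// Made-up content of the <script> element designated by token_sel.
const scriptText = 'window.config = {"searchToken":"abc123","lang":"fr"};';
const token = extractToken(scriptText, ["\"searchToken\":\"([^\"]+)\"", "$1"]);

// Hypothetical search_url with the same placeholders as the example above.
const searchUrl = "https://example.org/search/api?qs={}&t={T}&sortBy=date";
const finalUrl = searchUrl
  .replace("{T}", () => encodeURIComponent(token))
  .replace("{}", () => encodeURIComponent("europe"));
console.log(finalUrl);
```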
11. Declare the domain part: domain_part
If a source is using relative URLs in its href attributes, those URLs will be completed with a prefix containing the source domain. Unfortunately, if the correct path contains additional subfolders, you will have to specify which "domain_part" to use to complete relative URLs via a dedicated field in the source definition. It looks like this:
"domain_part": "http://china.dailynk.com/chinese",
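A possible reading of this completion step, as a JavaScript sketch: the completeUrl helper, the source key and the relative link are hypothetical, and plain concatenation of the prefix is an assumption about how the URLs are completed.

```javascript
// Hypothetical sketch: complete a relative href with domain_part when given,
// with the source key (its main URL) otherwise.
function completeUrl(href, sourceKey, domainPart) {
  if (/^https?:\/\//.test(href)) return href; // already absolute
  const prefix = (domainPart || sourceKey).replace(/\/$/, "");
  return prefix + (href.startsWith("/") ? "" : "/") + href;
}

const sourceKey = "http://china.dailynk.com"; // assumed key of the source
const domainPart = "http://china.dailynk.com/chinese";
const completed = completeUrl("/read.php?id=1", sourceKey, domainPart);
console.log(completed); // "http://china.dailynk.com/chinese/read.php?id=1"
```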
12. Filter results of approximative sources: filter_results
Some sources return approximate results (not containing exactly your search terms); they are tagged "approx".
Among those sources, some display (for each result) the portion of the text that matches your terms (or their approximate version). So, for those sources, we can verify whether the results really match your search terms and keep only the good ones. To trigger this behavior, add the following entry in your source definition:
"filter_results": true,
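The kind of check this enables can be sketched in JavaScript as below. The filterResults helper, the whole-word matching rule and the sample results are assumptions for illustration; the exact comparison performed by Meta-Press.es is not described here.

```javascript
// Escape the characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical sketch: keep the results whose title or extract contains
// every search term as a whole word.
function filterResults(results, terms) {
  const words = terms.split(/\s+/).filter(Boolean);
  return results.filter((result) => {
    const text = `${result.r_h1} ${result.r_txt}`;
    return words.every((word) =>
      new RegExp("\\b" + escapeRegExp(word) + "\\b", "i").test(text)
    );
  });
}

const approxResults = [
  { r_h1: "Biden visits Paris", r_txt: "The president arrived on Monday." },
  { r_h1: "A new bid for the port", r_txt: "The bid was filed on Monday." },
];
const kept = filterResults(approxResults, "bid");
console.log(kept.map((result) => result.r_h1)); // only the second headline
```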
13. Help
If you still have questions about how to add sources to Meta-Press.es after you have read all this documentation, you can ask us:
- in an "issue" of the Framagit code repository of Meta-Press.es;
- via IRC: #meta-press.es@geeknode.org (if possible during office hours);
- or via e-mail: contact /\ meta-press.es
14. External doc about CSS, RegEx and XPath
14.1. JSON
JSON syntax at Mozilla Developer Network and json.org: just keep in mind that only double quotes are allowed, and no trailing commas
14.2. CSS selectors
Mozilla Developer Network about CSS selectors
More documentation on CSS selectors from medium.com
14.3. Regular Expressions
[1] innerHTML for instance, and parsing the HTML comments