Back in December 2013, I was working on a prototype of what would become Meta-Press.es when I stumbled upon a difficulty: international date parsing.
JavaScript accepts a range of English date formats when you create a new Date(), but when your input contains French dates you have to choose: either convert those dates into English or ISO format yourself, or install a heavy [1] library based on human-contributed translations of formats (DateJS, Moment.js).
As a quick fix, I piled up 12 regular expressions and had my answer for French month names. On second thought, English and French month names are nearly the same, so it was easy to improve the regular expressions to also match English names, and even the 20 Latin-based languages of the Princeton University Cataloging Documentation.
As the regular expressions grew longer, I decided to organize them into an index in order to shorten the average lookup time. For instance, "January", "June" and "July" all start with a "j": if my month name doesn’t start with this letter, a single preliminary test lets me skip three others. We can also group "March" and "May", "April" and "August"…
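The first-letter index described above can be sketched as follows. This is a minimal illustration restricted to English names, not the original code: the `monthsByFirstLetter` table and the `englishMonthNumber` function are hypothetical names chosen for this example.

```javascript
// Hypothetical sketch: patterns grouped by first letter, English names only.
// Each entry pairs a regular expression with the month number it stands for.
const monthsByFirstLetter = {
  j: [[/^jan/, 1], [/^jun/, 6], [/^jul/, 7]],
  f: [[/^feb/, 2]],
  m: [[/^mar/, 3], [/^may/, 5]],
  a: [[/^apr/, 4], [/^aug/, 8]],
  s: [[/^sep/, 9]],
  o: [[/^oct/, 10]],
  n: [[/^nov/, 11]],
  d: [[/^dec/, 12]],
};

function englishMonthNumber(name) {
  const lower = name.trim().toLowerCase();
  // One lookup on the first letter selects a short list of candidates,
  // so only the patterns sharing that letter are ever tested.
  const candidates = monthsByFirstLetter[lower[0]] || [];
  for (const [pattern, number] of candidates) {
    if (pattern.test(lower)) return number;
  }
  return -1; // assumption: -1 signals an unknown month name
}
```

With this layout a name starting with "j" costs at most three tests, and a name starting with "o" costs one, instead of up to twelve for a flat list.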
The following week I decided to continue this worldwide journey with a better map: the Wiktionary’s Appendix:Months_of_the_year. I reached 50 languages by adding a first layer of regular expressions that probe the Unicode character range of the given month name, to distinguish Latin-script month names from Cyrillic ones, Asian ones…
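Probing the Unicode range of a month name could look like the sketch below. The `scriptProbes` list and the `detectScript` function are hypothetical, and the ranges are a small illustrative subset of standard Unicode blocks rather than the set used in the original file.

```javascript
// Hypothetical sketch: guess the script of a month name from Unicode ranges.
// Non-Latin scripts are probed first, Latin acts as the last candidate.
const scriptProbes = [
  ['cyrillic', /[\u0400-\u04FF]/], // Cyrillic block
  ['greek', /[\u0370-\u03FF]/],    // Greek and Coptic block
  ['arabic', /[\u0600-\u06FF]/],   // Arabic block
  ['cjk', /[\u4E00-\u9FFF]/],      // CJK Unified Ideographs block
  ['latin', /[A-Za-z\u00C0-\u024F]/], // basic Latin letters plus accented ones
];

function detectScript(name) {
  for (const [script, probe] of scriptProbes) {
    if (probe.test(name)) return script;
  }
  return 'unknown';
}
```

Once the script is known, only the regular expressions written for that script need to be tried.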
Then I put the file somewhere online, and the project slept for 6 years, waiting for Meta-Press.es to advance. It seems that nobody found the file or decided to use it, as no one contacted me about it. I didn’t want to advertise it too much because it was spaghetti JavaScript code, hard to convert to other languages.
Still steeped in the French translation of the Eloquent JavaScript book by Marijn Haverbeke, I knew I would have to separate data from code to clean up the program.
Below the surface of the machine, the program moves. Without effort, it expands and contracts. In great harmony, electrons scatter and regroup. The forms on the monitor are but ripples on the water. The essence stays invisibly below.
When the creators built the machine, they put in the processor and the memory. From these arise the two aspects of the program.
The aspect of the processor is the active substance. It is called Control. The aspect of the memory is the passive substance. It is called Data.
— The Two Aspects, The Book of Programming, EloquentJavaScript.net, Marijn Haverbeke
Here the regular expressions were the data, and my pile of if statements was the not-so-interesting code. Unfortunately the problem was no longer a priority, and for a while the idea stayed in my head, waiting to hatch.
So month_nb waited for its next opportunity to be used, and the resumption of Meta-Press.es development became that opportunity. I took care to unravel the spaghetti strand by strand. For instance, some Cyrillic-script languages have the same month name shapes as Latin-based languages, but not all of them, so the Cyrillic handling was spread over different styles, with some pieces of logic in the middle to dispatch from one to another. Likewise, the Chinese month names are Chinese numbers and could be computed instead of matched against regular expressions…
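To illustrate the remark about Chinese month names, here is a minimal sketch that computes the number from the numerals instead of listing twelve patterns. It assumes the plain numeral-plus-月 forms (一月 through 十二月); the `chineseDigits` table and the `chineseMonthNumber` function are hypothetical and not taken from month_nb.

```javascript
// Hypothetical sketch: derive the month number from Chinese numerals.
const chineseDigits = {
  '一': 1, '二': 2, '三': 3, '四': 4, '五': 5,
  '六': 6, '七': 7, '八': 8, '九': 9,
};

function chineseMonthNumber(name) {
  // Optional 十 (ten), optional unit digit, then 月 (month).
  const match = /^(十?)([一二三四五六七八九]?)月$/.exec(name.trim());
  if (!match || (!match[1] && !match[2])) return -1; // no numeral at all
  const tens = match[1] ? 10 : 0;
  const units = match[2] ? chineseDigits[match[2]] : 0;
  const number = tens + units;
  // Reject values such as 十三月, which the pattern accepts but no calendar month has.
  return number >= 1 && number <= 12 ? number : -1;
}
```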
Finally I got a simple JSON tree data structure, with each key being a regular expression, leading either to a sub-object (a tree branch) or a month number (a tree leaf). The code became a simple tree walking loop. 75% of what was the code became reusable JSON data.
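A tiny version of such a tree and its walking loop might look like this. It is only a sketch of the idea: the `monthTree` excerpt covers a handful of months, its patterns are written for this example, and `walkMonthTree` is a hypothetical name, not the actual month_nb implementation.

```javascript
// Hypothetical excerpt of a regex-keyed tree: each key is a regular expression
// source, each value is either a sub-object (branch) or a month number (leaf).
const monthTree = {
  '^[a-zé]': {
    '^j': { '^jan': 1, '^jui?n': 6, '^jui?l': 7 },
    '^m': { '^mar': 3, '^ma[iy]': 5 },
  },
  '^[\\u0400-\\u04FF]': {
    '^янв': 1,
    '^июн': 6,
    '^июл': 7,
  },
};

function walkMonthTree(name, tree = monthTree) {
  const lower = name.trim().toLowerCase();
  let node = tree;
  // Descend while the current node is a branch; stop at a numeric leaf.
  while (typeof node === 'object') {
    const key = Object.keys(node).find((k) => new RegExp(k).test(lower));
    if (key === undefined) return -1; // assumption: -1 means no pattern matched
    node = node[key];
  }
  return node;
}
```

Because the tree contains only strings and numbers, it can be stored as plain JSON and walked by an equally short loop in any other language with a regular expression engine.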
At last, it looks like what I wanted it to be. It supports 69 languages, and you don’t have to know the language of your month name to get its number.
The next step will be to update my copy of the Wiktionary’s Appendix:Months_of_the_year page and bring the data structure in line with it. There are 14 more languages in it nowadays!
The program also deserves proper Node.js packaging, but maybe someone could contribute it?
Still, the month_nb function is now available and can be used to convert dates when parsing new newspapers for Meta-Press.es!