Fixing Elasticsearch/Logstash/ELK's DATESTAMP grok pattern
Elasticsearch, including Logstash and pipeline processors, loves to use grok patterns. These are basically named regex patterns, allowing the complexity to be hidden behind easier-to-read labels (though they do require referring to the source). Regexes are great, with the "now you have two problems" caveat of any sufficiently advanced technology. (It could be worse: you could have 100 problems.)
What's the problem? Timestamp support. Such a trivial issue has been a problem for a long time. I'll show a fix (rather, a workaround) below.
The problem: year-first timestamps
Many datestamps in logs are in a year-first format (e.g., 2020-01-01). That makes sense, as many operating systems and languages default to ISO 8601 for a human-readable datetime format. For instance, here's a recent example from my system's dpkg.log:
2020-05-21 06:01:01 upgrade tzdata:all 2019c-3ubuntu1 2020a-0ubuntu0.20.04
Or from a Mac's log:
2020-05-24 20:04:25-07 ted-macbook-pro softwareupdated[753]: Removing client SUUpdateServiceClient pid=5347, uid=0, installAuth=NO rights=(), transactions=0 (/usr/sbin/softwareupdate)
Or from Octoprint's python-based logs:
2020-05-25 03:26:36,643 - octoprint.server.heartbeat - INFO - Server heartbeat <3
There are other common variants, like a 'T' delimiting the date and time sections, time zones, or other delimiters. I will also be addressing this format, found in the grok comment:
2020/05/25-16:02:11.5533
I'm less concerned about time zones, since real computers run on UTC.
What's wrong with these? None of them are supported in vanilla grok in any version.
Digging into the problem
Let's look at logstash in 2015. Or logstash in 2020. Or elasticsearch in 2020.
They define US-format and EU-format dates:
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
And in a comment, they suggest that the datestamp will be accepted with slashes:
# datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
Which implies that DATESTAMP would support it:
DATE %{DATE_US}|%{DATE_EU}
DATESTAMP %{DATE}[- ]%{TIME}
But look back at the US/EU formats: there is no year-first format. Sometimes you'll see a weird match, like "20 April 2001", but grok is just seeing 2020/04/01, slicing off the first two digits, and parsing the remaining 20/04/01 as a day-first string. Weird, huh? This explains some of the weird indexes you might find in an ELK stack, where there's something like logstash-2001.04.01 and it's almost 20 years later.
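To see the mis-parse concretely, here is a minimal Python sketch that expands the grok references into a plain regex and runs it against a year-first date. The sub-pattern definitions for MONTHNUM, MONTHDAY and YEAR are my approximations of the stock ones (the real YEAR uses an atomic group), the expand helper is hypothetical, and Python's re module is only standing in for the regex engine grok actually uses.

```python
import re

# Approximations of the stock grok definitions (assumed, not copied verbatim).
PATTERNS = {
    "MONTHNUM": r"(?:0?[1-9]|1[0-2])",
    "MONTHDAY": r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])",
    "YEAR": r"(?:\d\d){1,2}",
    "DATE_US": r"%{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}",
    "DATE_EU": r"%{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}",
    "DATE": r"%{DATE_US}|%{DATE_EU}",
}


def expand(pattern, patterns=PATTERNS):
    """Recursively replace %{NAME} references with their regex bodies."""
    def substitute(match):
        return "(?:" + expand(patterns[match.group(1)], patterns) + ")"
    return re.sub(r"%\{(\w+)\}", substitute, pattern)


# Grok patterns are unanchored, so search() is the closest equivalent.
vanilla_match = re.search(expand("%{DATE}"), "2020/04/01")
print(vanilla_match.group(0))  # 20/04/01
```

Neither alternative can start at the first character, so the engine slides forward until DATE_EU fits, and the leading "20" is dropped.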
Anyhow, on to...
The easy fix
If you have the luxury of redefining DATE, you can prepend a sane format to it:
DATE_YEARFIRST %{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}
DATE %{DATE_YEARFIRST}|%{DATE_US}|%{DATE_EU}
Why prepend? That keeps it from accidentally matching from the second 20 of 2020, as above.
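As a rough check of the redefined DATE, the sketch below expands it into a Python regex and tries a few year-first dates. As before, the MONTHNUM, MONTHDAY and YEAR bodies are my approximations of the stock definitions, the expand helper is hypothetical, and Python's re module is a stand-in for grok's own engine.

```python
import re

# Approximate stock definitions plus the redefined DATE from above.
PATTERNS = {
    "MONTHNUM": r"(?:0?[1-9]|1[0-2])",
    "MONTHDAY": r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])",
    "YEAR": r"(?:\d\d){1,2}",
    "DATE_US": r"%{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}",
    "DATE_EU": r"%{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}",
    "DATE_YEARFIRST": r"%{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}",
    "DATE": r"%{DATE_YEARFIRST}|%{DATE_US}|%{DATE_EU}",
}


def expand(pattern, patterns=PATTERNS):
    """Recursively replace %{NAME} references with their regex bodies."""
    def substitute(match):
        return "(?:" + expand(patterns[match.group(1)], patterns) + ")"
    return re.sub(r"%\{(\w+)\}", substitute, pattern)


date_regex = re.compile(expand("%{DATE}"))

# Collect what the redefined DATE captures for each delimiter style.
results = {
    text: date_regex.search(text).group(0)
    for text in ("2020/04/01", "2020-05-21", "2020 05 25")
}
for text, matched in results.items():
    print(f"{text!r} -> {matched!r}")
```

Each sample should come back whole, with the four-digit year intact.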
The hard fix
But if you don't have the luxury of updating your grok patterns, you'll have to whip up a custom one:
(?<my_datetime_field>%{YEAR}/%{MONTHNUM}/%{MONTHDAY}[T\s\-]+%{TIME})
You'll notice I've simplified the delimiters from above. It's just less readable to accept them all, but you might need to do so:
(?<my_datetime_field>%{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}[T\s\-]+%{TIME})
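Here is a rough Python sketch that runs the all-delimiters pattern against the sample log lines from earlier. The HOUR, MINUTE, SECOND and TIME bodies are my approximations of the stock grok definitions, and the grok_to_regex helper is hypothetical. It also rewrites (?<name>...) into Python's (?P<name>...) spelling, since Python's re module is only a stand-in for grok's engine.

```python
import re

# Approximations of the stock grok definitions (assumed, not copied verbatim).
PATTERNS = {
    "YEAR": r"(?:\d\d){1,2}",
    "MONTHNUM": r"(?:0?[1-9]|1[0-2])",
    "MONTHDAY": r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])",
    "HOUR": r"(?:2[0123]|[01]?[0-9])",
    "MINUTE": r"(?:[0-5][0-9])",
    "SECOND": r"(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)",
    "TIME": r"%{HOUR}:%{MINUTE}:%{SECOND}(?![0-9])",
}

CUSTOM = (
    r"(?<my_datetime_field>%{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}"
    r"[T\s\-]+%{TIME})"
)


def grok_to_regex(pattern):
    """Expand %{NAME} references, then convert named groups to Python syntax."""
    expanded = re.sub(
        r"%\{(\w+)\}",
        lambda m: "(?:" + grok_to_regex(PATTERNS[m.group(1)]) + ")",
        pattern,
    )
    return re.sub(r"\(\?<(\w+)>", r"(?P<\1>", expanded)


custom_regex = re.compile(grok_to_regex(CUSTOM))


def extract(line):
    """Return the captured timestamp text, or None if nothing matched."""
    found = custom_regex.search(line)
    return found.group("my_datetime_field") if found else None


samples = [
    "2020-05-21 06:01:01 upgrade tzdata:all 2019c-3ubuntu1 2020a-0ubuntu0.20.04",
    "2020-05-24 20:04:25-07 ted-macbook-pro softwareupdated[753]: Removing client",
    "2020-05-25 03:26:36,643 - octoprint.server.heartbeat - INFO - Server heartbeat <3",
    "2020/05/25-16:02:11.5533",
]
for line in samples:
    print(extract(line))
```

Note that the Mac sample's trailing -07 offset is not captured, which fits with not worrying about time zones here.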
So, there you go. That can easily be stuffed into Logstash, or a processor. Hooray!
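For the processor route, a minimal sketch of an ingest pipeline body with a grok processor might look like the following. The "message" field name and the description text are assumptions for illustration, and nothing is sent to a cluster here. Building the body in Python and serializing it with json.dumps takes care of doubling the backslashes, which is easy to get wrong when writing the JSON by hand.

```python
import json

# The custom pattern from above, kept as a raw string so the backslashes
# survive untouched until json.dumps escapes them.
CUSTOM_TIMESTAMP = (
    r"(?<my_datetime_field>%{YEAR}[\/\-\s]%{MONTHNUM}[\/\-\s]%{MONTHDAY}"
    r"[T\s\-]+%{TIME})"
)

# Hypothetical pipeline body; "message" is an assumed source field name.
pipeline_body = {
    "description": "Extract a year-first timestamp (illustrative sketch)",
    "processors": [
        {"grok": {"field": "message", "patterns": [CUSTOM_TIMESTAMP]}}
    ],
}

request_json = json.dumps(pipeline_body, indent=2)
print(request_json)
```

The printed JSON is what you would submit as the pipeline definition. The extracted field is still a string at that point, so a date-parsing step would be needed afterwards if you want a real timestamp.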
Caveats
Now, there is an ISO 8601-style year-first pattern defined, which is great if it matches your format. It didn't match enough of my variants:
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{ISO8601_HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
There was also a year-first version added to logstash in 2016, then the day and month were flipped, then it was removed or never made it to master. Hilariously, it was typoed as 8061 the whole time. It also didn't exist in elasticsearch, only logstash. It doesn't help that the Elasticsearch version of the file was moved in 2016, then also moved in 2018, which took away the easy-to-view commit history.
Why not fix it?
Here's an issue from 2015. Here's another one. Here's a PR from 2016. The issue isn't submitting a PR, obviously. Don't get me started on discuss.elastic.co, which auto-closes and locks discussions, meaning you're pretty much guaranteed to find an out-of-date, inferior solution. If any at all.