Finding Continuous Data Within Large Datasets

Finding continuous data within a large dataset can be tricky. Doing so efficiently is often an even more challenging feat to pull off. So today, we’ll be looking at what I believe is one of the most practical and creative uses of SQL, the Tabibito-san Method. Originally published and named in 2009 on the Oracle forums by Aketi Jyuuzou this method has seen widespread use since, jumping from Oracle to every other database system. For today we’ll be using PostgreSQL.

Let’s start by taking a look at the Tabibito-san query below:

 WITH data AS (
  SELECT UNNEST(ARRAY [1, 2, 3, 7, 8, 9, 11, 12, 13, 14, 15]) AS continuous
  )
  SELECT
    continuous,
    ROW_NUMBER() OVER (ORDER BY continuous) AS row_num,
    continuous - ROW_NUMBER() OVER (ORDER BY continuous) AS distance
  FROM data;

Our continuous column shows that our sample data jumps slightly but has consecutive numbers at 1-3, 7-9 and 11-15. Our row_num column does exactly what you’d expect and counts up continuously. Lastly, we have our distance column, that through some magic, has the same values for each group of consecutive numbers. So what exactly is happening there? The most straightforward answer, and the reason this method is called “Tabibito-san”, which roughly translates to “The Traveler” or “The Wanderer,” is that it calculates based on distance traveled. Whenever the distance traveled for each is in lockstep, i.e., both numbers increase by one, then the distance column will output the same value. Inversely, a new distance value will be output whenever the continuous column and the row_num column increase at different rates.

Let’s take this one step further with another query:

WITH data AS (
  SELECT UNNEST(ARRAY [1, 2, 3, 7, 8, 9, 11, 12, 13, 14, 15]) AS continuous
  ),
  intermediate AS (
  SELECT
    continuous,
    ROW_NUMBER() OVER (ORDER BY continuous) AS row_num,
    continuous - ROW_NUMBER() OVER (ORDER BY continuous) AS distance
  FROM data
  )
  SELECT
    MIN(continuous) AS first_continuous,
    MAX(continuous) AS last_continuous
  FROM intermediate
  GROUP BY distance
  ORDER BY first_continuous;

Now that’s something a little more helpful! We’ve picked out the start and end values of each continuous grouping without resorting to recursion, window functions, or using lead or lag functions, making this an incredibly efficient method.

Finding continuous numbers is all well and good but not a particularly commonplace problem; lucky for us, this same method can also be extended to handle continuous dates, which is more beneficial. Let’s load up some more exciting sample data and take a look.

In this sample data, we’ve got a table called, you guessed it, sample_data, which consists of an id, a date, some data points (red, green, blue, and nir), and a related_id.

Let’s try another query:

WITH intermediate AS (
  SELECT
    related_id,
    date,
    red,
    date - (CAST(ROW_NUMBER() OVER (PARTITION BY related_id ORDER BY date) AS int)) AS distance
  FROM sample_data
  WHERE red > 500
  )
  SELECT
    related_id
    MIN(date) as first_continuous,
    MAX(date) as last_continuous
  FROM intermediate
  GROUP BY related_id, distance
  ORDER BY related_id, distance;

As you can see with this one, Tabibito-san can easily be expanded to handle dates. The only real caveat here is that ROW_NUMBER() is a BigInt, and to add or subtract from dates; you have to cast it to a normal int. Beyond that, we’re also now grouping by our related_id and filtering for rows where our red value is over 500 so that we end up with the result that shows us all time periods for each related_id where the red value is over our threshold.

With this more real-world example, I hope you, too, enjoy the power, simplicity, and elegance of the Tabibito-san method for finding continuous data!

‍

Home

About Us

Our Work

Solutions

Architecture, Engineering, and Construction

Energy

Agriculture

Insights

Careers

Contact Us

Finding Continuous Data Within Large Datasets

More from Lofty

Generating PDFs using Python and Chrome's Headless mode

Generating PDFs using Python and Chrome's Headless mode

Bespoke technology unlocks AEC's key challenges

Bespoke technology unlocks AEC's key challenges

Lofty Ranks #2014 on the 2023 Inc. 5000 List of America's Fastest Growing Companies

Lofty Ranks #2014 on the 2023 Inc. 5000 List of America's Fastest Growing Companies