 # Finding Continuous Data Within Large Datasets

Alex Smith
Software Development

Finding continuous data within a large dataset can be tricky. Doing so efficiently is often an even more challenging feat to pull off. So today, we’ll be looking at what I believe is one of the most practical and creative uses of SQL, the Tabibito-san Method. Originally published and named in 2009 on the Oracle forums by Aketi Jyuuzou this method has seen widespread use since, jumping from Oracle to every other database system. For today we’ll be using PostgreSQL.

Let’s start by taking a look at the Tabibito-san query below:

`````` WITH data AS (
SELECT UNNEST(ARRAY [1, 2, 3, 7, 8, 9, 11, 12, 13, 14, 15]) AS continuous
)
SELECT
continuous,
ROW_NUMBER() OVER (ORDER BY continuous) AS row_num,
continuous - ROW_NUMBER() OVER (ORDER BY continuous) AS distance
FROM data;``````

Our continuous column shows that our sample data jumps slightly but has consecutive numbers at 1-3, 7-9 and 11-15. Our row_num column does exactly what you’d expect and counts up continuously. Lastly, we have our distance column, that through some magic, has the same values for each group of consecutive numbers. So what exactly is happening there? The most straightforward answer, and the reason this method is called “Tabibito-san”, which roughly translates to “The Traveler” or “The Wanderer,” is that it calculates based on distance traveled. Whenever the distance traveled for each is in lockstep, i.e., both numbers increase by one, then the distance column will output the same value. Inversely, a new distance value will be output whenever the continuous column and the row_num column increase at different rates.

Let’s take this one step further with another query:

``````WITH data AS (
SELECT UNNEST(ARRAY [1, 2, 3, 7, 8, 9, 11, 12, 13, 14, 15]) AS continuous
),
intermediate AS (
SELECT
continuous,
ROW_NUMBER() OVER (ORDER BY continuous) AS row_num,
continuous - ROW_NUMBER() OVER (ORDER BY continuous) AS distance
FROM data
)
SELECT
MIN(continuous) AS first_continuous,
MAX(continuous) AS last_continuous
FROM intermediate
GROUP BY distance
ORDER BY first_continuous;``````

Now that’s something a little more helpful! We’ve picked out the start and end values of each continuous grouping without resorting to recursion, window functions, or using lead or lag functions, making this an incredibly efficient method.

Finding continuous numbers is all well and good but not a particularly commonplace problem; lucky for us, this same method can also be extended to handle continuous dates, which is more beneficial. Let’s load up some more exciting sample data and take a look.

In this sample data, we’ve got a table called, you guessed it, sample_data, which consists of an id, a date, some data points (red, green, blue, and nir), and a related_id.

Let’s try another query:

``````WITH intermediate AS (
SELECT
related_id,
date,
red,
date - (CAST(ROW_NUMBER() OVER (PARTITION BY related_id ORDER BY date) AS int)) AS distance
FROM sample_data
WHERE red > 500
)
SELECT
related_id
MIN(date) as first_continuous,
MAX(date) as last_continuous
FROM intermediate
GROUP BY related_id, distance
ORDER BY related_id, distance;``````

As you can see with this one, Tabibito-san can easily be expanded to handle dates. The only real caveat here is that ROW_NUMBER() is a BigInt, and to add or subtract from dates; you have to cast it to a normal int. Beyond that, we’re also now grouping by our related_id and filtering for rows where our red value is over 500 so that we end up with the result that shows us all time periods for each related_id where the red value is over our threshold.

With this more real-world example, I hope you, too, enjoy the power, simplicity, and elegance of the Tabibito-san method for finding continuous data!