Dremel - repetition and definition level

Question

Reading Interactive Analysis of Web-Scale Datasets paper, I bumped into the concept of repetition and definition level.
while I understand the need for these two, to be able to disambiguate occurrences, it attaches a repetition and definition level to each value.

What is unclear to me is how they computed the levels...

As illustrated in picture:

It says:

Consider ﬁeld Code in Figure 2. It occurs three times in r1. Occurrences ‘en-us’ and ‘en’ are inside the ﬁrst Name, while ’en-gb’ is in the third Name. To disambiguate these occurrences, we attach a repetition level to each value. It tells us at what repeated ﬁeld in the ﬁeld’s path the value has repeated.

The ﬁeld path Name.Language.Code contains two repeated ﬁelds, Name and Language. Hence, the repetition level of Code ranges between 0 and 2; level 0 denotes the start of a new record. Now suppose we are scanning record r1 top down. When we encounter ‘en-us’, we have not seen any repeated ﬁelds, i.e., the repetition level is 0. When we see ‘en’, ﬁeld Language has repeated, so the repetitionlevelis2.

I just can't get me head around it, Name.Language.Code in r1 has en-us and en values. While is the first one r = 0 and the second one r = 2 is it because two definitions were repeated ? (language and code) ?

If it was:

Name
    Language
       Code: en-us
Name 
    Language
        Code: en
Name
    Language
        Code: en-gb

Would it be ?

0 2
1 2
2 2

Deﬁnition levels. Each value of a ﬁeld with path p, esp. every NULL, has a deﬁnition level specifying how many ﬁelds in p that could be undeﬁned (because they are optional or repeated) are actually present in record.

Why is then the definition level is 2 ? Isn't the path Name.Language contain two fields Code and Country where only 1 is optional\repeated ?

There is another great explanation of repetition and definition levels at blog.twitter.com/2013/dremel-made-simple-with-parquet, it may be worth reading as well. — Zoltan, Apr 24, 2017 at 19:57
@Zoltan yes I've already read it. So some sentences were a bit unclear, in general it was a great explanation. I'm surprised there was no dremel tag before this question... — Tony, Apr 24, 2017 at 20:02
here is one more description, and no the algorithm is not trivial by any mean: github.com/julienledem/redelm/wiki/… — Adelin, Aug 22, 2019 at 12:50

Peter O. · Accepted Answer · 2019-03-19 22:39:48Z

The Dremel striping algorithm is by no means trivial.

To answer your first question:

The repetition level of en-us is 0 since it is the first occurrence of a name.language.code path within the record.
The repetition level of en is 2, since the repetition occurred at level 2 (the language tag).

To answer your second question, for the following record,

DocId: 20
Name
  Language
    Code: en-us
Name 
  Language
    Code: en
Name
  Language
    Code: en-gb

the entries for name.language.code would be

en-us 0 2
en    1 2
en-gb 1 2

Explanation:

The definition level is always two, since the two optional tags name and language are present.
The repetition level for en-us is zero, since it is the first name.language.code within the record.
The repetition level for en and en-gb is 1, since the repetition occurred at the name tag (level 1).

The question and the answer are excellent, can you give some more random example so it becomes even clearer. — Adelin, Aug 22, 2019 at 12:32

Peter Boone · Accepted Answer · 2022-02-15 20:17:06Z

Adelin asked for some more random examples to make things clearer. To expand on user152468's answer:

DocId: 20
Name
  Language
    Code: en-us
Name 
  Language
    Code: en
Name
  Language
    Code: en-gb
  Language
    Code: zh
Name
  url: 'https://A'

Would be:

Code	Repetition	Definition
en-us	0	2
en	1	2
en-gb	1	2
zh	2	2
NULL	1	1

For zh, the definition level is 2 like the first three codes, but its repetition level is 2. The repetition level is not the number of times an item is repeated, it is the level in the tree that is repeated. The repetition level is 2 because Name.Language is repeated (which is a depth of 2). en and en-gb have repetition levels of 1 because Name is being repeated (not Name.Language).

The last NULL in the table refers to the missing Name.Language.Code for the https://A Name. The repetition level is 1 since Name is the repeated level. The definition level is 1 because there is only 1 optional level that is defined (Name).

Definition Levels

When trying to understand definition levels, it's important to know if the fields are required or optional (nullable or repeated). In the Dremel paper example, Code is required and Country is nullable. I struggled with this example for a while:

DocId: 20
Name
  Language
    Code: en-us
    Country: us

Code	Repetition	Definition
en-us	0	2

Country	Repetition	Definition
us	0	3

The definition level of en-us is 2 because there are 2 optional fields that are defined: Name and Language. The code itself is required, so it isn't counted for the definition level.

The definition level of us is 3 because there are 3 optional fields in the path: Name, Language, and Country.

So to answer the original question:

Why is then the definition level is 2? Isn't the path Name.Language contain two fields Code and Country where only 1 is optional\repeated?

The definition level doesn't convey how many optional fields are in the Language record. It conveys how many optional fields are defined in the path. It seems unnecessary when everything is defined, but it's useful when there are NULL values.

If the language record had more optional fields (Country, Direction, and Script):

DocId: 20
Name
  Language
    Code (required): en-us
    Country (nullable): us
    Direction (nullable): right_to_left
    Script (nullable): Latin

The definition levels of Code and Country wouldn't change:

Path	Value	Definition
`Name.Language.Code`	en-us	2
`Name.Language.Country`	us	3
`Name.Language.Direction`	right_to_left	3
`Name.Language.Script`	Latin	3

grey bear · Accepted Answer · 2023-05-22 15:28:00Z

0

According to my understanding, there is one confusing but important detail:

The repetition level is NOT the sequential index of the repeated field within the path, but the index of the field among the repeatable fields within the path.

Obviously, the repeated field is Name in your example, and that makes the level 1 for those not been the first field.

The real challenge is how to designate the repeated Links.Forwards. The sequential index of the repeated field, Forwards, is 2, but since field Links is not repeatable, the repetition is 1, not 2.

This page helps a lot as some comments already mentioned.

edited May 22 at 15:28

answered May 22 at 15:27

grey bear

778 bronze badges

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
– Community Bot
May 22 at 19:52

Add a comment |

Collectives™ on Stack Overflow

Dremel - repetition and definition level

3 Answers 3

Definition Levels

Your Answer

Not the answer you're looking for? Browse other questions tagged
algorithm
data-structures
dataset
parquet
dremel
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Definition Levels

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged algorithmdata-structuresdatasetparquetdremel or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
algorithm
data-structures
dataset
parquet
dremel
or ask your own question.