Open Data Ukulele
Learning data of 1000 anonymized ukulele users

Open-access dataset for music education research by Yousician.

The dataset consists of the learning data of 1000 ukulele users (anonymised), covering the first 30 days after their signup to the Yousician app.

For every musical note or chord within every song that the users play, we include detailed information on what the user was supposed to play, whether the note or chord was played correctly, and how accurate the timing was. In the case of note pitch mistakes, we include information on which note was played instead, or whether no note was played at all.

The dataset can be used, for example, to study what kinds of mistakes learners make, what learning difficulties they encounter, how their performance improves over time, and how different practising habits affect learning efficiency.

  • Over 500K played song exercises, adding up to over 10M evaluated instances of notes, double stops, or chords
  • Anonymised user ids and time data are provided, to explore approaches for personalised, adaptive instruction
  • The learning path of each user can be ordered sequentially, to study or model instrumental learning and forgetting over time
  • Over 100 different chord voicings
  • Over 100 different double stop voicings
  • Melodic (single note) instances covering roughly 2 octaves (C4-B5)
  • Songs comprising a wide variety of styles, and difficulty levels

Ukulele learning

Yousician supports different notation views, including tablature and standard notation. Users get instant feedback on pitch and timing. In play mode, the more accurate the user’s playing, the more points they earn. The user can switch to practise mode, where the tempo can be slowed down to learn the trickiest parts of a song.

About the Sample

  • First 30 days of activity for each user
  • Users sampled from the 2019 summer months
  • All personal information has been anonymised
  • All song information has been anonymised

Users in this sample have played or practised with the app for an average of roughly 10 hours. Most users started at level 0 and progressed to successfully playing songs at levels 3-5, meaning they have made their way from absolute beginners to mid-level beginners.

About The Dataset

The dataset is provided as a JSON array of objects, e.g.

[{"key_1":"v_1", ... , "key_n":"v_n"}, ..., {"key_1":"v_1", ... , "key_n":"v_n"}]

Each object represents one attempt by a user to play an exercise part (e.g. the intro or a verse). An exercise is an arrangement of a song for one of the supported instruments (only ukulele in this case), at a particular difficulty level.
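A minimal Python sketch of this structure, using a two-record stand-in for the real file (loading the downloaded file with json.load works the same way; the id values below are illustrative, not real dataset values):

```python
import json

# A two-record miniature of the dataset's structure; the ids shown here
# are illustrative stand-ins, not real dataset values.
raw = '[{"user_id": "dXNlcjAx", "exercise_id": "ZXhfMDAx"}, {"user_id": "dXNlcjAy", "exercise_id": "ZXhfMDAy"}]'
attempts = json.loads(raw)

# Each element is one attempt, by one user, at one exercise part.
print(len(attempts), attempts[0]["user_id"])
```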

Object Description

Below we describe the user attempt object properties, grouping them into identifiers, indexes, exercise metadata, global evaluation data, and fine-grained evaluation data (note-by-note).

Identifiers:


  • user_id [string] : Unique id for each user
  • song_id [string] : Unique id for each song
  • exercise_id [string] : Unique id for each song arrangement

All ids are 8-character, Base64-encoded strings.

The dataset contains over 500 songs, adding up to over 900 exercises. Each user has played at least 98 songs, and users make, on average, around 10 attempts at a song’s exercise.

Indexes:


  • days_since_signup [number] : Index of the day relative to the user’s signup date
  • session_index [number] : Index of the user session
  • exercise_part_index [number] : Index of the song part (e.g. ‘intro’, ‘chorus’)

The days_since_signup field goes from 0 to 31, as a float with six decimal places. It allows temporal sequencing of every attempt at a given exercise within a session.
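For instance, a single user’s timeline can be reconstructed by filtering on user_id and sorting on days_since_signup. A sketch with made-up records (field names follow the description above):

```python
# Illustrative attempt records (only the fields needed for sequencing).
attempts = [
    {"user_id": "u1", "days_since_signup": 2.514301, "exercise_id": "e2"},
    {"user_id": "u1", "days_since_signup": 0.103450, "exercise_id": "e1"},
    {"user_id": "u2", "days_since_signup": 5.000012, "exercise_id": "e1"},
]

# Chronological history of a single user, earliest attempt first.
u1_history = sorted(
    (a for a in attempts if a["user_id"] == "u1"),
    key=lambda a: a["days_since_signup"],
)
print([a["exercise_id"] for a in u1_history])
```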

Exercise data:

  • song_type [string] : Either ‘yousician_song’ or ‘weekly_challenge’
  • time_playing [number]: Time playing an exercise, in seconds
  • is_played_in_full [bool]: Whether the user attempted to play the exercise in full, as opposed to just a part of the song
  • play_mode [string]: Either ‘play’ or ‘practise’
  • exercise_difficulty_level [number]: Level in the Yousician scale
  • exit_status [string]: succeeded/restarted/quit/failed/unknown (with “unknown” as the default status for practice mode)

Standard ‘yousician_songs’ are featured in the syllabus or on the home screen. Conversely, ‘weekly_challenges’ are songs presented (initially) in a contest setting. They are often thematically curated, targeting a specific style or technique.

The Yousician difficulty scale goes from 0 to 15. Exercises under 5 are appropriate for beginners, those between 5 and 10 are intermediate, and those above 10 are advanced.
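A small helper for bucketing levels into these bands. Note that the exact treatment of the boundary values 5 and 10 is our own choice; the text only gives the ranges:

```python
def difficulty_band(level):
    # Yousician scale runs 0-15. How the boundary values 5 and 10 are
    # assigned is an assumption; the text only gives the ranges.
    if level < 5:
        return "beginner"
    if level <= 10:
        return "intermediate"
    return "advanced"

print(difficulty_band(3), difficulty_band(7), difficulty_band(12))
```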

The time playing an exercise is measured differently depending on the selected song evaluation mode. In play mode, the measurement does not include time spent on pause, while in practise mode it does.

Global Evaluation Data:

  • notes_evaluated [number]: Total number of single notes in the exercise
  • notes_successful [number]: Number of notes played correctly
  • chords_evaluated [number]: Total number of chords in the exercise
  • chords_successful [number]: Number of chords played correctly

Correctly played here means that the correct note or chord was played at a time that was sufficiently close to the notated timing.

All attempts, whether played in play or practise mode, get global evaluation data.
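A per-attempt accuracy can be computed from these four counters. A sketch (the field names are from the list above; the example values are made up):

```python
def attempt_accuracy(attempt):
    """Fraction of evaluated notes and chords played correctly in one attempt.
    Returns None when the exercise contained nothing to evaluate."""
    evaluated = attempt["notes_evaluated"] + attempt["chords_evaluated"]
    successful = attempt["notes_successful"] + attempt["chords_successful"]
    return successful / evaluated if evaluated else None

# Illustrative attempt: 30/40 notes and 5/10 chords correct, 0.7 overall.
example = {"notes_evaluated": 40, "notes_successful": 30,
           "chords_evaluated": 10, "chords_successful": 5}
print(attempt_accuracy(example))
```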

Fine-Grained Evaluation Data (note-by-note):

  • events_data (JSON formatted string)
    • data.reject_reason [array[int]]: Why a note or chord was considered incorrectly played — See addendum.
    • data.timing_offset [array[int]]: Time between played and notated onset. In centiseconds (0.01s). E.g. -2 means played 20ms early.
    • data.pitch_offset [array[int]]: Deviation between played and notated pitch. In semitones.
    • data.duration [array[int]]: Notated note/chord duration. In centiseconds (0.01s). E.g. 123 means 1.23 seconds — See addendum.
    • data.pitches [array]: Each element an array of MIDI note numbers — See addendum.
    • data.strings [array]: Each element an array of Ukulele string indexes — See addendum.

The ‘events_data’ field is itself a JSON object, provided as a JSON string. Note-by-note evaluation data is only available for play mode. The JSON structure is still present in practise mode data, but all the arrays are empty.
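Unpacking it therefore takes a second parsing step. The sketch below assumes the arrays sit under a top-level "data" key, as the data.* field names above suggest; the values are illustrative:

```python
import json

# An illustrative attempt object with a JSON-string 'events_data' field.
# We assume the arrays sit under a top-level "data" key, as the
# data.* field names suggest.
attempt = {
    "play_mode": "play",
    "events_data": json.dumps({"data": {
        "reject_reason": [0, 2, 0],
        "timing_offset": [-2, 0, 3],
        "pitch_offset": [0, 1, 0],
    }}),
}

events = json.loads(attempt["events_data"])["data"]
# Reject reason 0 means "played correctly"; in practise mode
# all of these arrays would be empty.
correct = sum(1 for r in events["reject_reason"] if r == 0)
print(correct, len(events["reject_reason"]))
```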


Known evaluation issues

The detection of the C chord, voiced at first position (tab <0003>), is intentionally picky. Since the chord is relatively easy to play, even for absolute beginners, we require that the one string that involves the fretting hand is pressed and plucked in a way that is clearly discernible to the app’s evaluator.

Conversely, the detection of the Am7 chord, at second position (tab <2433>), is intentionally loose: we allow for a less-than-perfect chord performance, e.g. unintended partial muting of some strings.


Addendum

Reject reason codes:

    • 0: None (the note or chord was played correctly, and highlighted green)
    • 2: Pitch content all wrong
    • 3: Onset missing
    • 4 or 12: User playing not audible
    • 5: Wrong pitch class present
    • 6: First pitch class missing
    • 8: Second pitch class missing
    • 9: Third pitch class missing
    • 10: Fourth or higher pitch class missing


Each element within ‘duration’ is a number representing the note or chord duration in centiseconds. Duration values have been rounded to {10, 30, 50, 70, 90}, with 90 centiseconds as a cap for any longer duration, e.g.

[13, 45, 83, 125] maps to [10, 50, 90, 90]
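As far as the worked example shows, the mapping behaves like nearest-bucket rounding with a cap. A sketch implementing that reading (exact midpoint behaviour is an assumption):

```python
def round_duration(csecs):
    # Nearest published bucket, capped at 90 centiseconds. The rule is
    # inferred from the worked example; exact midpoint behaviour may differ.
    if csecs >= 90:
        return 90
    return min((10, 30, 50, 70, 90), key=lambda b: abs(b - csecs))

print([round_duration(d) for d in [13, 45, 83, 125]])
```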

Each element within ‘pitches’ is an array of MIDI note numbers. Single notes are represented as arrays of length 1, and double stops and chords as arrays of length > 1.

For the latter, the first element is a MIDI note number, and the rest are cumulative differentials with respect to that number, e.g.

[60, 4, 3, 5] == [60, 64, 67, 72]
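Decoding back to absolute MIDI note numbers is a running sum, e.g.

```python
def decode_pitches(encoded):
    # First element is an absolute MIDI note number; each following
    # element is a differential relative to the previous pitch.
    out = list(encoded[:1])
    for step in encoded[1:]:
        out.append(out[-1] + step)
    return out

print(decode_pitches([60, 4, 3, 5]))
```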

Each element within ‘strings’ is an array specifying the string on which each pitch from ‘pitches’ was played, e.g.

[2, 1, 3, 0]

The string index starts from the bottom-most, highest-pitched string. Yousician songs are arranged for 4-string ukuleles in standard tuning (gCEA), with reference A4 = MIDI 69 = 440 Hz.
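Under that convention, index 0 would be the A string and index 3 the re-entrant g string. A lookup of open-string MIDI pitches, under our reading of the description (verify the index direction against the data):

```python
# Open-string MIDI pitches in standard gCEA tuning, indexed from the
# bottom-most, highest-pitched string: 0 = A4, 1 = E4, 2 = C4, 3 = g4.
# This index-to-string reading is our assumption, inferred from the
# description above; verify it against the data.
OPEN_STRING_MIDI = [69, 64, 60, 67]

# The 'strings' example [2, 1, 3, 0] mapped to open-string pitches:
print([OPEN_STRING_MIDI[s] for s in [2, 1, 3, 0]])
```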
