Mahler community cleanup

The classical music community on MusicBrainz decided to concentrate on one composer and do a monthly cleanup of Debussy data in the database.

Here is the state of the database early September 2017 before the cleanup

Setup

In [1]:
%run startup.ipy
Last notebook update: 2018-06-06
Git repo: git@bitbucket.org:loujine/musicbrainz-dataviz.git
Importing libs
Defining database parameters

Defining *sql* helper function
Last database update: 2018-06-02

Python packages versions:
numpy       1.14.3
pandas      0.23.0
sqlalchemy  1.2.8
CPython 3.7.0b5
IPython 6.4.0
In [2]:
composer_mbid = '8d610e51-64b4-4654-b8df-064b0fb7a9d9'
composer_name = 'Gustav Mahler'

Work list compared to IMSLP

In [3]:
def work_list(composer_name):
    composer = sql("""
SELECT gid AS mbid,
       name
  FROM artist
 WHERE name = %(composer_name)s;
    """, composer_name=composer_name)

    works = sql("""
SELECT w.name            AS work_name,
       lt.text_value     AS catalog_no,
       s.name            AS catalog,
       s.id              AS catalog_id,
       wt.name           AS work_type,
       la.name           AS language,
       watav.value       AS key,
       to_date(to_char(l.begin_date_year, '9999') ||
               to_char(COALESCE(l.end_date_month, 1), '99') ||
               to_char(COALESCE(l.end_date_day, 1), '99'), 'YYYY MM DD') AS start,
       to_date(to_char(l.end_date_year, '9999') ||
               to_char(COALESCE(l.end_date_month, 1), '99') ||
               to_char(COALESCE(l.end_date_day, 1), '99'), 'YYYY MM DD') AS end,
       w.gid             AS mbid
  FROM work          AS w
  JOIN l_artist_work AS law ON w.id = law.entity1
  JOIN artist        AS a   ON a.id = law.entity0
  JOIN link          AS l   ON l.id = law.link
LEFT OUTER JOIN l_series_work                     AS lsw ON w.id = lsw.entity1
LEFT OUTER JOIN series                            AS s   ON s.id = lsw.entity0
LEFT OUTER JOIN link_attribute_text_value         AS lt  ON lt.link = lsw.link
LEFT OUTER JOIN work_language                     AS wl  ON wl.work = w.id
LEFT OUTER JOIN language                          AS la  ON la.id = wl.language
LEFT OUTER JOIN work_type                         AS wt  ON w.type = wt.id
LEFT OUTER JOIN work_attribute                    AS wa  ON wa.work = w.id
LEFT OUTER JOIN work_attribute_type               AS wat   ON wat.id = wa.work_attribute_type
LEFT OUTER JOIN work_attribute_type_allowed_value AS watav ON watav.id = wa.work_attribute_type_allowed_value
 WHERE a.gid = %(composer_mbid)s
 ORDER BY start, work_name;
    """, composer_mbid=composer.mbid[0])

    works['url'] = works.mbid.apply(mb_work_link)
    return works

works = work_list(composer_name)
In [4]:
print('Number of works: {}'.format(works.shape[0]))

print('10 first works:')
iplot(ff.create_table(works[['start', 'work_name', 'url']].head(10)))
Number of works: 437
10 first works:

Comparison with works in IMLSP

In [5]:
imslp = pandas.read_html('http://imslp.org/wiki/List_of_works_by_Gustav_Mahler', header=0, index_col=0)[0]
In [6]:
imslp.head()
Out[6]:
Genre Title Key Notes
Year
1883–88 Symphony Symphony No.1 [“Titan”] D major (1883–88, rev. 1893–96, 2nd rev. 1906)
1887–94 Symphony Choral Symphony No.2[“Auferstehung” (“Resurrection“)] C minor –E♭ major (1887–94, rev. 1903) for Soprano, Alto, Chorus...
1893–96 Symphony Choral Symphony No.3 D minor –D major (1893–96, rev. 1902) for Alto, Female Chorus, ...
1892 Symphony Vocal Symphony No.4 G major –E major (1892, 1899–1901, rev. 1901–10) for Soprano an...
1901–02 Symphony Symphony No.5 C♯ minor –D major (1901–02, scoring repeatedly rev.)

Relationship on recordings

Should contain only recordings where Mahler was a performer (conductor)

In [7]:
sql("""
SELECT COUNT(*)
  FROM recording          AS r
  JOIN l_artist_recording AS lar ON r.id = lar.entity1
  JOIN artist             AS a   ON a.id = lar.entity0
 WHERE a.gid = %(composer_mbid)s
""", composer_mbid=composer_mbid)
Out[7]:
count
0 14

Relationship on releases

The "composer" relation on releases is deprecated

In [8]:
sql("""
SELECT COUNT(*)
  FROM release            AS r
  JOIN l_artist_release   AS lar ON r.id = lar.entity1
  JOIN artist             AS a   ON a.id = lar.entity0
 WHERE a.gid = %(composer_mbid)s
""", composer_mbid=composer_mbid)
Out[8]:
count
0 1

As release group artist

In [9]:
sql("""
SELECT COUNT(*)
--       rg.name AS album,
--       ac.name AS artists,
--       rg.gid AS mbid
  FROM artist             AS a
  JOIN artist_credit_name AS acn ON a.id = acn.artist
  JOIN artist_credit      AS ac  ON ac.id = acn.artist_credit
  JOIN release_group      AS rg  ON rg.artist_credit = ac.id
--  JOIN release            AS r   ON r.release_group = rg.id
--  JOIN release_event      AS ev  ON ev.release = r.id
 WHERE a.gid = %(composer_mbid)s
   ;""", composer_mbid=composer_mbid)
Out[9]:
count
0 1042

As recording artist

Should contain only recordings where Mahler was a performer (composer)

In [10]:
sql("""
SELECT COUNT(*)
  FROM artist             AS a
  JOIN artist_credit_name AS acn ON a.id = acn.artist
  JOIN artist_credit      AS ac  ON ac.id = acn.artist_credit
  JOIN recording          AS r   ON r.artist_credit = ac.id
 WHERE a.gid = %(composer_mbid)s
   ;""", composer_mbid=composer_mbid)
Out[10]:
count
0 145

Edits history

In [11]:
df = sql("""
SELECT edit.open_time AS date, 
       1 AS edit_count,
       editor.name AS editor
  FROM artist
  JOIN edit_artist ON artist.id = edit_artist.artist
  JOIN edit        ON edit.id = edit_artist.edit
  JOIN editor      ON editor.id = edit.editor
 WHERE artist.gid = %(composer_mbid)s
   ;""", composer_mbid=composer_mbid)

df.date = df.date.apply(lambda x: x.date())
df.head()
Out[11]:
date edit_count editor
0 2011-01-26 1 gswanjord
1 2009-03-16 1 cooperaa
2 2009-12-10 1 yuanyelele
3 2011-01-23 1 gswanjord
4 2008-05-03 1 leivhe
In [12]:
print('Total numbers of edits regarding {} in history: {}'.format(composer_name, len(df)))
Total numbers of edits regarding Gustav Mahler in history: 80509

Edits frequency by month

In [13]:
edit_frequency = df[['date', 'edit_count']].groupby('date').sum()
edit_frequency.index = pandas.to_datetime(edit_frequency.index)
edit_frequency = edit_frequency.resample('M').sum()
print(edit_frequency.tail(10))
iplot({
    'data': [dict(type='bar', x=edit_frequency.index, y=edit_frequency.edit_count)],
    'layout': {'title': "Number of edits per month for {}".format(composer_name),
               'xaxis': {'title': 'Date',
                         'range': [to_unix_time('2012-10-01'), to_unix_time('2018-01-01')]},
               'yaxis': {'title': 'Number of edits'},
              }
})
            edit_count
date                  
2017-08-31         518
2017-09-30        2099
2017-10-31       24079
2017-11-30        1863
2017-12-31         283
2018-01-31         356
2018-02-28         893
2018-03-31        1007
2018-04-30         362
2018-05-31          78

Edits split by editor

Edits by editor and day (limited to editors with at least 500 edits on Mahler)

In [14]:
pt = df.pivot_table('edit_count', index='date', columns='editor', aggfunc='sum').fillna(0)
pt = pt.loc[:, lambda pt: pt.sum(0) > 500]
In [15]:
iplot({
    'data': [{'x': pt.index, 'y': pt[field],
              'name': field, 'type': 'bar'} for field in pt.columns],
    'layout': {'title': "Number of edits per day per editor",
               'xaxis': {'title': 'Date',
                         'range': [to_unix_time('2017-09-01'), to_unix_time('2017-10-31')]},
               'yaxis': {'title': 'Number of edits',
                         'range': [0, 1500]}
              }
})