Visualizing userscripts use in MusicBrainz¶

In this notebook I'm interested in looking at how much MusicBrainz editors use advanced tools to modify the MusicBrainz website interface and help them edit faster. In order to do this I used a snapshot of the MusicBrainz (editing history included) from early March 2017.

Disclaimer: I am myself a developper and heavy user of userscripts, so this notebook should also help learn about the community that uses my scripts.

About userscripts¶

Userscripts (using TamperMonkey, GreaseMonnkey...) are code that can be applied on certain webpages to modify the way it is displayed or behaves. Advanced MusicBrainz editors often use these scripts to simplify repetitive operations (similarly to batch operations the relationship editor allows) and replace actions that normally require dozen of clicks in the satandard interface. Userscripts are usually not recommended for beginner editors since they can modify the standard behavior or make pages more complex (adding boxes for new actions).

%run startup.ipy

Last notebook update: 2018-06-06
Git repo: git@bitbucket.org:loujine/musicbrainz-dataviz.git
Importing libs

Defining database parameters

Defining *sql* helper function
Last database update: 2018-06-02

Python packages versions:
numpy       1.14.3
pandas      0.23.0
sqlalchemy  1.2.8
CPython 3.7.0b5
IPython 6.4.0

Looking for edits using a certain userscript¶

Let's take the example of the "MASS MERGE RECORDINGS" script written by jesus2099. It allows to merge easily a set of recordings, e.g. recordings used on two editions of the same album that were added indenpendently in MusicBrainz.

The simplest way to find edits using this userscript is to look for the automatic edit note left by this script:

# this query can take several minutes
import psycopg2
with psycopg2.connect(host=PGHOST, database=PGDATABASE, 
                      user=PGUSER, password=PGPASSWORD) as cnx:
    crs = cnx.cursor()
    crs.execute("""
SELECT COUNT(*) 
  FROM edit_note 
 WHERE text ILIKE '%MASS MERGE RECORDINGS%';
""")
pprint(crs.fetchall())

[(511357,)]

So more than 500.000 edits used this script (note that since the edit note is written for each recording used in the mass merge, the actual number of calls to this script is probably at least twice lower – still! that's impressive).

Looking for editors using a certain userscript¶

So what's next? The edit_note table also gives us the name of the editor in each case, so we should be able to regroup edits by editor:

# this query can take several minutes
df = sql("""
SELECT editor.name AS editor,
       1 AS cnt,          -- added to make easier to compute partial counts later
       post_time AS date
  FROM editor
  JOIN edit_note ON editor = editor.id
 WHERE edit_note.text ILIKE '%%MASS MERGE RECORDINGS%%';
""")
df.date = df.date.apply(lambda d: d.normalize()) # simplify dates, remove the hours
df.head()

We can regroup the edits by editor to have partial counts:

df.groupby('editor').sum().size

191

So 191 distinct editors used this script. Let's find the most prolific editors:

editor_count = df.groupby('editor').sum().sort_values(by='cnt', ascending=False)
editor_count.head(10)

So jesus2099 is only the second most frequent user of his own script. What if we plot the distribution of the 50 most frequent users?

%matplotlib inline
editor_count.head(50).plot(kind='bar', rot=90, figsize=(16,12), fontsize=10)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb14d159eb8>

As we could expect, a few users used the script more than 10.000 times, and we have a long tail of people who used the script just a few times.

Using plotly¶

Up to now we used the matplotlib library, which is great for static graphics. One other possibility is to use the plot.ly library for interactive graphs, using its IPython interface.

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

iplot([go.Bar(x=editor_count.head(50).index, y=editor_count.head(50).cnt)])

Displaying a userscript use with time¶

date_count = df.groupby('date').sum()
date_count.head(10)

iplot([go.Scatter(x=date_count.index, y=date_count.cnt)])

This is the distribution of edits using the userscript by day. Of course fluctuations are quite important... it might be cleaner to display the number of edits by week. Fortunately pandas allows to resample data easily:

dc = date_count.asfreq('W').fillna(0)
iplot([go.Scatter(x=dc.index, y=dc.cnt)])

Much nicer. What if we try to combine "by user" and "by date"?

editor_and_date_count = (df.pivot_table(
    'cnt', index='date', columns='editor', aggfunc='sum')
    .asfreq('D')
    .fillna(0)
    .loc['2014-05-01':, lambda df: df.sum(0) > 5000])
editor_and_date_count.head(10)

data = []
for editor in editor_and_date_count.columns:
    trace = go.Bar(
        x=editor_and_date_count.index,
        y=editor_and_date_count[[editor]].values.flatten(),
        name=editor
    )
    data.append(trace)

iplot(go.Figure(data=data, 
                layout=go.Layout(barmode='stack')))

You should be able to interact with this graph, zoom in and out, etc.

	cnt
editor
pomes27	71753
jesus2099	57144
pankkake	48962
bflaminio	42605
monxton	33356
loujin	22009
a23bed	19315
drsaunde	16276
ListMyCDs.com	14583
ProfChris	14414

	cnt
date
2011-12-13 00:00:00+00:00	6
2011-12-15 00:00:00+00:00	13
2011-12-30 00:00:00+00:00	11
2012-01-03 00:00:00+00:00	13
2012-01-16 00:00:00+00:00	16
2012-01-17 00:00:00+00:00	13
2012-01-24 00:00:00+00:00	13
2012-01-27 00:00:00+00:00	42
2012-07-03 00:00:00+00:00	3
2012-09-04 00:00:00+00:00	1

editor	HibiscusKazeneko	ListMyCDs.com	Midness	ProfChris	Senax	TheBookkeeper	a23bed	bflaminio	drsaunde	ebz777	jesus2099	loujin	maxmh	monxton	otters61	pankkake	pomes27	reosarevok	tigerman325
date
2014-05-01 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-02 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-03 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-04 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-05 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-06 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-07 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-08 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-09 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2014-05-10 00:00:00+00:00	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0