Distribution of edits by editor experience

I was intrigued by this graph showing the distribution of edits on OpenStreetMap (OSM) by editor experience (apologies to the person who suggested it on IRC; I don't remember who it was). Its conclusion is that the majority of edits are actually made by a small minority of experienced editors.

Is the situation similar for MusicBrainz?

Setup

%run startup.ipy

Last notebook update: 2018-06-07
Git repo: git@bitbucket.org:loujine/musicbrainz-dataviz.git
Importing libs

Defining database parameters

Defining *sql* helper function
Last database update: 2018-06-02

Python packages versions:
numpy       1.14.3
pandas      0.23.0
sqlalchemy  1.2.8
CPython 3.7.0b5
IPython 6.4.0

Fetch data from the DB

We could limit the edit history by date (open_time > ...) but let's take everything since the current edit system was created (in 2012 I think).

edits_count = sql("""
SELECT editor.name AS editor,
       COUNT(*) AS cnt
  FROM edit
  JOIN editor ON editor.id = edit.editor
 WHERE editor.name != 'ModBot'
-- AND edit.open_time >= '2017-01-01'
GROUP BY editor.name
ORDER BY cnt DESC
-- LIMIT 1000
;
""")

edits_count.index = edits_count.editor
edits_count.drop('editor', axis=1, inplace=True)
edits_count.head()

Those are the most active editors (i.e. those with the highest number of edits) as of mid-2018. The results should be close to the official editor statistics page.
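To experiment with the query without access to a MusicBrainz database mirror, here is a minimal sketch that runs the same aggregation against an in-memory SQLite database. The two-table schema and the sample rows are made up for illustration and only keep the columns the query touches (the real MusicBrainz database is PostgreSQL and its tables have many more columns). The `edits_per_editor` helper and its `since` parameter are hypothetical names; `since` plays the role of the commented-out `open_time` filter.

```python
import sqlite3

# Toy stand-in for the MusicBrainz schema (assumption: only the columns
# used by the query are modelled, and dates are stored as ISO strings).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE editor (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE edit (id INTEGER PRIMARY KEY, editor INTEGER, open_time TEXT);
""")
conn.executemany("INSERT INTO editor VALUES (?, ?)",
                 [(1, "ModBot"), (2, "alice"), (3, "bob")])
conn.executemany("INSERT INTO edit (editor, open_time) VALUES (?, ?)",
                 [(1, "2017-03-01"), (2, "2016-05-01"), (2, "2017-06-01"),
                  (2, "2018-01-15"), (3, "2017-02-01")])

QUERY = """
SELECT editor.name AS editor,
       COUNT(*) AS cnt
  FROM edit
  JOIN editor ON editor.id = edit.editor
 WHERE editor.name != 'ModBot'
   AND edit.open_time >= :since
GROUP BY editor.name
ORDER BY cnt DESC
"""


def edits_per_editor(since="0000-01-01"):
    """Return (editor, edit count) rows, most active editors first."""
    return conn.execute(QUERY, {"since": since}).fetchall()


all_time = edits_per_editor()
since_2017 = edits_per_editor("2017-01-01")
print(all_time)
print(since_2017)
```

Passing the date as a bound parameter rather than editing the SQL string makes it easy to compare the all-time ranking with a more recent window.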

print('Number of editors: {}'.format(len(edits_count)))
print('Number of edits: {}'.format(edits_count.sum().values[0]))

Number of editors: 195938
Number of edits: 49148162

Split editors in bins

We can split the editors into bins corresponding to their "experience level" (from complete novice to old and wise auto-editor). To do that, we add a "category" column to our dataframe.

Note that the boundaries between bins are completely arbitrary.

bounds = [0, 5, 10, 20, 50, 100, 1000, 10000, 100000, 1000000, 10000000]
names = ['hit-and-run', 'newbie', 'casual', 'great', 'heavy', 
         'super', 'legendary', 'fantastic', 'mega', 'epic']

edits_count['category'] = pandas.cut(edits_count.cnt, bins=bounds)

edits_count.head()

So there are (currently) 4 editors in the "epic" category (more than 1 million edits).
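By default pandas.cut builds right-closed intervals, so an editor with exactly 5 edits lands in (0, 5] and one with 6 edits in (5, 10]. The sketch below reproduces that rule in plain Python with the same bounds and names, which can help when checking where a given count ends up. The `category` function is a hypothetical helper, not part of the notebook, and unlike pandas.cut (which yields NaN) it raises an error for counts outside the bins.

```python
from bisect import bisect_left

bounds = [0, 5, 10, 20, 50, 100, 1000, 10000, 100000, 1000000, 10000000]
names = ['hit-and-run', 'newbie', 'casual', 'great', 'heavy',
         'super', 'legendary', 'fantastic', 'mega', 'epic']


def category(cnt):
    """Return the name of the right-closed bin (low, high] containing cnt."""
    if not bounds[0] < cnt <= bounds[-1]:
        raise ValueError('edit count {} is outside the bins'.format(cnt))
    # bisect_left finds the first bound >= cnt, i.e. the upper edge of the bin
    return names[bisect_left(bounds, cnt) - 1]


for cnt in (5, 6, 816887, 1721106):
    print(cnt, category(cnt))
```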

Split edit count by category

Now we want to compute the total count of edits made by editors in each category.

cats = edits_count.groupby('category').count()
cats = cats.rename({"cnt": "nb_editors"}, axis="columns")
cats['nb_edits'] = edits_count.groupby('category').sum().values
cats.index = ['{name} {idx}'.format(name=name, idx=idx)
              for (name, idx) in zip(names, cats.index)]
cats
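For readers less familiar with groupby, the two aggregations above (a count of editors and a sum of edits per category) amount to the following loop. This is only an illustrative sketch in plain Python; the rows are made-up sample data, not values from the database.

```python
# (editor, edit count, category) rows; made-up sample data
rows = [
    ('alice', 3, 'hit-and-run'),
    ('bob', 1, 'hit-and-run'),
    ('carol', 8, 'newbie'),
    ('dave', 250, 'super'),
    ('erin', 900, 'super'),
]

cats = {}
for _editor, cnt, cat in rows:
    entry = cats.setdefault(cat, {'nb_editors': 0, 'nb_edits': 0})
    entry['nb_editors'] += 1    # what groupby(...).count() computes
    entry['nb_edits'] += cnt    # what groupby(...).sum() computes

for cat, entry in cats.items():
    print('{:<12} {:>3} {:>6}'.format(cat, entry['nb_editors'], entry['nb_edits']))
```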

Let's plot those results as bar graphs and pie charts using plotly (so that the graphs are interactive).

iplot({
    'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_editors}],
    'layout': {'title': 'Number of editors by category',
               'xaxis': {'title': 'Editor category'},
               'yaxis': {'title': 'Number of editors'},
              }
})

No surprise there: more than half of all editors have made at most 5 edits, and over 90% have made at most 100...

iplot({
    'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_edits}],
    'layout': {'title': 'Number of edits by category',
               'xaxis': {'title': 'Editor category'},
               'yaxis': {'title': 'Number of edits'},
              }
})

... but the vast majority of edits (about 96%) are made by experienced users (more than 100 edits each), who make up only about 7% of editors.
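To put numbers on that observation, the shares can be computed directly from the per-category totals. The sketch below copies the figures from the notebook's cats output and treats every category above 100 edits as "experienced"; the variable names are illustrative.

```python
# (category, nb_editors, nb_edits), copied from the notebook's `cats` output
table = [
    ('hit-and-run', 104858, 231326),
    ('newbie', 27866, 212614),
    ('casual', 22096, 324307),
    ('great', 19110, 609774),
    ('heavy', 8549, 606343),
    ('super', 11006, 3137339),
    ('legendary', 1925, 5487680),
    ('fantastic', 449, 13314860),
    ('mega', 75, 19410039),
    ('epic', 4, 5813880),
]

# Categories whose editors have more than 100 edits each
experienced = {'super', 'legendary', 'fantastic', 'mega', 'epic'}

total_editors = sum(editors for _name, editors, _edits in table)
total_edits = sum(edits for _name, _editors, edits in table)
exp_editors = sum(editors for name, editors, _edits in table
                  if name in experienced)
exp_edits = sum(edits for name, _editors, edits in table
                if name in experienced)

editor_share = 100 * exp_editors / total_editors
edit_share = 100 * exp_edits / total_edits
print('Experienced editors: {:.1f}% of editors, {:.1f}% of edits'.format(
    editor_share, edit_share))
```

The totals match the "Number of editors" and "Number of edits" printed earlier, which is a quick sanity check that no editor fell outside the bins.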

The same results as pie charts:

iplot({
    'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_editors, 
              'sort': False, 'direction': 'clockwise'}],
    'layout': {'title': 'Number of editors by category'}
})

iplot({
    'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_edits, 
              'sort': False, 'direction': 'clockwise'}],
    'layout': {'title': 'Number of edits by category'}
})

editor                cnt
reosarevok        1721106
TheBookkeeper     1645012
drsaunde          1305561
ListMyCDs.com     1142201
HibiscusKazeneko   816887

editor                cnt  category
reosarevok        1721106  (1000000, 10000000]
TheBookkeeper     1645012  (1000000, 10000000]
drsaunde          1305561  (1000000, 10000000]
ListMyCDs.com     1142201  (1000000, 10000000]
HibiscusKazeneko   816887  (100000, 1000000]

category                   nb_editors  nb_edits
hit-and-run (0, 5]             104858    231326
newbie (5, 10]                  27866    212614
casual (10, 20]                 22096    324307
great (20, 50]                  19110    609774
heavy (50, 100]                  8549    606343
super (100, 1000]               11006   3137339
legendary (1000, 10000]          1925   5487680
fantastic (10000, 100000]         449  13314860
mega (100000, 1000000]             75  19410039
epic (1000000, 10000000]            4   5813880