I was intrigued by this graph for OSM that shows how edits on OpenStreetMap are distributed according to the experience of the editors who made them (apologies to the person who suggested it on IRC, I don't remember who it was). Its conclusion is that the majority of edits are actually made by a small minority of experienced editors.
Is the situation similar for MusicBrainz?
%run startup.ipy
We could limit the edit history by date (open_time > ...) but let's take everything since the current edit system was created (in 2012 I think).
edits_count = sql("""
SELECT editor.name AS editor,
COUNT(*) AS cnt
FROM edit
JOIN editor ON editor.id = edit.editor
WHERE editor.name != 'ModBot'
-- AND edit.open_time >= '2017-01-01'
GROUP BY editor.name
ORDER BY cnt DESC
-- LIMIT 1000
;
""")
edits_count.index = edits_count.editor
edits_count.drop('editor', axis=1, inplace=True)
edits_count.head()
Those are the most active editors (i.e. those with the highest number of edits) as of mid-2018. The results should be close to the official editors statistics page.
print('Number of editors: {}'.format(len(edits_count)))
print('Number of edits: {}'.format(edits_count.sum().values[0]))
We can split the editors into different bins corresponding to their "experience level" (from complete novice to old and wise auto-editor). To do that, we add a "category" column to our dataframe.
Note that the boundaries between bins are completely arbitrary.
bounds = [0, 5, 10, 20, 50, 100, 1000, 10000, 100000, 1000000, 10000000]
names = ['hit-and-run', 'newbie', 'casual', 'great', 'heavy',
'super', 'legendary', 'fantastic', 'mega', 'epic']
edits_count['category'] = pandas.cut(edits_count.cnt, bins=bounds)
edits_count.head()
So there are (currently) 4 editors in the "epic" category (more than 1 million edits).
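As a side note, here is a minimal sketch of how the binning behaves at the boundaries. It runs on a toy dataframe with made-up editor names and counts (not MusicBrainz data) and a shortened list of bounds. It also shows the `labels=` argument of `pandas.cut`, which is an alternative to the renaming done later in this notebook, not what the notebook itself uses.

```python
import pandas as pd

# Toy edit counts for five hypothetical editors (not real MusicBrainz data).
toy = pd.DataFrame(
    {"cnt": [1, 5, 6, 100, 101]},
    index=["ed_a", "ed_b", "ed_c", "ed_d", "ed_e"],
)

# Shortened versions of the bounds and names used above.
bounds = [0, 5, 10, 20, 50, 100, 1000]
names = ["hit-and-run", "newbie", "casual", "great", "heavy", "super"]

# Bins are right-closed by default: (0, 5], (5, 10], ..., (100, 1000].
toy["interval"] = pd.cut(toy.cnt, bins=bounds)
# Passing labels= names the bins directly instead of showing the intervals.
toy["category"] = pd.cut(toy.cnt, bins=bounds, labels=names)

print(toy)
```

Because the intervals are closed on the right, an editor with exactly 5 edits is still "hit-and-run" and one with exactly 100 edits is still "heavy"; each category starts strictly above its lower bound.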
Now we want to compute the total count of edits made by editors in each category.
cats = edits_count.groupby('category').count()
cats = cats.rename({"cnt": "nb_editors"}, axis="columns")
# Select the cnt column explicitly so that .values is a 1-D array
cats['nb_edits'] = edits_count.groupby('category').cnt.sum().values
cats.index = ['{name} {idx}'.format(name=name, idx=idx)
for (name, idx) in zip(names, cats.index)]
cats
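The same per-category table can probably be built in a single pass with named aggregation. This is only a sketch on toy data (the counts and bounds below are invented), and `toy_cats` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Six hypothetical editors with made-up edit counts.
toy = pd.DataFrame({"cnt": [1, 3, 7, 8, 250, 4000]})
toy["category"] = pd.cut(toy.cnt, bins=[0, 5, 10, 100, 1000, 10000])

# One pass over the groups: "count" gives the number of editors,
# "sum" the number of edits. observed=False keeps empty bins as rows.
toy_cats = toy.groupby("category", observed=False).cnt.agg(
    nb_editors="count", nb_edits="sum"
)
print(toy_cats)
```

Keeping the empty bins (here the one between 10 and 100) matters for the plots: every category keeps its slot on the x axis even when nobody falls into it.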
Let's plot those results as bar graphs and pie charts using plotly (so that the graphs are interactive).
iplot({
'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_editors}],
'layout': {'title': 'Number of editors by category',
'xaxis': {'title': 'Editor category'},
'yaxis': {'title': 'Number of editors'},
}
})
No surprise there, the vast majority of editors have only a few edits...
iplot({
'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_edits}],
'layout': {'title': 'Number of edits by category',
'xaxis': {'title': 'Editor category'},
'yaxis': {'title': 'Number of edits'},
}
})
... but the majority of edits are made by experienced users (more than 100 edits each).
The same results as pie charts:
iplot({
'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_editors,
'sort': False, 'direction': 'clockwise'}],
'layout': {'title': 'Number of editors by category'}
})
iplot({
'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_edits,
'sort': False, 'direction': 'clockwise'}],
'layout': {'title': 'Number of edits by category'}
})
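To put a single number on "a small minority makes most of the edits", one option is to look at the cumulative share of edits when walking down the ranking of editors. The sketch below uses invented counts for ten hypothetical editors; the 20% threshold is an arbitrary choice of mine, and the result says nothing about the real MusicBrainz figures.

```python
import pandas as pd

# Hypothetical per-editor edit counts (made up for illustration).
toy = pd.Series([5000, 3000, 1000, 500, 200, 100, 80, 60, 40, 20], name="cnt")
# Rank editors from most to least active, like ORDER BY cnt DESC in the query.
toy = toy.sort_values(ascending=False)

# Cumulative share of all edits as we walk down the ranking.
cum_share = toy.cumsum() / toy.sum()

# Share of edits made by the top 20% of editors (here: the top 2 of 10).
top_n = int(len(toy) * 0.2)
print("Top {} editors: {:.0%} of edits".format(top_n, cum_share.iloc[top_n - 1]))
```

On the real data, the same three lines should work on `edits_count.cnt`, since the query already returns editors sorted by decreasing edit count.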