In this work we investigate a wide range of data-driven machine learning models (Long Short-Term Memory networks, convolutional neural networks, multilayer perceptrons, Random Forest classifiers, Logistic Regression, and Gradient Boosting classifiers, each with different feature sets) for identifying the gender of the author of Russian multi-genre texts when style distortions and gender deception are present in the training and test sets. We evaluate accuracy in two situations: when style distortions and gender deception occur in training texts of different genres, and when such deception is present only in the test set. A comparison with known results from the literature is presented.

The data corpora include a corpus collected via a crowdsourcing platform, essays by Russian students (RusPersonality), the Gender Imitation corpus, and the corpora used at the Forum for Information Retrieval Evaluation 2017 (FIRE), which contain texts from Facebook, Twitter, and reviews. We analyze numerical experiments that combine different features of the input texts (morphological data, vectors of character n-gram frequencies, LIWC, and others) with various machine learning models. The results, obtained with a wide range of data-driven models, establish the accuracy level for identifying the gender of the author of a Russian text in the multi-genre case and show the effect of deception in the test and training sets.
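As an illustration of one of the feature types mentioned above, a vector of character n-gram frequencies can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in this work: the function names `char_ngram_frequencies` and `vectorize`, the default n-gram length of 3, and the use of relative frequencies are all hypothetical choices made for the example.

```python
from collections import Counter


def char_ngram_frequencies(text, n=3):
    """Return the relative frequency of each character n-gram in `text`.

    Hypothetical helper: the n-gram length and the normalization are
    assumptions made for illustration only.
    """
    # Slide a window of length n over the text, one character at a time.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        # Text shorter than n characters yields no n-grams.
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def vectorize(text, vocabulary, n=3):
    """Map `text` to a fixed-length vector ordered by `vocabulary`.

    N-grams absent from the text get frequency 0.0, so every text is
    represented by a vector of the same length.
    """
    freqs = char_ngram_frequencies(text, n)
    return [freqs.get(gram, 0.0) for gram in vocabulary]
```

In such a setup the vocabulary would typically be built from the training texts only, and the resulting vectors passed to any of the classifiers listed above.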
