Count frequency data arise in many applications, such as traffic accident analysis, car insurance claims, hospital admission records and adverse vaccine reaction data. In some cases, these data contain a high proportion of zero counts and/or heavy tails; the zeros correspond, for example, to no claims in insurance data or no incidents in accident data. In regression modelling, the data size can be formidable owing to a large sample size and/or a large number of covariates, which leads to computational challenges. Although many computer engineering solutions are available through supercomputers and parallel computing, they are limited by cost and accessibility. Statistical solutions have therefore been considered to ameliorate the challenges posed by regression modelling of big data. In general, these statistical solutions are classified as divide-and-conquer approaches, fine-to-coarse methods and subsampling methods. To address the problem of large data size in count regression modelling, we propose a stratified subsampling strategy, with strata defined by frequency classes and shrinkage leveraging used for statistical inference. An attractive feature of this strategy is its ability to preserve characteristics of the data such as overdispersion, high zero counts and/or heavy tails. A Monte Carlo simulation study is conducted to investigate the performance of the proposed stratified subsampling method in regression modelling with big count data. The regression analysis is illustrated using a family of mixed Poisson regression models, which have been shown to be flexible in modelling count data with high zero counts and/or a long tail.
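The general idea of stratified subsampling by frequency class with shrinkage leveraging can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given here. The sketch assumes that strata are the count values 0, 1, ..., with the tail pooled at a hypothetical cut-off `max_class`; that the subsample is allocated proportionally across strata; and that sampling probabilities within a stratum take the usual shrinkage-leveraging form, a mix of normalised leverage scores and uniform probabilities with weight `alpha`. For simplicity it fits a plain Poisson working model by inverse-probability-weighted Newton iterations rather than a mixed Poisson regression. All function names are illustrative.

```python
import numpy as np

def leverage_scores(X):
    # Diagonal of the hat matrix via a thin QR: h_ii = squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def stratified_shrinkage_subsample(X, y, r, alpha=0.9, max_class=3, rng=None):
    """Draw about r rows, stratified by count class (assumed scheme)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    h = leverage_scores(X)
    strata = np.minimum(y, max_class)  # classes 0..max_class, tail pooled
    idx_parts, w_parts = [], []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        n_s = members.size
        r_s = max(1, int(round(r * n_s / n)))  # proportional allocation
        # Shrinkage leveraging inside the stratum: leverage mixed with uniform.
        pi = alpha * h[members] / h[members].sum() + (1 - alpha) / n_s
        pi = pi / pi.sum()  # guard against floating-point drift
        draw = rng.choice(n_s, size=r_s, replace=True, p=pi)
        idx_parts.append(members[draw])
        # Inverse-probability weights for sampling with replacement.
        w_parts.append(1.0 / (r_s * pi[draw]))
    return np.concatenate(idx_parts), np.concatenate(w_parts)

def fit_weighted_poisson(X, y, w, n_iter=50, tol=1e-8):
    # Newton iterations for a weighted Poisson log-linear model.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (w * (y - mu))
        hess = (X * (w * mu)[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated overdispersed counts: a Poisson-gamma mixture with many zeros.
rng = np.random.default_rng(0)
n, r = 20000, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 0.3, -0.2])
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)  # mean-one gamma mixing
y = rng.poisson(np.exp(X @ beta_true) * frailty)

idx, w = stratified_shrinkage_subsample(X, y, r, alpha=0.9, rng=1)
beta_sub = fit_weighted_poisson(X[idx], y[idx], w)
beta_full = fit_weighted_poisson(X, y, np.ones(n))
print("subsample estimate:", beta_sub)
print("full-data estimate:", beta_full)
```

Because every frequency class is sampled in proportion to its size, the subsample keeps roughly the same share of zeros and tail counts as the full data, which is the property highlighted above. The weights correct for unequal selection probabilities, so the subsample estimate targets the same coefficients as the full-data fit.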
