This is a simple PowerShell one-liner that counts how often each first letter occurs across the lines of a sample file.
gc './sample' | %{ $_.substring(0,1) } | group
Running this over, say, the FTSE 100 symbol list returns:
Count Name Group
----- ---- -----
10 A {A, A, A, A...}
11 B {B, B, B, B...}
4 C {C, C, C, C}
2 D {D, D}
2 E {E, E}
3 F {F, F, F}
3 G {G, G, G}
5 H {H, H, H, H...}
8 I {I, I, I, I...}
3 J {J, J, J}
1 K {K}
4 L {L, L, L, L}
4 M {M, M, M, M}
3 N {N, N, N}
1 O {O}
6 P {P, P, P, P...}
8 R {R, R, R, R...}
15 S {S, S, S, S...}
2 T {T, T}
2 U {U, U}
1 V {V}
2 W {W, W}
This highlights that the symbols are not uniformly spread across the alphabet.
A-F accounts for about a third of the symbols (32 of 100), as does P-Z (36 of 100).
I found this out once when trying to use the ticker symbol to load-balance market data across 3 servers.
Check the distribution of the data before you use a simple key.
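A minimal sketch of such a check, assuming the symbols sit one per line in ./sample and that the key splits A-Z into equal alphabetical ranges, one per server. The post does not say how the original load balancing was partitioned, so that split and the names Get-Bucket and Get-BucketLoad are hypothetical, for illustration only.

```powershell
# Hypothetical helper (not from the original post): map a letter to one of
# $ServerCount buckets by splitting A-Z into equal-width alphabetical ranges.
# Assumes the letter is in A-Z; digits or punctuation would land outside 0..N-1.
function Get-Bucket {
    param(
        [char]$Letter,
        [int]$ServerCount = 3
    )
    $index = [int][char]::ToUpperInvariant($Letter) - [int][char]'A'   # A=0 .. Z=25
    [int][math]::Floor([double]($index * $ServerCount) / 26)
}

# Hypothetical helper: count how many symbols each bucket would receive when
# the character at $Position (0 = first letter) is used as the key.
function Get-BucketLoad {
    param(
        [string[]]$Symbols,
        [int]$Position = 0,
        [int]$ServerCount = 3
    )
    $Symbols |
        Where-Object { $_.Length -gt $Position } |   # skip lines that are too short
        Group-Object { Get-Bucket -Letter ($_.Substring($Position, 1)) -ServerCount $ServerCount } -NoElement |
        Sort-Object Name
}

# Usage, assuming ./sample holds one symbol per line:
# Get-BucketLoad -Symbols (Get-Content './sample')
```

With three servers this split gives the ranges A-I, J-R and S-Z. Summing the first-letter counts listed above over those ranges gives 48, 30 and 22 symbols, so one server would carry more than twice the load of another.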
Oddly, the second letter is a better key (its largest group has 9 symbols, against 15 for the first letter):
Count Name Group
----- ---- -----
6 A {A, A, A, A...}
3 B {B, B, B}
4 C {C, C, C, C}
5 D {D, D, D, D...}
3 E {E, E, E}
6 G {G, G, G, G...}
4 H {H, H, H, H}
3 I {I, I, I}
2 K {K, K}
8 L {L, L, L, L...}
6 M {M, M, M, M...}
7 N {N, N, N, N...}
2 O {O, O}
4 P {P, P, P, P}
8 R {R, R, R, R...}
9 S {S, S, S, S...}
7 T {T, T, T, T...}
2 U {U, U}
6 V {V, V, V, V...}
2 W {W, W}
2 X {X, X}
1 Z {Z}
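The post does not show the command behind the second-letter table. A plausible sketch is the same pipeline with the substring offset moved from 0 to 1, wrapped here in a hypothetical function (Get-LetterFrequency is my name, not the author's) so the position becomes a parameter and lines that are too short are skipped rather than raising an error.

```powershell
# Hypothetical helper: frequency of the character at a given position
# (0 = first letter, 1 = second letter) across a list of symbols.
function Get-LetterFrequency {
    param(
        [string[]]$Symbols,
        [int]$Position = 0
    )
    $Symbols |
        Where-Object { $_.Length -gt $Position } |                          # skip lines that are too short
        ForEach-Object { $_.Substring($Position, 1).ToUpperInvariant() } |  # pick the key character
        Group-Object -NoElement |                                           # Count and Name only
        Sort-Object Name
}

# Usage, assuming ./sample holds one symbol per line:
# Get-LetterFrequency -Symbols (Get-Content './sample') -Position 1
```

As a rough comparison, summing the counts above over three equal alphabetical ranges (A-I, J-R, S-Z, which is my assumption about the split and not something the post states) gives 34, 37 and 29 symbols for the second letter, a much flatter spread than the first letter produces.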