This is a simple PowerShell one-liner that counts how often each first letter occurs across the lines of a sample file.
gc './sample' | %{ $_.substring(0,1) } | group
Running this over, say, the FTSE 100 symbol list returns:
Count Name Group
----- ---- -----
   10 A    {A, A, A, A...}
   11 B    {B, B, B, B...}
    4 C    {C, C, C, C}
    2 D    {D, D}
    2 E    {E, E}
    3 F    {F, F, F}
    3 G    {G, G, G}
    5 H    {H, H, H, H...}
    8 I    {I, I, I, I...}
    3 J    {J, J, J}
    1 K    {K}
    4 L    {L, L, L, L}
    4 M    {M, M, M, M}
    3 N    {N, N, N}
    1 O    {O}
    6 P    {P, P, P, P...}
    8 R    {R, R, R, R...}
   15 S    {S, S, S, S...}
    2 T    {T, T}
    2 U    {U, U}
    1 V    {V}
    2 W    {W, W}
This highlights that the symbols are not uniformly spread across the alphabet.
A-F holds about a third of the symbols (32 of 100), as does P-Z (36), even though A-F covers only 6 letters and P-Z covers 11.
I found this out once when trying to use the ticker symbol to load-balance market data across 3 servers.
Check the distribution of the data before you use a simple key.
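One way to run that check is to count how many symbols each server would receive under a candidate split. The sketch below is only an illustration: the function name Get-BucketLoad, the three letter ranges, and the short symbol list are assumptions, not part of the original setup.

```powershell
# Hypothetical helper: count how many symbols fall into each first-letter range.
function Get-BucketLoad {
    param(
        [string[]] $Symbols,
        # Assumed naive split of the alphabet into three letter ranges.
        [string[]] $Ranges = @('A-I', 'J-R', 'S-Z')
    )
    foreach ($range in $Ranges) {
        # Build a regex such as ^[A-I]; -match is case-insensitive by default.
        $pattern = '^[{0}]' -f $range
        $count = @($Symbols | Where-Object { $_ -match $pattern }).Count
        [pscustomobject]@{ Range = $range; Count = $count }
    }
}

# Illustrative list only; use (Get-Content './sample') for real data.
$symbols = 'AAL', 'BARC', 'BP', 'HSBA', 'RIO', 'SHEL', 'VOD'
Get-BucketLoad -Symbols $symbols
```

Applied to the first-letter counts in the table above, an even-looking alphabetical split of A-I, J-R and S-Z would put 48, 30 and 22 symbols on the three servers.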
Oddly, the second letter is a better key:
Count Name Group
----- ---- -----
    6 A    {A, A, A, A...}
    3 B    {B, B, B}
    4 C    {C, C, C, C}
    5 D    {D, D, D, D...}
    3 E    {E, E, E}
    6 G    {G, G, G, G...}
    4 H    {H, H, H, H}
    3 I    {I, I, I}
    2 K    {K, K}
    8 L    {L, L, L, L...}
    6 M    {M, M, M, M...}
    7 N    {N, N, N, N...}
    2 O    {O, O}
    4 P    {P, P, P, P}
    8 R    {R, R, R, R...}
    9 S    {S, S, S, S...}
    7 T    {T, T, T, T...}
    2 U    {U, U}
    6 V    {V, V, V, V...}
    2 W    {W, W}
    2 X    {X, X}
    1 Z    {Z}
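The command behind the second-letter table is not shown; presumably it is the same pipeline with the substring offset moved along by one. A minimal sketch under that assumption follows. The function name Get-SecondLetterCount is made up for illustration, and lines shorter than two characters are skipped so that Substring does not throw.

```powershell
# Hypothetical helper: group symbols by their second character (index 1).
function Get-SecondLetterCount {
    param([string[]] $Symbols)
    $Symbols |
        Where-Object { $_.Length -ge 2 } |       # skip lines too short to have a second letter
        ForEach-Object { $_.Substring(1, 1) } |  # take one character starting at index 1
        Group-Object -NoElement |                # keep Count and Name, drop the Group column
        Sort-Object Name
}

# Usage against the sample file:
# Get-SecondLetterCount -Symbols (Get-Content './sample')
```

In the two tables above, the largest first-letter group has 15 symbols while the largest second-letter group has 9, which is why the second letter spreads the load more evenly.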