I've updated the 0705 ATM data for 5 days using the new values, created a new combined data model, and printed all stats. I wrote 4 new Python programs to do this (in the code archive):
ExtractMaskedATMData.py : extract my old data by path/orbit/block-range which contains cf, an,
and ca values, and has been masked by hand.
ExtractNadirATMData.py : extract new nadir channel data.
CombineOldNewATMData.py : combine old and new data by averaging new values near old lat/lon
coordinates.
MakeNewATMDataModel.py : combine updated data into one file.
The combined data model (in the text archive) contains the following columns:
1 : day
2 : data source by path/orbit/block-range
3 : flight time in day
4 : lat
5 : lon
6 : old rms value
7 : new rms value
8 : number of new points averaged
9 : average lrms value from ATM data (this should be used rather than log(new rms))
10 : rms of lrms values from ATM data
11 : cf
12 : an
13 : ca
Fields 9 and 10 merit some explanation. Several new ATM rms values were averaged at each lat/lon point, so I could calculate the log(rms) of each new point, the average lrms at each coordinate, and the rms of the lrms point spread at that coordinate. Although the average lrms is not very different from the log of the average rms, it is technically more accurate, and should be used as the primary input variable in the data model. Also, the rms of the lrms values of the new ATM values (rmsl) indicates the natural variance in the input data at each averaged point, and should be used as a standard against which all model estimations are compared. In the programs below, I have also calculated the total rmsl of the combined model, and of each sub-model.
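As an illustration of fields 9 and 10, here is a minimal sketch (not the code in the archive). It assumes a natural log and takes "rms of the lrms point spread" to mean the rms deviation about the mean lrms; the function name and the sample values are hypothetical.

```python
import math

def lrms_stats(rms_values):
    """Return (mean lrms, rms spread of lrms about that mean) for one coordinate."""
    # Assumption: lrms is the natural log of each rms value.
    lrms = [math.log(v) for v in rms_values]
    mean_lrms = sum(lrms) / len(lrms)
    # Assumption: rmsl is the rms deviation of the lrms values about their mean.
    rmsl = math.sqrt(sum((x - mean_lrms) ** 2 for x in lrms) / len(lrms))
    return mean_lrms, rmsl

# Hypothetical new rms values that were averaged at one lat/lon point.
sample = [10.0, 20.0, 40.0]
mean_lrms, rmsl = lrms_stats(sample)

# For comparison: the log of the average rms, which field 9 replaces.
log_of_mean = math.log(sum(sample) / len(sample))
print(mean_lrms, rmsl, log_of_mean)
```

The mean of the logs is never larger than the log of the mean, so the two inputs differ most where the rms values at a coordinate are widely spread.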
I've also plotted, and printed stats, for all 5 day subsets, and all 21 path/orbit/block subsets of this model. See the images archive for PNGs of the entire model, and for specific days, and for specific data sources (path/orbit/block-range). Note that the subsets are in very specific locations in image space, and are not generally distributed throughout. Also note that many subsets have fairly narrow ranges of lrms. See the Plot log file (in the text archive) for the stats of each subset. In general, the columns are min, max, mean, stddev.
I've updated 2 C programs from last year and completely rewritten the model comparison program (including the lrms estimation routine, although it is in principle identical to the previous algorithm). These programs are
PNGNewPlot3d.c : 3d plotting base
PlotNewATMModel.c : Plot data model or subsets (see log for examples of use)
CompareNewATMModels.c : Perform knockout estimation of model subset (see log for
examples of use)
I've done the knockout estimation analysis for all 5 days (but not yet for POB subsets). As predicted, the number of points estimated, and the quality of the estimation, depended on the maximum permitted distance parameter: the limit (in image space) on the average distance from the point to be estimated to the nearest 4 points of the remaining model. I ran these at 10, 1, 0.1, 0.01, and 0.001 "units" (based on actual Cf, An, and Ca values), and for all cases the value 0.1 allowed estimation of almost 100% of all unknown values. See the Compare log for results.
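The distance test described above can be sketched as follows. This is not the routine in CompareNewATMModels.c. It assumes Euclidean distance in (cf, an, ca) space, that the limit applies to the mean distance to the 4 nearest remaining points, and that the estimate is the plain average of their lrms values; the function name and the sample points are hypothetical.

```python
import math

def estimate_lrms(target, model, max_dist, k=4):
    """Estimate lrms at target (cf, an, ca) from the k nearest model points.

    model is a list of ((cf, an, ca), lrms) pairs with the knocked-out points
    already removed. Returns None when the point cannot be estimated.
    """
    if len(model) < k:
        return None
    # Sort remaining points by Euclidean distance in image space, keep k nearest.
    nearest = sorted((math.dist(target, p), lrms) for p, lrms in model)[:k]
    mean_dist = sum(d for d, _ in nearest) / k
    if mean_dist > max_dist:
        return None  # too far from the remaining model to estimate
    # Assumption: unweighted average of the neighbours' lrms values.
    return sum(v for _, v in nearest) / k

# Hypothetical remaining model: four close points and one distant outlier.
model = [
    ((0.0, 0.0, 0.0), 1.0),
    ((0.1, 0.0, 0.0), 2.0),
    ((0.0, 0.1, 0.0), 3.0),
    ((0.0, 0.0, 0.1), 4.0),
    ((5.0, 5.0, 5.0), 100.0),
]
est_loose = estimate_lrms((0.0, 0.0, 0.0), model, max_dist=0.1)
est_tight = estimate_lrms((0.0, 0.0, 0.0), model, max_dist=0.01)
print(est_loose, est_tight)
```

Tightening max_dist only removes points from the estimated set; it never changes the estimate of a point that still passes the test.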
In general, the results were all "good" and pretty much the same, with some variation between days. I'm not yet doing the cross and histogram plots that you are, but am looking at the min, max, mean, and stddev of actual and estimated lrms values, the differences between the two, and the total rms error (of the delta lrms), and these all look good. By "good" I mean that the estimated lrms min, max, mean, and stddev values are similar to the actual values, and that the delta lrms means are fairly small.
In particular, the estimation for 070511 does not look any worse than other days, and, for example, looks better than the result for 070503. The "best" result is for 070508. Again, I am looking at the estimated lrms min, max, mean, stddev, and the delta lrms mean.
Also as expected, performing random subset knockouts (for 10, 25, 50, and 75 percent of all model points) and estimations produces better and more consistent results. The results are pretty much the same, and good, even out to 3/4 of all points missing. Overall, it looks like the rms error of any random estimation is about equal to 2.5x the natural lrms variation in the original ATM input data.
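A random knockout split of this kind can be sketched as below. This is a minimal illustration, not the code in the archive; the function name, the seed argument, and the use of Python rather than C are assumptions.

```python
import random

def random_knockout(points, fraction, seed=None):
    """Split points into (knocked_out, remaining) with about fraction knocked out."""
    rng = random.Random(seed)  # seeded so that a given run can be repeated
    n_out = round(len(points) * fraction)
    out_idx = set(rng.sample(range(len(points)), n_out))
    knocked_out = [p for i, p in enumerate(points) if i in out_idx]
    remaining = [p for i, p in enumerate(points) if i not in out_idx]
    return knocked_out, remaining

# Hypothetical model of 100 point ids, with 25 percent knocked out.
knocked, remaining = random_knockout(list(range(100)), 0.25, seed=1)
print(len(knocked), len(remaining))
```

Each knocked-out point would then be estimated from the remaining list only, so that no point contributes to its own estimate.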
Links to zip archives:
Data model and sub-model plots
New combined data model and program logs
Python and C programs
The next thing for me to do is to plot the actual and estimated values in 2d, to see if the results are really as good as the numerical stats suggest. I also want to compare the 2d plots of random knockouts with the daily and orbital estimations. I'll have more info online later this afternoon.
There are now new programs, images, and logs available in the archive links above (all cumulative from this morning).
I've modified the Compare program to save the lrms, estimated lrms, and delta lrms arrays to a file. Another program (PlotATMKnockout) reads these and creates a 2d scatter plot. These plots are pretty basic (no labels), and go from 0.0 to 6.75 on both axes. Although this program also calculates a PPMCC r value, I'm not sure that this is what you really want, since it detects any linear relationship at all (except for pure horizontal and vertical), whereas you want an indication of how close to the x=y line all the values fall.
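To make the difference concrete, here is a small sketch (not part of PlotATMKnockout) that compares the PPMCC r value with the rms of the delta lrms, which measures departure from the x=y line directly. The function names and the sample values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rms_delta(actual, estimated):
    """RMS of (estimated - actual); zero only when every point lies on x=y."""
    n = len(actual)
    return math.sqrt(sum((e - a) ** 2 for a, e in zip(actual, estimated)) / n)

# Hypothetical values: the estimate is perfectly linear in the actual lrms,
# but lies well off the x=y line.
actual = [1.0, 2.0, 3.0, 4.0]
biased = [2.0 * a + 1.0 for a in actual]

r_biased = pearson_r(actual, biased)
rms_biased = rms_delta(actual, biased)
print(r_biased, rms_biased)
```

Here r is at its maximum even though every estimate is wrong, while the rms of the deltas is large, so the two numbers are best read together.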
The knockout plots for each of the 5 days, at a distance of 0.1 in image space, look bad. (See the images such as "atm0705_knockout_070503_0.100.png" for examples.) The one for 070503 is the worst, maybe because it has the most values (over half of the model). The characteristic shapes of all seem mostly symmetrical(?), and are constrained to a square(?) of midrange lrms, with the majority of values (highest density) being at the lower left corner, where there is the best correlation. The next best correlation is at the upper right corner (i.e. both low and high lrms values). There are also dense "wings" of values that extend from the two corners along the sides of the square, indicating that a small range of actual lrms values has been estimated by a large range (vertical "wing"), or that a large range of lrms values has been estimated by a small range (horizontal "wing"). Other intermediate values are scattered somewhat uniformly throughout the interior of the "square". The r values are misleading, due to the vertical and horizontal wings, although most are around 0.53. The day that looks the best visually (070510) also has the best r value (about 0.83).
When the estimation distance parameter is reduced to 0.01, the results get worse, both visually and in r values. (See the images such as "atm0705_knockout_070503_0.010.png" for examples.) This is somewhat counter-intuitive, but is probably due to removing the range of lrms values that contribute to the better correlation, leaving only the dense "blob" at the lower left corner, and part of the wings.
070503 at dist = 0.1 and 0.01
070510 at dist = 0.1 and 0.01
The random knockout (10% of all model points) plots and correlations are much better, as expected, and have r values near 0.8 at the 0.1 estimation distance. (See the images such as "atm0705_knockout_rand0.10_0.100_1291683444.png".) These plots show dense elliptical clusters at lower lrms values, and less dense elliptical clusters at higher lrms values, with intermediate values spread out in "shallower" wings than the daily data. This improvement over the daily knockouts suggests that the non-homogenous distribution of the daily data removed something "essential" from the model which could not be estimated from the remainder, whereas the random knockouts were more distributed, and so could be better estimated by the remaining points.
When the estimation distance parameter is reduced to 0.01, and then to 0.0025, the random knockout plots and r values also get worse, although not by as much as the daily data. Again, elimination of estimated points leaves the blob at lower left, and traces of shallow wings along both sides.
random 10% at dist = 0.1, 0.01, and 0.0025