A) Do we modify management sections of the ensemble/config XML?

Chris proposed to isolate management sections to different XML documents. I remember that Dirk objected to the proposal. I'd like to know whether Chris and Dirk agreed on this issue at the offline meeting in Dublin. ((T.Yoshie, 2005 Nov 15))

Chris and Dirk disagree as to whether this should be in QCDml. I believe we agreed to make this optional so that I could exclude it from my QCDml IDs and Dirk could include it in his. Can Dirk corroborate this? I think it may be worth revisiting the discussions briefly to make sure everyone is happy with the outcome. ((C. Maynard, 2005-11-15))

Yes, I still agree that we could make those entries optional. ((D. Pleiter, 2005-11-15))

I've carefully read the emails circulated in May-June. I'm sorry to bring this matter up again, but I think it is not a good idea to make the management sections optional. They contain important information such as "history of documents/config" and "number of configurations" which is useful for users. If we make the management sections optional, such information will not be generally available. I agree, as Balint once pointed out, that management information in QCDml1.1 is not sufficient for managers, and local implementations of management of IDs are necessary. Nonetheless, I think that removing all parts of the management sections from our standard is too drastic. I prefer keeping a "minimal set of management information" as standard. I understand that Chris wants to remove the management sections mainly because Chris wants to write a valid ensemble XML ID before any configurations are generated or submitted to ILDG. In order to satisfy Chris's requirement, I ask you to consider Dirk's proposal in (qcdml 457): If one foresees an operation 'create' the QCDml could be made valid before it is made public. Alternatively, adding the attribute minOccurs="0" to the corresponding element would also do the job.
Both modifications would be backward compatible. ((T. Yoshie, 2005-11-17))

I am in agreement with Chris that management should be made optional. The reasons are varied. I am making commentary based on an example markup found on the QCDml web page.

First, for an ensemble: http://www.ph.ed.ac.uk/ukqcd/community/the_grid/QCDml1.1/UKQCD1.xml -- action: add. Why is this useful? It says that someone has added some 829 configurations. Since it does not detail pointers to the individual 829 configurations, what use is it? The number of configurations currently belonging to an ensemble can in principle be counted with an XPath query, which will be more accurate than a hardwired number. Also, this management information has to be manipulated every time data is uploaded, which is a pain.

Looking at the configuration metadata http://www.ph.ed.ac.uk/ukqcd/community/the_grid/QCDml1.1/config010170.xml I see actions such as: generate. Is this useful? The fact that the data exists surely means it has been generated. The fact that it has been added by Chris may be useful, but it may be useless, especially if it is legacy data and it is no longer known who generated it. Further, for non-legacy data, it may not be easy to know who generated the data. A computer may supply a getuid() type function which allows the code to generate a UID on the fly. However, transcribing this to a name may not be so straightforward.

Incidentally, I also have problems with and . For implementation, I'd like to know how the code producing the metadata (perhaps running on a parallel computer) should find out what computer it runs on. The machine may have a uname() call to find out about the processor, and a gethostname() call to find out its hostname (of course, for a parallel machine each node may have a different hostname). Of course none of these calls are guaranteed to exist (see the case of the QCDOC).
So all this data would have to be munged into every config document after the fact (or by including the snippet of XML into the code somehow). This makes the creation of this tag awkward, which means people won't use it.

The problems with the tag are as follows: code may know its name and its version. What does the mean from the point of view of the code? Date of compilation? Date of running? Date of writing? Also, what about multi-layer code which uses many packages, each with its own version?

The precision tag is redundant and ambiguous. Is it the precision of the code or of the produced configs? The precision of the binary data is now encoded in the Binary Format XML record and is not needed here. I believe that ALL the metadata a config needs is encoded in the and the which should be moved into . ((B. Joo, 2005 Nov 17))

Balint says:
>> -- action: add. Why is this useful? It says that someone
>> has added some 829 configurations. Since it does not detail pointers
>> to the individual 829 configurations what use is it?

No, we have pointers. As far as I remember, revision number N in N of the ensemble XML document corresponds to the revision number N of the config XML document. Thus if you add 829 configs in revision N=1, you will find the corresponding 829 config XML documents with N=1.

I understand that your headache lies in generating XML IDs automatically. I can compromise a bit, though I still think that it is not a good idea to remove the management sections or make them optional. I think and are mandatory. If we remove the management sections, we have to agree on "what to do when the ILDG database is changed".
Use cases and possible actions are
1) when you submit a new ensemble: just register the ensemble XML ID with a Metadata Catalogue (MDC)
2) when you want to remove the ensemble: just remove the ID from the MDC
3) when you want to change the ensemble XML ID: remove the old one, register the new one
4) when you submit a configuration: register the config XML ID with an MDC and place the config at a TURL
5) when you remove the configuration: remove both the config XML ID and the config itself
6) when you want to replace a configuration: just replace the configuration, you don't have to change the config XML ID
(Oh, this is a revival of old discussions, someone sighed)

Please note that users cannot trace back or detect any modifications at all if some kind of is not mandatory. Are you sure that's OK? (I suppose that Middleware does not record any modifications. A log of modifications will be maintained only locally.)

In addition, I'm worried about two things: one is the contact person and the other is the freshness of ensembles/configurations. Users may want to ask something of someone who is responsible for ensembles/configurations. If we remove the management sections completely, it will be difficult for users to find a person. The collaboration name is the only clue. Is that OK? ILDG will be used for a long period of time. A large number of ensembles will be archived. Don't you want to filter out very old ensembles when you search for configurations? It is impossible to do this, because no time stamps are recorded in XML IDs. I suppose that no time stamps are maintained by Middleware.

Summary of this comment: May I ask you to consider whether we need some information on revision, person, and time stamp when we remove archiveHistory. ((T.Yoshie, 2005 Nov 18))

OK. First point: regarding the 'revision' and 'add' tags in the ensemble metadata and configuration metadata. There is a presupposition that one is able to add many configurations at the same time, by a kind of 'batch add'.
Suppose I add, say, 10 configs. Now my ensemble ID is at revision 1, and my 10 configs have revision 1. Now I add another 10 configs. My ensemble ID document gets its revision changed to revision 2. Does my insertion tool then have to ensure that the configuration XMLs of the second batch all have revision 2 in them? I believe this is what Tomoteru is suggesting above. I think this is not right. I thought that the 'revisions' tag in 'management' is a count of the total number of revisions in the ID document, not some kind of key between the config ID and the XML ID. That would be bad.

First, it is counterintuitive. Someone looking at the documents for the second batch of configs will see that they have 'revision 2' and will wonder what has changed since 'revision 1'.

Secondly, it is entirely possible that there is no batch add. To add, say, 100 configs, I'll run a script to add a single config in a loop 100 times. I use the script to add in things like the CRC checksum, the implementation info and anything else I can't produce with my code. In this case:
i) my ensemble ID document will fill up with 100 "Balint added 1 configuration" tags
ii) by the end of the process my ensemble ID will be at revision 100 (and using the above logic, my individual configurations will have revisions ranging from 1 to 100)
Is this really what is desired? I would advise against using the revision tag in the ensemble to point to the configs affected by that revision action.

I am willing to compromise that the ensemble archivalHistory be made optional, but would like to require and its 'actions' should refer to the ensemble ID document only (i.e. an 'add' action should refer to the person adding the ensemble ID rather than adding configuration IDs). This would make the archiveHistory actions have the same semantics as the configuration actions (i.e. add, replace, remove). This way, in adding a config ID, I don't need to touch the ensemble ID.
I am also willing to make a compromise in the case of the configuration IDs. Clearly contact info is desirable (although it is possible that during the lifetime of a configuration the contact will move between collaborations). I would propose the following compromise:
i) The 'generate' revision action be removed from into the with something like: T3E-900 epcc Edinburgh Alpha processor UKQCD FORTRAN 16.8.3.1 1997-04-04T16:20:10Z Joe Bloggs The Gauge Generation Company UKQCD
ii) Move the out of to the same level as precision, or into the and
iii) make optional.

This is (to my mind) sensible because the generation information is grouped with the other information to do with generation (code, machine, etc). After step ii) the only thing remaining in is the and the tag which is the number of elements in archiveHistory. Archival history can then be maintained by the actual archive (perhaps even in a separate XML document or logging database, or in the document if you prefer). In the case of a separate system, when the XML file is retrieved by a query, the archiveHistory could be reconstructed and pasted back into the ID document (filling out the optional tag), or not (if the particular MDC implementation doesn't support this). The archiveHistory tag then doesn't need to be pre-generated by the user for his document to still be valid, and is potentially normalised out into an optional piece.

The rest can be generated by a script, I suppose, but my gloomy prediction is that there will be lots of 'boilerplate information' there and there may be many errors as people forget to update their scripts. (( B.
Joo, 2005 Nov 18))

As already pointed out in May (see qcdml-452), instead of dropping the element, a redefinition of the sub-elements would also take into account the concerns which have recently been raised again:
1) The archiveHistory in the ensemble XML document and configuration XML document stores only those operations which are related to the ensemble XML document or the configuration XML+binary files, respectively. With this rule none of the examples provided by Balint in qcdml-566 would apply anymore. In particular, when adding any configurations the archiveHistory of the ensemble XML document should NOT be extended, because this is not a change of the ensemble XML document itself.
2) In this case the element , , as mandatory. For revisions, starting with revision=1, we count up when some change is made to the ensemble XML ID.
b) add a mandatory element , in order to record "when generation of the ensemble started". I think this is necessary. If archiveHistory is made optional and some group really drops archiveHistory, the ensemble ID would have no time stamp.
c) archiveHistory is optional. "add" refers to "submission of the ensemble ID to the ILDG", "replace" refers to "replacement of the ensemble ID", as proposed by Balint and Dirk. Of course, we remove .
d) When one removes the ensemble, just remove the ensemble ID from the MDC. If you use the optional archiveHistory, you may keep the ID for a little while in the MDC with "remove" added.

The management section of the ensemble ID looks like 2 UKQCD Clover NF=2 2002-04-04T13:20:10Z 1 add Chris Maynard University of Edinburgh 2003-01-10T15:20:10Z Submit this ensemble ID to ILDG 2 replace Chris Maynard University of Edinburgh 2004-02-18T15:20:10Z Modification is made on this ensemble ID

For the management section of the configuration ID, a) keep .
The config ID looks like 3 1 add Chris Maynard University of Edinburgh 2002-04-24T10:25:52Z Submit this config ID and config itself to ILDG 2 replace Chris Maynard University of Edinburgh 2002-05-24T10:25:52Z config is replaced, config ID is unchanged 3 replace Chris Maynard University of Edinburgh 2002-05-24T10:25:52Z config ID is replaced, config remains unchanged T3E-900 epcc Edinburgh Alpha processor UKQCD FORTRAN 16.8.3.1 1997-04-04T16:20:10Z Chris Maynard University of Edinburgh 2001-04-24T10:25:52Z .... single 2632843688 ....

Note that I have changed . As is still mandatory, this move is done for aesthetic reasons. In an earlier phase of this project I would have accepted this without discussion.

Concerning the new element : it is not good to replicate information within the schema. Furthermore, if this element is going to be mandatory, why not keep the in the configuration XML mandatory? The submitter is only supposed to add an element with equal to add. This element is essentially equivalent to . Adding any further elements would be optional. I would accept dropping . ((D. Pleiter, 2005 Nov 21))

I understand that Balint proposes to make optional. If we accept this proposal, only action (XML element in this case ) will be mandatory, while the add, replace and remove actions are optional. Another way to do this is to mandate the generate action in , while other actions in are optional. I have no preference, as long as "time stamp and person" are mandatory. (( T. Yoshie, 2005 Nov 21 ))

Sorry, in my previous comment I meant "generate" and not "add". Maybe I am missing something, but it seems that ... and generate... are equivalent. If is mandatory (to be more precise: a list of non-zero length) then contributors have to insert only one element (I do not think that we should worry that this element could have unequal "generate"). No contributor is forced to extend this list when the actions "add", "replace" or "remove" are performed.
For this reason I think we should change the schema in the following way:
- ensemble: is a list of optionally zero length
- configuration: is a list with at least one element
- ensemble+configuration: element is removed from managementActionType
- configuration: revisionActionType is extended by action "replace"
Element should remain where it is. ((D. Pleiter, 2005 Nov 21))

It is a very good idea to mandate at least one in the configuration . Shall we apply the same idea to the ensemble ? If we do so, the element proposed above is not necessary and the schema will be backward compatible. ((T. Yoshie, 2005 Nov 22))

I disagree. It stops the tag from being optional, which would put us back exactly where we started. For the sake of backward compatibility, I'll go along with an optional with the semantics as suggested by Dirk for both the ensemble and the configuration. As long as it's optional and I can keep it out of the documents the user has to generate, and I don't have to touch the user-generated metadata, I'll remain reasonably happy. Likewise, with Dirk's semantics, I don't have to touch the ensemble metadata when I add a config. That's great.

However, for to be truly optional, must also become optional (as it essentially counts the number of items in ). This is not purely an aesthetic issue. I want to keep my archiveHistory subsystem entirely separate from my user-submitted IDs. This reduces the harm I can do if I screw up with an update somewhere. It also decouples the archiveHistory from everything else. This will help me write simpler services, which has recently become a big issue for me, since I am the one who has to write them all for my collaboration. It also means I can later incrementally add sophisticated logging and revision control without having an impact on the existing system.
So I shift my position to the following compromise (moving towards what Dirk has suggested but not all the way):
- ensemble: optional list of zero length
- config : optional list of zero length
- ensemble and config : optional (for backward compatibility with existing documents only)
- crcChecksum to stay where it is (also for backward compatibility)
If a configuration is questionable, the relevant collaboration can be contacted from the tag in the ensemble ID.

On the issue of 'boilerplate information' which may be ambiguous (implementation, algorithm), I still disagree with Dirk on philosophical grounds as to whether it is a good thing that we now get people to record more information than they did previously. However, the need for backward compatibility forces me to accept that we shouldn't mess with it. ((B. Joo, 2005 Nov 21))

I disagree with Balint's proposal that is optional. We have to mandate particularly for the config ID. Suppose that a contributor replaced a configuration and he/she didn't change the config ID. How are users informed that the configuration has been replaced? does not refer to the number of tags in . The counter is increased by one if the configuration or the ID is changed. (The submitter does not have to add anything to .) With this counter, users can notice that something has changed.

I compromise as follows (this is the same as what I proposed before; some items were proposed by Balint):
- ensemble: optional list of zero length add new mandatory element in when one uses the optional , archiveHistory/elem/revision is also optional
- config : optional list of zero length add new mandatory element in has to include date and person information
- ensemble and config : mandatory
- crcChecksum to stay where it is (also for backward compatibility)
I understand what bothers Balint. But we have to take care of the user's point of view. (( T.
Yoshie, 2005 Nov 22))

If the tag is mandatory, I cannot achieve my goal of not having to edit the user-contributed XML, since I have to update the tag. In this case I give up completely and suggest adopting Dirk and Tomoteru's scheme of Nov 21, since it is more backward compatible than mandating new elements:
- ensemble: is a list of optionally zero length
- configuration: is a list with at least one element
- ensemble+configuration: element is removed from managementActionType
- configuration: revisionActionType is extended by action "replace"
Element should remain where it is.
- Dirk's semantics for archiveHistory actions (i.e. reference self only)
At least the coupling between ensemble IDs and configuration IDs has been removed. (( B. Joo, 2005 Nov 22))

The was the biggest problem I had, but that has now gone, which makes things a little easier. The reason for our current difficulty is the definition of an ensemble. The definition as "a collection of gauge configurations" is fine until we realise that this is not static. Ideally the metadata would be non-mutable, but this is a problem when the definition of the ensemble changes. I think our difficulty really comes from this. We say this data is ensemble A. Now we change the data (i.e. extend the ensemble, or remove some cfgs or whatever) and now we want to be able to say, "Actually what we said was A is no longer true, what we now mean by A is this". This is where we have got into a mess. Really ensemble A is still A, but now we have A1 which is similar to A but has some changes. I don't think that the ensemble metadata is the right place for this. Currently our concept of an ensemble is not "this data" but "any data generated with these parameters". So we cannot generate non-mutable metadata describing this because it is going to change. We are attempting to construct a fudge between, on one hand, non-mutable metadata (i.e. what we say is A is A, which is not the same as saying we can add information about A, i.e.
measurements), and keeping track of changes to an ensemble. How to reconcile these two ideas? I don't know. We are close to a fudge which may work, but is this the best solution? I can accept a fudge, but we are going to run into the same difficulties when we think about measurements. What about measurements on half the ensembles, or a later measurement that suggests that the autocorrelations are longer, and thus measurements are more widely separated, or if someone is binning the data, or someone is measuring every 5 rather than every 10 trajectories? I think we should be clear about what the problem is, then decide if a fudge is OK, or if we need to tackle what the real problem is, the lack of a clear definition of an ensemble. I appreciate I am not proposing a solution here, but let's take a moment's thought before we accept a fudge which may bite us later. ((C. Maynard Nov. 23))

I would like to suggest a different compromise:
a) ensemble:::
- list is allowed to be of zero length
- element will be removed
- element will be removed
b) config:::
- list contains at least one element which is supposed to have revisionAction="generated"
- element will be removed
- element will be removed
c) ensemble:::
- will be removed
d) config:::
- will be removed
e) config:::
- remains unchanged
f) config:::
- enumeration extended by 'replaced'

I would like to explain why I suggest this compromise:
- I think that Balint's request to change the schemata such that it is possible to avoid automatic changes of the user documents by the MDC is reasonable.
- Once the ceases to be mandatory (except for the information on when the configuration has been generated), the elements do not make so much sense anymore.
- The recently suggested element is only important when searching for new ensembles. Given the rather small number of ensembles generated, I think a separate announcement mechanism (e.g.
ILDG sessions during lattice conferences, status reports on ILDG workshops) would be more appropriate to provide this information.
- These changes are clearly not backward compatible, but the changes are such that already existing documents can be made to conform by deleting some elements. Needless to say, I sincerely hope that this is the last non-backward-compatible change for a long period of time! ((D. Pleiter, Nov. 25))

I think that most of the "removed" items in Dirk's compromise should be "optional". My collaboration and I want to record some of them. Namely, I propose
a) ensemble:::
- list is allowed to be of zero length
- element will be optional
- element will be removed
b) config:::
- list contains at least one element with revisionAction="generate"
- element will be optional
- element does not exist
c) ensemble:::
- will be optional
d) config:::
- will be optional
e) config:::
- remains unchanged
f) config:::
- optional enumeration extended by 'replace'; 'replace' means either ID or config
(I also replaced "generated" -> "generate", "replaced" -> "replace".)

If all of you agree with this compromise, I also agree, although it is quite unsatisfactory. If the optional "replace" action is not recorded in the config XML, users cannot detect replacement of configurations (without downloading the configuration itself). I'd like to ask you to remember and accept this weak point before you agree with this compromise. ((T. Yoshie, Nov. 26))

I agree with what Dirk is suggesting. To address Tomoteru's concern I ask that you consider the following: If the data part of the configuration changes, the crcChecksum will change to reflect this. If the LFN in the config changes, the LFN in the ID document changes to reflect this. If the lattice size changes, the ensemble ID will change to reflect this. Precision is also recorded in the metadata. I am not sure off the top of my head about endianness (but both endianness and precision changes would change the crcChecksum).
So basically any change to the config should be accompanied by a corresponding XML ID replacement. However, a change in the document cannot be detected without communication. Either the user has to check back in the database to see if his/her copy is still current, or the database needs to inform the user (through, say, a registered email) that a document has changed. The revisions tag alone cannot solve this. Given that other things will change, the revisions tag is redundant, I think. So I think Dirk's scheme is a good one, and I very happily agree to this latest compromise. ((B. Joo, Nov 26))

Request for an optional tag in ensemble metadata document ... ... MILC_COARSE_ENSEMBLE ...

Purpose and motivation: The label comes from the potential desire of a collaboration to annotate their ensemble.

The context: MILC has worked hard to create several ensembles with different input parameters but roughly the same lattice spacing. They then have a classification of their ensembles as being COARSE, FINE, SUPERFINE etc. The purpose of the ensemble label is to mark the ensemble as belonging to one of these classes. It is entirely separate from the issue of observables and accurate definitions of lattice spacings, and is similar in spirit to a CVS tag. The ensemble label would not necessarily have any meaning to anyone outside the generating collaboration. Since not all collaborations may wish to use this label feature, and also for backward compatibility, it should be optional (minOccurs=0). The content of the tag would be a string containing no whitespace (i.e. a single word). ((B.Joo, Dec 1,05))
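
The two constraints in the label request (the element may be absent, and its content is a single word with no whitespace) can be illustrated with a small check. This is only a sketch under stated assumptions, not part of QCDml: the element name `label` and the root element `ensemble` are hypothetical stand-ins, because the actual tag name is not given above, and rejecting a repeated element assumes the usual schema default of at most one occurrence.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element name: the real tag name is not stated in the request.
LABEL_TAG = "label"


def read_ensemble_label(xml_text, tag=LABEL_TAG):
    """Return the ensemble label, or None if the optional element is absent.

    Raises ValueError if the element occurs more than once (an assumption,
    see the lead-in) or if its content is not one whitespace-free word.
    """
    root = ET.fromstring(xml_text)
    matches = root.findall(f".//{tag}")
    if not matches:
        # Optional element (minOccurs=0): a document without it is fine.
        return None
    if len(matches) > 1:
        raise ValueError(f"expected at most one <{tag}> element")
    text = matches[0].text or ""
    # Strict reading of "no whitespace": leading or trailing blanks are
    # rejected as well, and so is empty content.
    if not re.fullmatch(r"\S+", text):
        raise ValueError(f"<{tag}> must contain a single word, got {text!r}")
    return text
```

For instance, a document carrying `MILC_COARSE_ENSEMBLE` in that element returns the string unchanged, a document without the element returns `None`, and content such as `two words` is rejected. In a schema the same restriction would more naturally be written as a pattern on the element's string type.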